Integrating Historical Data
(PAS)
When working with historical data, and especially when integrating such data, it is important to understand the difficulties involved and how historical content differs from most other content integrated on the Web today. This understanding matters not least because it helps the reader to appreciate the possibilities and limitations of the Cluster and the material it contains.
Challenges: Historical Uncertainty and Inconsistency
Most integration and Web services projects have data which is very clearly structured, highly consistent, and complete. Think, for example, of inventories such as that at Amazon.com: at least in general the data for each book at Amazon is consistent, and we would not expect any information to be missing. In contrast, historical material, and particularly material from the early Middle Ages, is often very inconsistent, imprecise, or absent entirely. This manifests itself in several ways, as discussed in the following sections.
Inconsistency of Representation
One obvious issue is consistency across resources, for example in spelling. Consider the recipient of the grant recorded in S 611. eSawyer describes this charter as 'King Eadwig to Beorhtnoth, his faithful princeps'. However, if we search for 'Beorhtnoth' as recipient, we do not find our charter. Why not? Because PASE represents this individual as Byrhtnoth, since it follows different conventions for normalising personal names. Such problems are legion, particularly for kings, who have so many possible names. Consider the king of England who was defeated by the Danes in the early eleventh century and who is now known as 'the Unready'. What should we call him if we want to search across the different projects? 'Athelred'? 'Æthelred'? 'Æthelred the Unready'? 'Athelred the Unready'? 'Æthelred II'? If one project calls this king 'Athelred' and the other 'Æthelred', then we cannot use one search across both projects. Similarly, one resource may refer to a manuscript as 'Public Record Office, Charter Rolls, 8 Henry IV', another resource may call the same manuscript 'National Archives, C 53/176', and a third 'National Archives, C 53/176 (olim PRO Ch.R. 8 Hen. IV)'.
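One common way to approach this kind of mismatch is an authority table that maps every known variant to a single shared key, so that a search on any variant can be rewritten before it is sent to each project. The following is only a minimal sketch of that idea: the keys (`person:aethelred-ii`, `person:byrhtnoth`) and the function names are invented for illustration and are not identifiers used by PASE, eSawyer or the Cluster.

```python
import unicodedata
from typing import Optional

# Hypothetical authority table: each shared key lists the variant forms
# quoted in the text. The keys themselves are invented for this sketch.
AUTHORITY = {
    "person:aethelred-ii": [
        "Athelred",
        "Æthelred",
        "Æthelred the Unready",
        "Athelred the Unready",
        "Æthelred II",
    ],
    "person:byrhtnoth": ["Beorhtnoth", "Byrhtnoth"],
}


def _fold(name: str) -> str:
    """Normalise Unicode form, letter case and whitespace for lookup."""
    normalised = unicodedata.normalize("NFC", name).casefold()
    return " ".join(normalised.split())


# Reverse index: folded variant -> shared key.
VARIANT_TO_KEY = {
    _fold(variant): key
    for key, variants in AUTHORITY.items()
    for variant in variants
}


def canonical_key(name: str) -> Optional[str]:
    """Return the shared key for a name, or None if the form is unknown."""
    return VARIANT_TO_KEY.get(_fold(name))
```

With such a table, `canonical_key("Beorhtnoth")` and `canonical_key("Byrhtnoth")` resolve to the same key. The table still has to be compiled and maintained by hand, and a bare form such as 'Athelred' may in practice be ambiguous between several individuals, which a flat lookup like this cannot express.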
Uncertainty and Imprecision
An enormous problem with searching and integrating early medieval content is that of uncertainty and imprecision, a good example of which is dates. It would be ideal if we could search for charters which were issued during a particular period, or search for manuscripts which were written at a particular date. In some cases this is straightforward: Charter S 960, for example, survives in an authentic copy and was issued in A.D. 1023. In other cases, the charter does not specify the date on which it was issued, and so we must infer it from the content: for example, we know that S 984 must have been issued some time in the period between 1020 and 1022 (if indeed it is genuine). This is still fairly straightforward: if we allow a range of possible dates then we can still find our charter. But what of S 238, which claims to have been issued in A.D. 663 but is probably an unreliable copy of a charter which was perhaps really issued in 693? Consider also S 294a, which claims to have been issued in 814 but has a witness-list which could have been from either 855 or 844; either way it is almost certainly a forgery which was produced much later than any of these dates. Similar problems hold for the dates of the manuscripts in which the charters are preserved, since these can be dated with even less precision, and scholarly opinion can vary widely as to the date of a given manuscript (as indeed can the ways of expressing those dates). Once again, then, any resource which includes dates must somehow cope with this imprecision and uncertainty. Even with only one resource, this content is difficult to search or sort. How does one design a user-interface to search for S 294a by date? How would you sort the charters discussed in this paragraph by date? The problems are compounded when different projects are integrated: not only might the projects make different decisions about searching and sorting, but they may even give different dates for the same charter (for example).
This makes cross-searching by date almost impossible.
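To make the difficulty concrete, here is one possible way of modelling such dates: each charter carries a list of candidate date ranges, each labelled with its basis, and a search matches if any candidate overlaps the query. This is a sketch only, not a description of how any of the projects stores its dates. The class and function names are invented, and the decision to treat every candidate as equally searchable is an assumption that a real interface would have to revisit.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DateClaim:
    """One candidate date range (inclusive years A.D.) and its basis."""
    earliest: int
    latest: int
    basis: str

    def overlaps(self, start: int, end: int) -> bool:
        # Two inclusive ranges overlap when each starts before the other ends.
        return self.earliest <= end and start <= self.latest


# Candidate dates taken from the examples discussed in the text.
CHARTERS: Dict[str, List[DateClaim]] = {
    "S 960": [DateClaim(1023, 1023, "stated in text")],
    "S 984": [DateClaim(1020, 1022, "inferred from content")],
    "S 238": [
        DateClaim(663, 663, "stated in text"),
        DateClaim(693, 693, "possible date of lost original"),
    ],
    "S 294a": [
        DateClaim(814, 814, "stated in text"),
        DateClaim(844, 844, "witness-list"),
        DateClaim(855, 855, "witness-list"),
    ],
}


def find_by_date(start: int, end: int) -> List[str]:
    """Return charters with at least one candidate date in [start, end]."""
    return sorted(
        sawyer
        for sawyer, claims in CHARTERS.items()
        if any(claim.overlaps(start, end) for claim in claims)
    )
```

Under this model S 294a is found by a search for 814, for 844 and for 855, which may or may not be what a user expects, and none of these is the date at which the forgery was actually produced. The sketch also says nothing about sorting, and it does not help when two projects record different candidate dates for the same charter.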
Incompleteness
Another form of imprecision is incompleteness of data. Once again this takes several forms: in the simplest case, it may be that we simply do not know a piece of information about a particular record (the date of issue for a charter, for example). However, the problem is again compounded in the Cluster. As discussed in Cluster Content, not all the resources contain complete information for the entire corpus. ASChart, for example, only includes charters issued before A.D. 900 (this alone is more complex than it seems when we consider the preceding discussion about uncertainty and dating). This means that any search for a type of clause will not return any results for charters issued after A.D. 900. If we search the Cluster for charters containing curses issued by King Cnut (who reigned 1016–1035), then we will find no results. This is not because King Cnut never included curses in his charters, but because the charters of Cnut are not included in ASChart and therefore the Cluster has no knowledge of any clause types for his charters.
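The distinction between 'no matching charters' and 'no data for this period' can be made explicit in code. The sketch below assumes, following the text, that clause information is available only for charters issued before A.D. 900; the names `Coverage`, `clause_coverage` and `explain_empty_result` are hypothetical and do not correspond to anything in the Cluster itself.

```python
from enum import Enum

# From the text: ASChart only includes charters issued before A.D. 900.
CLAUSE_DATA_BEFORE = 900


class Coverage(Enum):
    FULL = "full"        # the whole query period has clause data
    PARTIAL = "partial"  # only part of the query period has clause data
    NONE = "none"        # no clause data exists for the query period


def clause_coverage(start: int, end: int) -> Coverage:
    """Classify how far clause data covers the inclusive period [start, end]."""
    if end < CLAUSE_DATA_BEFORE:
        return Coverage.FULL
    if start >= CLAUSE_DATA_BEFORE:
        return Coverage.NONE
    return Coverage.PARTIAL


def explain_empty_result(start: int, end: int) -> str:
    """Say what an empty clause search over [start, end] does and does not show."""
    coverage = clause_coverage(start, end)
    if coverage is Coverage.NONE:
        return "No clause data is held for this period; absence cannot be inferred."
    if coverage is Coverage.PARTIAL:
        return "Clause data is held for only part of this period."
    return "No matching clauses were found in the data held for this period."
```

For the reign of Cnut, `clause_coverage(1016, 1035)` returns `Coverage.NONE`, so an interface could report that the question cannot be answered rather than showing an empty result list. The simple year comparison inherits all the dating problems already described, since a charter of uncertain date may fall on either side of the boundary.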
Instability
A further problem with historical content in general, and the Cluster in particular, is stability of data. In many integrated systems, a record is relatively stable: a book on Amazon is unlikely to change its title, for example. However, as already discussed here, historical data is constantly changing as scholars revise their views. Even the texts of charters change as new editions are produced (compare, for example, the older texts of ASChart with the newer ones of eSawyer). Perhaps more significantly, even Sawyer numbers themselves can occasionally change: as scholars re-edit the texts, they may decide that what was once considered one charter, and therefore has one Sawyer number, is better regarded as two separate charters, and so the second is given a new number. This happens fairly rarely in practice, but it does happen: if you use Browse Charters by Sawyer Number in eSawyer then you will find several of the form S 103a, S 103b, and so on. To a large extent the Cluster depends on any one charter having the same Sawyer number in every project, but the very notion of what constitutes a single charter changes, and if different projects give different answers to that question, then integration of that data becomes very much more complex.
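Because suffixed numbers of this kind are used as keys across projects, it helps to parse them into a structured form before comparing or sorting them. The following sketch assumes that a reference consists of 'S', an optional space, a number, and at most one lowercase letter, as in the examples quoted in this report. Real references may take other forms, and the function names are invented for illustration.

```python
import re
from typing import Tuple

# Assumed pattern: "S", optional whitespace, digits, optional single letter.
_SAWYER = re.compile(r"^S\s*(\d+)\s*([a-z]?)$")


def parse_sawyer(ref: str) -> Tuple[int, str]:
    """Split a reference such as 'S 103a' into (103, 'a')."""
    match = _SAWYER.match(ref.strip())
    if match is None:
        raise ValueError(f"Not a recognised Sawyer reference: {ref!r}")
    return int(match.group(1)), match.group(2)


def format_sawyer(key: Tuple[int, str]) -> str:
    """Render a parsed key in one consistent form, e.g. 'S 103a'."""
    number, suffix = key
    return f"S {number}{suffix}"


def base_number(ref: str) -> int:
    """Return the numeric part only, ignoring any letter suffix."""
    return parse_sawyer(ref)[0]
```

Sorting with `key=parse_sawyer` places S 103a and S 103b before S 238 (a plain string sort would not), and `base_number` groups the suffixed references under 103. Whether two references with the same base number should be treated as one charter or two remains a scholarly decision that each project must record explicitly; parsing the label does not settle it.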
Advantages
Despite the challenges discussed above (and many others, as given elsewhere in this report: see especially Limitations), there are some advantages that the Cluster has over traditional web services and integrated projects. For one, the quantity of data in the system is relatively small. There are only about 2,000 Anglo-Saxon charters, as defined by the Cluster, and the amount of information we wish to exchange about them is quite small as well. Despite issues of scholarly instability, the content is also relatively fixed for the long term: data is generally stable for months at a time, and often for years, whereas a source such as Amazon adds and updates very many records every day. Security is not a great issue, as noted above: the Cluster does not need to provide functionality to change information, and all of its data is publicly available. Finally, at this stage the Cluster itself and all of the constituent projects are based at the same place (CCH), meaning that we as developers have detailed knowledge of all the resources, and that we can modify them as necessary.