Digging into Data: the research work plan
Gathering and gazetting results & building tools
1. The first stage of the project is to mine the EE correspondence dataset (letters together with associated documents and extensive annotation) for geographical data, i.e. place names at every level from country down to personal address. This information occurs in multiple locations: it may record where a letter was written, where it was sent and the route it took, and it also appears as numerous geographical references within the texts of the letters themselves and their annotations.
2. From the collated data EEP will build diachronic and multilingual thesauri of word-forms. At present, location names exist in EE in up to nine languages (English, Dutch, French, German, Greek, Italian, Latin, Russian and Spanish), and come from historical documents covering three centuries (17th–19th centuries). The thesauri will thus include period-specific abbreviations and colloquialisms for locations across Europe, Asia, the Americas and Oceania in a significant range of European languages. Existing location metadata, currently structured to city level, will be used to analyze the data for occurrences of place-name elements, and the results of that analysis will be used to build a table of geographical token words. Systems applying Soundex and variants (stemming the "key" linguistic term for each location "digital object") will then be used to create additional fields of token variants and stems.
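As an illustration of how a phonetic key could populate a field of token variants, the following is a minimal sketch of standard American Soundex in Python. It is not the project's implementation: the function names `soundex` and `group_by_key` and the sample spellings are assumptions made for the example, and plain Soundex was designed for English surnames, so the multilingual material described above would probably need the "variants" the plan mentions.

```python
from collections import defaultdict

# Standard American Soundex digit groups; vowels, h, w and y carry no code.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(name: str) -> str:
    """Return the four-character Soundex key for a name ('' if it has no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != prev:
            key += code
        # h and w do not separate two letters that share a code; vowels do.
        if c not in "hw":
            prev = code
    return (key + "000")[:4]


def group_by_key(forms):
    """Group spellings that share a Soundex key (hypothetical helper)."""
    groups = defaultdict(list)
    for form in forms:
        groups[soundex(form)].append(form)
    return dict(groups)


# Illustrative spellings only; not taken from the EE dataset.
variants = group_by_key(["Leiden", "Leyden", "Rotterdam", "Roterdam"])
```

Here `variants` groups "Leiden" with "Leyden" and "Rotterdam" with "Roterdam", which is the kind of candidate grouping a later authority-list check would need to confirm.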
3. The results of this analysis will be used to build a "crawler" to work recursively through the EE dataset, identifying instances that match our tokens. The project will need to implement a concordance structure (at sentence and paragraph level) that captures a set number of words before and after each match; this context can then be used to refine matches for accuracy and disambiguation.
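The concordance idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it handles single-word tokens, uses a flat word window rather than the sentence and paragraph boundaries the plan calls for, and the names `Hit` and `concordance` are invented for the example.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple


class Hit(NamedTuple):
    token: str         # the matched word as it appears in the text
    before: List[str]  # up to n words preceding the match
    after: List[str]   # up to n words following the match


def concordance(text: str, tokens: Iterable[str], n: int = 5) -> Iterator[Hit]:
    """Yield each token match together with n words of context on either side."""
    words = re.findall(r"\w+", text)
    wanted = {t.lower() for t in tokens}
    for i, word in enumerate(words):
        if word.lower() in wanted:
            yield Hit(word, words[max(0, i - n):i], words[i + 1:i + 1 + n])


# Invented sample sentence; not taken from the EE dataset.
sample = "I wrote to you from Paris last week and shall leave for Leyden soon."
hits = list(concordance(sample, ["Paris", "Leyden"], n=2))
```

Each `Hit` keeps the surrounding words alongside the match, so a later disambiguation step could, for instance, check whether "from" or "for" precedes the place name.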
4. The finished EE token list will be mapped against standardized authority lists, such as the Getty Thesaurus of Geographic Names (TGN) for geographical names, in order to provide a public gazetteer of locations. (EEP will submit any new or altered data to TGN under its contributions scheme, the Getty Vocabularies Program.) This will enable the project to build and test methods, gazetteers and tools that allow users to identify, define and link more data from EE’s and other datasets. The results will also be fed into Improvise and used for overlay mapping (static and dynamic systems).
5. The following elements are envisaged:
- Build a geographical gazetteer from EE geographical metadata mapped to the Getty TGN.
- Build a multilingual thesaurus of place names, including colloquial forms.
- Build a parsing system and run the list of location names from the combined gazetteer against the dataset:
  - combine and enhance existing open-source text processing libraries in conjunction with EE's existing MySQL infrastructure
  - incorporate a concordance structure (at sentence/paragraph level) to take n words before and after the match
- Build a geographical research interface (with a geographical suggestion system) to enable users of EE to access the dataset by aspects of location information (e.g. location mentioned in a document, or a document written from a location, or a person born there).
- Digitize key items from the Bodleian Libraries' rich holdings of historical maps, to give users of EE visual representations of the movement of correspondence across time and space.
- Write up the results in terms of methodologies.
- Build tools for geographical parsing of datasets, with additional functionality for historical documents of the 17th–19th centuries and a range of European languages (extensible for other historical datasets).
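To make the parsing element above more concrete, here is a minimal Python sketch of running a multilingual thesaurus against a text and resolving each variant to one canonical gazetteer entry. Everything in it is an assumption for illustration: the `THESAURUS` entries are a hand-written sample rather than EE or TGN data, the function names are invented, and a real system would sit on the MySQL infrastructure and text processing libraries mentioned in the list.

```python
import re

# Hand-written sample: canonical name -> variant forms in several languages.
THESAURUS = {
    "Vienna": ["Vienna", "Wien", "Vienne", "Vindobona"],
    "The Hague": ["The Hague", "Den Haag", "La Haye", "Haga Comitis"],
}


def build_matcher(thesaurus):
    """Return a variant -> canonical index and one regex covering all variants."""
    index = {
        variant.lower(): canonical
        for canonical, variants in thesaurus.items()
        for variant in variants
    }
    # Longest variants first, so multi-word forms win over shorter overlaps.
    alternatives = sorted(index, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(a) for a in alternatives) + r")(?!\w)",
        re.IGNORECASE,
    )
    return index, pattern


def find_places(text, thesaurus=THESAURUS):
    """Return (offset, surface form, canonical name) for every match in text."""
    index, pattern = build_matcher(thesaurus)
    return [
        (m.start(), m.group(0), index[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


# Invented sample sentence; not taken from the EE dataset.
matches = find_places("Je pars de La Haye pour Wien.")
```

Keeping the surface form next to the canonical name means the original period spelling is preserved while searches and maps can still work from a single authority entry.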
6. The outcomes listed above will be made publicly available. By their nature they are open to extension chronologically and linguistically, and can be used as templates for the analysis and mapping of a wide range of other data.