OA TEI-XML DH on the WWW; or, My Guide to Acronymic Success

Data management is not a sexy term. Despite its importance to modern research, it remains unlikely to crop up naturally in a conversation between historians. Indeed, in the UK, the Arts and Humanities Research Council has replaced the term entirely with the equally uninspiring 'technical plan', an oddly self-conscious attempt, it seems, to make the concept more modern and more in keeping with its growing, and much appreciated, support of the digital humanities.

Data management within the Digital Humanities (DH) takes many forms. For my own research, it takes the form of Extensible Markup Language (XML), validated through the Text Encoding Initiative (TEI), and made open access (OA) on the World Wide Web (WWW). OA TEI-XML DH on the WWW.

It begins with a transcription, or rather, a very large number of very poorly organised transcriptions. These, at best, are literal, plain-text reproductions of the character-for-character textual information contained within the newspaper—for example, the following snippet of text from the Glasgow Herald, 28 January 1822:

Lima is what an Englishman would call a dirty colonial town; 6000 inhabitants is the outside of its population; the whites are about 1200 Europeans, and not as many born in the country; the rest are blacks and mulattoes. Of the mulattoes there are eight various shades. I never saw such a wretched herd as Lima incloses; more poverty and misery than in any town of the size in the world.–There are some few splendid houses, but the rest so disgustingly filthy that you cannot go into them without being covered with dust and vermin. The streets, as in all American towns, run at right angles; a powerful stream of water in each; the town is capable of great improvement. The climate must be bad, for the whole population look as if turned out of the hospitals for a day’s air; a half-born race–;melancholy in their faces. On the whole coast of Peru we made the same observation; the ague very general, with fever–their habits and manner of living increase the evil; thin clothing and vegetable diet–nights very cold,

The textual information is seemingly complete, and the record can be located by searching for individual terms and phrases, either within a long, multi-record document (a single file containing many transcriptions, using page breaks or headings for delineation) or amongst a large number of individual records (a directory containing a collection of individual files, each holding a single plain-text transcription). But this method of data management relies upon many assumptions: namely, that the user will either be searching for a particular record, the textual content of which is largely known, or that they will be searching for a particular type of record that can be identified through a predictable set of search terms, something easier said than done.

When read by a human, it is clear that this article forms part of a lengthy, highly unflattering account of Lima, Peru, and the wider Spanish Empire. It falls neatly under what may be termed 'The Black Legend' of Spanish imperialism. Yet, the article was initially obtained by searching for the terms 'Indian' and 'America' as part of a project on portrayals of the American (United States) frontier in the Scottish press. While both these terms are present in the text, the article was not relevant to that project. Indeed, my most cleverly devised searches resulted in the following two outcomes:

  • only about 40% of the articles that my searches (of my own database) returned were relevant to my current project
  • at least three very important (argument-shaping) articles were only identified during routine tagging of my database backlog—a number I expect will rise as I endeavour to encode my remaining 400 transcriptions.

Likewise, if I were to make my transcriptions freely available online in plain-text form, it is highly unlikely that anyone studying the 'Black Legend', or any of the very intriguing claims made by its author regarding the character or trajectory of the Spanish Empire, would stumble upon it (if you are such an individual, it is record 367).

I could organise my individual records into thematic files or directories, but as any given article could be interpreted in many different ways, and these themes overlap in a seemingly infinite number of combinations, such a methodology is inefficient if not wholly ineffective. Instead, the inclusion of XML elements, identifying key aspects of the text in a standardised way, is allowing me to pull records together in new and unexpected ways. By using a standardised (if imperfectly aligned) set of encoding rules (TEI), my databases are instantly mineable and manipulable by any other academic, even if I become unavailable.

The above text has now been encoded for the following TEI elements (a simplified sketch of the result appears after the list):

  • font-weight and style (hi)
  • text justification (hi)
  • line breaks and soft-hyphenation (lb)
  • bibliographical information for the periodical (title, pubPlace)
  • publication information for the issue (when)
  • whether it is a reprint or original text (derivation)
  • placement of the article within the issue (pb and cb)
  • topics (keywords)
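For illustration, the sketch below shows roughly how the record above might look once encoded. The element names are standard TEI, but the arrangement is simplified and the values are illustrative rather than a verbatim copy of my own database (it may need adjustment to validate against a full TEI schema):

    <TEI xmlns="http://www.tei-c.org/ns/1.0">
      <teiHeader>
        <fileDesc>
          <titleStmt>
            <title>Untitled article on Lima</title>
          </titleStmt>
          <publicationStmt>
            <p>Transcribed for research purposes.</p>
          </publicationStmt>
          <sourceDesc>
            <bibl>
              <!-- bibliographical information for the periodical -->
              <title>Glasgow Herald</title>
              <pubPlace>Glasgow</pubPlace>
              <!-- publication information for the issue -->
              <date when="1822-01-28"/>
            </bibl>
          </sourceDesc>
        </fileDesc>
        <profileDesc>
          <textDesc>
            <!-- whether it is a reprint or original text -->
            <derivation type="reprint"/>
          </textDesc>
          <textClass>
            <!-- topics -->
            <keywords>
              <term>Peru</term>
              <term>Spanish Empire</term>
            </keywords>
          </textClass>
        </profileDesc>
      </teiHeader>
      <text>
        <body>
          <!-- placement of the article within the issue -->
          <pb n="4"/>
          <cb n="2"/>
          <!-- hi elements (not shown here) capture bold, italic and centred text -->
          <p>Lima is what an Englishman would call a dirty colonial town;
          6000 inhabitants is the outside of its popula<lb break="no"/>tion;
          the whites are about 1200 Europeans ...</p>
        </body>
      </text>
    </TEI>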

In due course, it will also be encoded for:

  • named or otherwise referenced individuals (persName) and groups (org)
  • named or otherwise referenced locations (place)
  • previous publication information (dateline)

Of these, datelines (the textual clues explaining whence the material was obtained, including the location and periodical title) have proved the most difficult to capture within the standard TEI structure and will need to be researched further.
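By way of illustration, the named-entity layer planned above might eventually look something like this (a sketch only; the ref values assume corresponding place records, with matching xml:id attributes, elsewhere in the file):

    <p><placeName ref="#lima">Lima</placeName> is what an Englishman would
    call a dirty colonial town; ... On the whole coast of
    <placeName ref="#peru">Peru</placeName> we made the same observation ...</p>

Individuals (persName) and groups (org) would be tagged in exactly the same fashion wherever they appear in the text.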

Once encoded, XML’s little sister, XSLT, comes into play. XSLT, or Extensible Stylesheet Language Transformations, is, more or less, a CSS (Cascading Style Sheets) for XML. For those unfamiliar with website construction, both XSLT and CSS tell your computer how to display (or save) the information in your data (XML or HTML) file. This might mean displaying only certain fields, such as an HTML list of bibliographic records that excludes the text itself, or filtering on particular values, such as creating a new XML database containing only those articles that were originally printed in the Glasgow Advertiser and that discussed Native Americans between 1790 and 1800.
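To give a flavour of the second case, the sketch below shows the sort of stylesheet involved. It assumes a deliberately simplified record structure (a flat <articles> wrapper whose <article> children carry <title>, <date when="...">, <keywords> and the transcription in <text>) rather than my actual schema:

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="xml" indent="yes"/>

      <!-- Build a new database containing only the matching records -->
      <xsl:template match="/articles">
        <articles>
          <xsl:copy-of select="article[
              title = 'Glasgow Advertiser'
              and keywords/term = 'Native Americans'
              and substring(date/@when, 1, 4) &gt;= 1790
              and substring(date/@when, 1, 4) &lt;= 1800]"/>
        </articles>
      </xsl:template>
    </xsl:stylesheet>

Run against the master file, this produces a self-contained XML database of just those records, ready to be refined or transformed again.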

This filtering functionality is the most common use I have for XSLT at the moment, as I flit from project to project, re-purposing and expanding collections I have gathered over the past ten years. But XSLT can serve very different aims as well. Recently, Shawn Graham, a scholar and a gentleman from Carleton University, enquired about my database’s construction in an effort to transform it into a CSV (comma-separated values) file for analysis with MALLET, a topic-modelling programme with which I had only a vague familiarity. In ten minutes, and 40 lines of XSLT code, I transformed my XML database (a simplified sketch appears after the list) by

  • removing all the commas from the article text
  • creating a header row listing all the possible data fields in my database
  • populating the data fields for the individual records
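That transformation, in a much-simplified sketch using the same illustrative record structure as above (the real stylesheet handles many more fields):

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text"/>

      <xsl:template match="/articles">
        <!-- header row listing the data fields -->
        <xsl:text>title,date,text&#10;</xsl:text>
        <!-- one row per record -->
        <xsl:for-each select="article">
          <xsl:value-of select="title"/>
          <xsl:text>,</xsl:text>
          <xsl:value-of select="date/@when"/>
          <xsl:text>,</xsl:text>
          <!-- strip commas from the article text so they cannot break the rows -->
          <xsl:value-of select="translate(text, ',', '')"/>
          <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
      </xsl:template>
    </xsl:stylesheet>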

This allowed him to create some (frankly wonderful) topic models based on my data, which he kindly shared with me. Similar files could also be created for analysing the relationship between certain themes or publishers and an article’s word count, its placement within an issue, its dateline, or its other themes: connections I may never have considered relevant to my current project but which, through a small amount of extra effort (and the kindness of strangers), have opened up countless possibilities.

Moreover, this extra effort should not really be considered extra to a scrupulous historian. It is merely one way of formalising those processes that should be part and parcel of any analysis—namely making darn sure I know the precise individual or place a word is referring to, and the possible implication that this has for the source at large, even if it is only mentioned in passing. Too easily we are blinkered by the aims of our current project, often by pressing grant or institutional deadlines, and fall into strategic note-taking and analysis. By accepting that my data management techniques will take slightly longer than traditional annotation, and seem slightly tangential to outside viewers, I hope that I am facilitating a work-flow that is both academically rigorous and socially responsible.

It is thus with the aim of improving close reading, rather than merely facilitating distant reading, that I have instituted TEI-lite, or basic XML encoding, in my first-year digital history module this spring, the subject of my next post.
