In 2014, I began to convert my newspaper transcriptions–stored haphazardly in a variety of plain-text and Word documents, as well as Access, OneNote and Evernote databases–into XML. The value of structured data was immediately apparent, but the means for structuring it were not. My first attempts were as haphazard as my previous storage solutions. XML has the virtue, and the danger, of being almost infinitely customisable. Through my qualitative research, I had developed a clear sense of what was important, and of how those important aspects related to one another. I created my own vocabulary and hierarchy of values. In its original form, a typical entry was as follows:
```xml
<line>This morning, some dispatches were received </line>
<line>from Boston, which were brought over in the </line>
<line>Sally, Capt. Walker from Liverpool. They are </line>
<line>dated the 16th November, and contain an ac<sh/></line>
<line>count of the arrival of several ships from New<sh/></line>
<line>foundland, and some from Halifax; that pro<sh/></line>
<line>visions of all kinds were very plentiful and rea<sh/></line>
<dateline year="1788" month="January" day="16" city="Boston" province="Massachusetts" country="United States" international_transport="The Sally, Captain Walker"/>
<mention city="Halifax" province="Nova Scotia" country="Canada"/>
<mention province="Newfoundland" country="Canada"/>
<mention city="Halifax" province="Nova Scotia" country="Canada"/>
```
As this work progressed, I became increasingly aware of the potential value of *conforming* to a more standard XML schema. I had long been aware of the Text Encoding Initiative, but had been uncomfortable with its chosen hierarchies, which seemed better suited to literary criticism than historical analysis. They felt ill-suited to the project at hand, particularly for encoding non-standard elements of 18th-century newspapers. The dateline, in particular, could not be encoded in an intuitive way within the constraints of the P5 schema. Nor was I satisfied with developing a project-specific addendum to the standard hierarchies. In the end, I bent my dataset to fit the basic contours of the framework. This led, in some cases, to egregious over-nesting of metadata fields and creative interpretations of some of the core elements, but I was largely satisfied that my data was encoded in a way that was both meaningful and externally comprehensible. The most recent form of my XML is as follows:
```xml
<authority>Collected from a digital image by <persName role="collection" n="mhbeals">M. H. Beals</persName>. Transcribed by <persName role="transcription" n="mtempler">Max Templer.</persName></authority>
<licence n="https://creativecommons.org/licenses/by/4.0/legalcode">This work is licensed under a Creative Commons Attribution 4.0 International (CC-BY) License</licence>
<title n="newspaper_ga">Glasgow Advertiser</title>
<authority>Text within the public domain.</authority>
<country n="2635167">United Kingdom</country>
<interaction type="none" active="corporate" passive="world"/>
<p><pb/>This morning, some dispatches were received from Boston, which were brought over in the Sally, Capt. Walker from Liverpool. They are dated the 16th November, and contain an account of the arrival of several ships from Newfoundland, and some from Halifax; that provisions of all kinds were very plentiful and reasonable.</p>
```
As you can see, the core elements have largely been retained, including the publication and transcription details, the locations mentioned within the text, and keywords based on close reading. The encoding of the dateline remains, however, a thorny issue–one that has yet to be satisfactorily resolved. Converting to the P5 framework also required the inclusion of the profileDesc element, with its sub-elements describing the nature and provenance of the text. This was both an annoyance and an opportunity. The questions posed here, such as constitution and domain, had not been part of my original analytical framework. However, as one reuser of my data has demonstrated, they are potentially very valuable analytical hooks–if they are consistently and rigorously encoded.
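For readers unfamiliar with these sub-elements, a minimal sketch of the profileDesc in question might look like the following. The interaction line is taken from my encoding above; the other attribute values are illustrative of the choices P5 asks for, not taken from my files:

```xml
<profileDesc>
  <textDesc n="newspaper">
    <!-- how the text was delivered: here, print -->
    <channel mode="w">print</channel>
    <!-- constitution: is this a single whole text, a composite, or fragments? -->
    <constitution type="single"/>
    <derivation type="original"/>
    <!-- domain: the social context of the text's production -->
    <domain type="public"/>
    <factuality type="fact"/>
    <interaction type="none" active="corporate" passive="world"/>
    <purpose type="inform"/>
  </textDesc>
</profileDesc>
```

It is precisely because values such as constitution and domain are drawn from small controlled lists that they can serve as analytical hooks, provided they are applied consistently across a corpus.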
This was, in the grand scheme of my research, only a small adjustment to my analytical framework. The next step on my journey, however, forced me to fundamentally reconceptualise my entire approach. At this year’s European Social Science History Conference, I completed my emigration from the Ethnicity and Migration network to that of Spatial and Digital Humanities, a process I had begun in 2014. Over the course of those four days, the means and rationale for converting (or at least regularly translating) my XML database into RDF (Linked Data) became apparent. While not exactly a Road to Damascus moment, the structure of RDF suddenly made sense to me. What had previously seemed like an equal but distinct modelling solution now appeared to solve many of the ontological conflicts I had had with TEI.
Over the next few weeks, I set about transforming my database. It quickly became apparent that the existing transformers had the very unappealing side-effect of rendering my structure less, rather than more, specific. If I were to solve my ontological crises, I would have to do so from scratch. Fortunately, unlike TEI, RDF is exceedingly accepting of Frankenstein ontologies: combinations of existing schemas stitched together into a project-appropriate collection of descriptors. Scouring the web for existing vocabularies, I developed an eclectic mix of terms with which to describe my data. Most were simple to map onto my existing structure with little or no revision. One, however, made me pause: FaBiO, the FRBR-aligned Bibliographic Ontology.
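To give a sense of what such a Frankenstein ontology looks like in practice, here is a hypothetical Turtle sketch mixing terms from Dublin Core, PRISM and GeoNames to describe a single transcription. The `ex:` identifiers and the literal values are placeholders of my own invention, not records from my dataset; only the GeoNames identifier (2635167, the United Kingdom) is carried over from my XML above:

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix prism:   <http://prismstandard.org/namespaces/basic/2.0/> .
@prefix ex:      <http://example.org/transcriptions/> .

ex:some_transcription
    dcterms:title       "A placeholder article title" ;
    prism:genre         "news" ;
    # a GeoNames URI rather than a free-text place name
    dcterms:spatial     <http://sws.geonames.org/2635167/> ;
    dcterms:license     <https://creativecommons.org/licenses/by/4.0/> .
```

Because each property is identified by a full URI, terms drawn from different vocabularies can sit side by side in the same description without the schema conflicts that a single rigid hierarchy imposes.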
FaBiO provides a vocabulary for describing how different versions of an object relate to one another. At the most abstract level is the work, followed by expressions thereof, manifestations of the expressions and exemplars of the manifestations. It was at this point that I began to question how exactly I had conceived my transcription project on a philosophical level. On the one hand, I could have jury-rigged the vocabulary to meet my needs, but somehow that did not feel robust.
To put it more concretely, I had two “physical” versions of each newspaper article: the digital image and my transcription. A simple connection between the two would seemingly suffice. However, that digital image was merely an exemplar–a single example–of a hard-copy, original version of that work. Moreover, that original hard copy might be only one variant of that issue of the newspaper, depending on the number of runs or editions that had been undertaken. So at this stage I had four versions to connect: the issue of the newspaper, its particular edition or run, the digital image of that run and my transcription of that digital image. The tricky part was the next level up.
XML, or rather TEI, had not provided a sufficient solution to my dateline problem. Nor had it provided a simple way to cross-reference reprints (similar articles printed by different newspapers) within my database. I had forced a reference through a hidden element at the start of each article, but this was hardly satisfactory. FaBiO, on the other hand, offered an intriguing solution. What if I conceived of the work as the archetypal version of that newspaper article, the Platonic ideal of the meme?
Thus, the meme (Work) was realized by its appearance in a newspaper issue (Expression). Both of these were abstract notions rather than concrete, physical objects: the idea of the article on the one hand, and the idea of the form or wording it would take in a particular newspaper on the other. This expression was then embodied (given concrete form) in the specific edition or version of that issue (Manifestation). This was the first physical object I could cite or point to directly through a catalogue record. This edition then had several exemplars, namely the digital image that was created from it and the transcription I had taken in XML.
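The chain described above can be sketched in Turtle using the FaBiO classes and the FRBR Core properties they align with. The `ex:` identifiers are hypothetical placeholders for illustration, not identifiers from my database:

```turtle
@prefix fabio: <http://purl.org/spar/fabio/> .
@prefix frbr:  <http://purl.org/vocab/frbr/core#> .
@prefix ex:    <http://example.org/transcriptions/> .

# The meme: the article as an abstract, archetypal idea
ex:meme              a fabio:Work ;
    frbr:realization ex:issue_article .

# The wording the article took in a particular newspaper issue
ex:issue_article     a fabio:Expression ;
    frbr:embodiment  ex:edition_article .

# The specific printed edition or run: the first citable, physical object
ex:edition_article   a fabio:Manifestation ;
    frbr:exemplar    ex:digital_image , ex:xml_transcription .

# Two concrete copies taken from that edition
ex:digital_image     a fabio:Item .
ex:xml_transcription a fabio:Item .
```

Crucially, reprints in different newspapers can then be linked simply by attaching further Expressions to the same Work, rather than through hidden cross-references buried in each article.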
At this stage, this may not seem like a revolutionary description of digitized materials. Choices regarding editions, and indeed individual copies (with unique marginalia and deformations), have long been part of debates surrounding historical newspapers. Likewise, those studying folklore and modern internet memes continue to struggle with how to conceive of and describe the Platonic ideal of Cinderella or the ‘Why you no’ face.
But on a personal level, the very practical act of making my research data intelligible to the semantic web has fundamentally changed and, for the first time, solidified my understanding of the philosophical underpinnings of periodical reprints. It’s amazing what necessity can give birth to.
**Thumbnail image courtesy of Erich Ferdinand**