Over the past year, I have been transferring my collection of newspaper transcriptions—held in a variety of formats—into a single XML database. The reason I chose XML, rather than a series of nested folders or any other organizational format, was its ability to be reformatted, and re-purposed, in a variety of ways without duplicating or damaging the source material. The process has not been painless, and is as yet far from complete, but as my own experience may prove useful to others undertaking large-scale transcription, or considering reviving old data into new forms, I will lay it before you along with some initial reflections.
The State of My Collection
In January 2014, my collection of newspaper transcriptions contained roughly 1000 separate newspaper articles, largely, but not exclusively, from Scottish newspapers, and largely, but not exclusively, from the late-Georgian period, 1783-1843. The individual articles existed in the following formats:
- Word (.doc) Documents
- A box of photocopies
- Word (.docx) Documents
- An Excel (.xls) File
- Evernote Entries
- Text (.txt) Files
- A Basic XML Database
A further 500-750 articles existed on my computer as images, with limited metadata and without machine-readable transcriptions.
The Evolution of My Collection
These formats do not represent discrete sub-collections but rather a series of evolving, overlapping corpora, mirroring changes in my research interests, my skills as a (digital) historian, and my access to, and understanding of, digital tools over the past nine years. The evolution was, roughly, as follows:
2005-2007: Partial Word (.docx) Transcriptions
As part of my postgraduate research into domestic perceptions of emigration in southern Scotland, I began taking notes from provincial newspapers in the Scottish Borders. The articles were transcribed
- sometimes in part, sometimes in whole, with […] to indicate omitted sections
- without regard for line breaks
- with regard for text justification
- with inconsistent regard for font weight, italics or specialised typeface, such as ‘small caps’
All articles from a single newspaper title were collected in a single Word Document, with page breaks between individual articles. Each page was headed with basic bibliographic information, including the date and page number(s).
Articles were chosen by subjective browsing, based on my perception of their relevance to my current project:
- The article is explicitly about emigration or settlement
- The article contains information that would be useful to those considering emigration, such as living and working conditions in Britain’s major settler colonies and the United States
- The article contains anecdotes or humour relating to the United States or settler colonies that might colour readers’ perceptions of those places as suitable destinations for emigration
- The article is a shipping advertisement for places of settlement, specifically one mentioning emigration or passage
- The article is miscellany that made me laugh and garnered dagger eyes from the librarian on duty
Machine-readable transcriptions were taken either directly from original hard copies or from printed microfilm and microfiche images. Once the articles were transcribed, the printed images of the text were discarded (with some exceptions, rediscovered at a later date).
This method of collection and transcription served me well during my PhD, allowing me to collect and understand a representative sample of qualitative information, in context, from a relatively small collection of newspaper titles. My aim was to understand the range and relative frequency of discussions in the regional newspaper press, and this type of note-taking was sufficient for that purpose. However, my naive failure to document my search parameters, and indeed the incompleteness of my transcriptions, made their reuse in other contexts dubious.
2007-2008: Access 2007 Transcription Database
In an attempt to better code my transcriptions, and thereby create sub-collections of articles, letters and other primary material on certain themes, I transferred my existing Word transcriptions into a self-designed Access database. This included fields for bibliographic information as well as basic ‘topic’ and ‘entity’ coding, though bluntly applied to the entire source rather than to particular phrases or paragraphs. The method of selecting articles remained subjective browsing, largely derived from hard-copy and microform holdings. No images of the original texts were retained, though Access did have some functionality for attaching image files. The original 2005-2007 Word documents were retained but not updated.
The fact that I retained the Word files, rather than discarding them in the upgrade, proved extremely fortunate. Although later versions of Access had rich-text editing, much of the original formatting was lost in the transfer from one database to another. In the absence of photographic copies, my Word approximations proved useful guides for subsequent research.
2008-2009: OneNote Database
At the conclusion of my PhD, I began to work with a much wider range of Scottish newspapers, including those digitised by the British Library and the Scotsman. The use of these online databases as well as the purchase of a flat-bed scanner meant it was now possible to quickly obtain and retain images of the original text alongside machine-readable transcriptions.
Here my collection methodology shifted rapidly. I worked through the Glasgow Herald (only a very short run of which had been digitised) as I had my Borders newspapers: browsing page by page. The other titles, however, were interrogated using keyword search facilities, with a wide range of terms that I had discovered through my previous browsing. Although I had learned the lesson of recording my search parameters to ensure consistency throughout the research process, I had not yet fully understood their relevance and value to subsequent uses of my data. These Boolean strings have, therefore, been largely lost. My collection now consisted of data from two quite different collection methodologies, with little documented evidence of which pieces had come from which projects or search parameters. While a general knowledge of this exists in my memory, that is a precarious storage medium.
The rapid retrieval of these images also made keeping up with machine-readable transcriptions difficult. At the suggestion of a colleague, I began to use Microsoft OneNote, which offered a rudimentary OCR transcription of images, as well as intuitive image clipping and date-stamping. Individual articles were stored in individual notes, titled with bibliographic information and containing the image and transcription. Although URLs were collected automatically by OneNote, these were session-based addresses and could not be reused at a later date, a fact I learned to my great annoyance some six months later.
Transcriptions from the Access Database were not integrated into the OneNote system, but maintained separately until 2009.
2009-2013: Evernote Database
By this stage of my career, several months into my new role at the History Subject Centre and working in collaboration with a new (and very helpful) set of colleagues, I discovered a range of novel research tools. Evernote, which (at the time) offered a sleeker interface and a significant amount of cloud storage, quickly replaced OneNote in my workflow. Its automatic indexing made the searching of images very simple, even when full transcriptions had not yet been completed. Although Evernote did not provide full, machine-readable transcriptions, OneNote’s OCR had never been sufficient either; manual transcription or extensive correction had always been required.
Evernote’s integrated ‘Web Clipper’ made collecting images from online databases, now the main focus of my research, very straightforward, as did the ability to quickly import my existing Word Documents, Access Databases, PDFs and OneNote transcriptions. As with OneNote, I titled notes with bibliographic data. I could now also easily tag individual notes with bibliographic fields as well as with the coding developed for my Access database, which had been excluded from the OneNote database.
While the article-level coding was very useful for my continuing qualitative research, it was, in retrospect, a very blunt instrument. With some articles several columns long, determining the relevant keywords for any given piece was a highly subjective process, as was the creation of new keywords to describe new, or at least newly noticed, themes and topics. This added another layer of inconsistency to my database. Although it was now all housed in a single repository with significantly improved search facilities, the differences in collection, recording and encoding were becoming ever more apparent. Articles first encountered during my PhD now seemed unreliable and had to be used with great caution. Moreover, the ease of collecting new images, and the availability of a pseudo-OCR search facility, discouraged careful transcription of new articles. As full transcriptions were not needed for my immediate qualitative research, they were often put off, adding further inconsistencies to my collection.
2012-2013: Text Files, Excel Manifests and Evernote Images
Once my research turned towards mapping reprint networks, the search facilities offered by Evernote were no longer sufficient for text or metadata analysis, although its storage of image information continued to be very useful. At this stage, my transcription database diverged into three separate forms:
- Evernote notes, containing images and manual transcriptions with full formatting, and tagged with basic bibliographic and coding tags
- Plain text files, comprising machine-readable transcriptions that had been processed to remove capitalisation and non-alphabetic characters, for use in text-comparison software
- An Excel document listing the full bibliographic and contextual metadata for each article
These three sets of information were related, but fulfilled different roles in different stages of my research. At first, maintaining these three sets of information was manageable, using a simple Python programme to uniformly transform my Evernote transcriptions into plain text and a simple cut-and-paste of Evernote’s note list as the starting point for my metadata spreadsheet. Straightforward though it was, the process was also very uneven; I often spent significant time catching up on metadata, transcriptions or coding, despite efforts to maintain a consistent workflow.
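The normalisation step mentioned above can be sketched in a few lines of Python. This is a minimal reconstruction under stated assumptions (the original programme is not reproduced here, and the function name is my own): lower-case the transcription and replace every non-alphabetic run with a single space, ready for text-comparison software.

```python
import re

def normalise(text):
    """Reduce a transcription to lower-case alphabetic words separated by
    single spaces, discarding punctuation, digits and extra whitespace."""
    text = text.lower()
    # Replace any run of non-alphabetic characters with a single space
    text = re.sub(r"[^a-z]+", " ", text)
    return text.strip()
```

For example, `normalise("EMIGRATION to Van Diemen's Land, 1832!")` yields `"emigration to van diemen s land"`: a crude but uniform basis for detecting reprints.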
2014: XML Revelations
I cannot say precisely when the penny dropped, so to speak. It may have been my need to constantly flip back and forth between metadata spreadsheets, formatted transcriptions and reprint comparisons when composing my article on the Sydney Gazette. It may have been my annoyance at having to continuously reformat my transcriptions for use in printed publications, presentation slides or my blog. It was likely related to my frustration at trying to find a particular typographical or bibliographical marker while having only one of my three databases to hand or properly updated. I decided that a more robust transcription database was required.
Having had the opportunity to look briefly behind the curtain of the British Library’s 19th Century Newspaper Database, I found its use of XML intriguing. Although that implementation was overly complicated for my purposes, the basic idea of storing texts in this way seemed to have merit. Having already learned the value of separating content and formatting when creating webpages, the idea of doing the same for research materials was not too far-fetched. The fact that XML was extensible (that is, that I could create my own fields to suit my own research needs) made it even more attractive. As my interest in XML increased, I became increasingly familiar with the Text Encoding Initiative, and the level of detail that could be encoded alongside the original content, though I remain far from expert in its use.
There were thus many good, scholarly rationales for storing machine-readable transcriptions of historical texts in XML. Unlike images, they were truly full-text searchable and unlike plain transcriptions, XML data could include rich metadata and textual coding. It could also include complicated formatting and typographical information in a way that could be used to recreate the original presentation of the text but be easily disregarded when irrelevant.
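By way of illustration, a single entry in such a database might look something like the following. The element names and the article itself are invented for the example, not drawn from my actual schema; the point is that metadata, text, and typographical markup all live in one machine-readable record.

```xml
<article id="example-001">
  <newspaper>Caledonian Mercury</newspaper>
  <date>1832-03-15</date>
  <page>4</page>
  <text>
    <p>We understand that a vessel will shortly sail from
    <placeName>Leith</placeName> for <hi rend="smallcaps">Van
    Diemen's Land</hi>, carrying a large party of emigrants ...</p>
  </text>
</article>
```

A search tool can use the tags; a human-readable rendering can style or ignore them.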
Most importantly, through the use of XSLT, I could transform the database into plain text, HTML, or individual formatted documents, including as many or as few entries as I required, displaying them and their metadata in precisely the way I needed at that particular moment. Extracting a spreadsheet for statistical analysis, producing a PDF of relevant transcriptions for a conference, or creating a series of HTML pages for a website could all be done relatively quickly from a single source of data, without duplicating or damaging it. Because all these derivatives came from a single source, they were consistent in their content and required a single workflow for their maintenance and expansion.
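As a sketch of how little XSLT such a transformation can require (assuming hypothetical `article` and `text` elements rather than any real schema), a stylesheet that dumps every article as plain text only needs to match the article element and emit its textual content:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- Emit the text of each article, separated by a blank line -->
  <xsl:template match="article">
    <xsl:value-of select="text"/>
    <xsl:text>&#10;&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Swapping `method="text"` for `method="html"` and adding templates for the markup elements yields a web rendering from exactly the same source file.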
The Pain and Lamentation of Retroactive Data Management
Had I been starting a wholly new project, with new data and new objectives, creating a detailed data management plan using XML would have been a relatively simple and obvious step in my research workflow. I could have created a schema to ensure consistency alongside a detailed dictionary of the keywords and entities for encoding the text. I was not, however, beginning a new project. By the start of 2014 I had over 1000 transcriptions in various states and in various formats, not all of which had images to reference for typographic details. This was far too much data to discard, but also a seemingly insurmountable backlog.
The first step was to get my untranscribed images into a basic XML framework. This way, my entire collection would, at the very least, be machine readable. This was done with a small research grant from my faculty, used to employ an MA student to transcribe 300 of the untranscribed articles from my Evernote collection, including basic bibliographic, typographical and entity tagging (of mentioned places). This brought a large amount of data into a useful format very quickly (approximately 4-6 weeks). Unfortunately, as my XML schema was still ever-so-slightly in flux, a large number of the entity tags are add-ons, placed at the end of the entry rather than surrounding the actual text to which they refer. This, I am sure, will haunt me later on.
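To illustrate the difference (with invented markup, not my actual schema): an appended entity tag records *that* a place is mentioned, but not *where*, while an inline tag wraps the very text it refers to.

```xml
<!-- Add-on tagging: the place is recorded, but not located in the text -->
<article>
  <text><p>A vessel will sail from Leith in April ...</p></text>
  <placeName>Leith</placeName>
</article>

<!-- Inline tagging: the tag surrounds the text it describes -->
<article>
  <text><p>A vessel will sail from
  <placeName>Leith</placeName> in April ...</p></text>
</article>
```

Only the inline form supports, for example, highlighting every place name in a rendered transcription; converting the add-on tags will mean relocating each one by hand.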
The next step was to either
a. Finish the tagging of the newly entered transcriptions
b. Import my existing transcriptions into the XML database
c. Begin transcribing new articles, recently selected for an upcoming seminar paper
For obvious reasons, I chose to begin with (c), though I do undertake (a) and (b) as and when other commitments allow. What is important is that (a) and (b) exist as live tasks, and that I have no intention of abandoning them.
Data management is an important part of any research project and should always, if possible, be done at the start of the project. This allows for consistency, repeatability, and reuse of your material in the future. However, we do not always live in a world of carefully structured research goals and outcomes. Your academic plans will sometimes (read: always) branch off onto strange and unexpected avenues. This is to be expected and revelled in.
The above story of hoarding and evolution demonstrates the unfortunate missteps we can all make in our pursuit of new projects and better storage options. What would be worse, however, would be to abandon our past rather than learn from it. Integrating old data into a new data management system is not only possible, it may be hugely beneficial to new research. My current findings on Scots-Australian reprint networks would not have been possible if I had not kept my old data and found a way to integrate it with my new workflow—and not just because of the raw data itself. Having to go back and double-check bibliographic information, transcription accuracy and coding consistency brought wholly new aspects of these sources to my attention, leading to an article that was an absolute joy to write.
In my next post, I will describe how (re)storing your primary sources in XML, even a very basic form of it, can revolutionise your practice as a traditional historian and remove some of the guilt of your youthful indiscretions.