This originally appeared as a 4-part blog series in April 2014. It has been consolidated here 1 June 2020.
The concept of scissors-and-paste journalism is not new. Indeed, the practice of obtaining, selecting and faithfully reproducing news content (without attributing its original author) dates to before the advent of what we would now call the newspaper and into the years of the handwritten newsletter. That historians have not, until very recently, explored the specific nature and nuance of these reprinting practices is simple pragmatism. Whether they are attempting to uncover the dissemination pathway of a single article, or to understand the exchange practices of a particular newspaper title, the task is a daunting one.
In order to achieve one-hundred-percent confidence in any given dissemination map, a historian would need to have read every newspaper ever printed, along with the personal papers of every newspaper editor, compiler and printer that has ever lived, and, for good measure, they would need to develop robust methods for examining the conversations that had taken place in every coffee house, tavern and postal exchange across the breadth of the world and throughout the entirety of time.
This, of course, is probably too much to ask of any historian, even a very diligent one.
Although we may never achieve one-hundred-percent confidence, recent developments have made achieving a reasonable degree of certainty more likely. The ongoing digitisation of historical newspapers has made it possible to obtain regular access to a larger percentage of possible reprints. Of even greater assistance are the efforts of certain digitisation projects to provide users with direct access to the machine-readable transcriptions of these digitised images. These texts, obtained through optical-character recognition software (such as that employed by Chronicling America) or manual transcription and OCR-correction (such as that championed by Trove), lie hidden behind most searchable databases, but only a select few providers have thus far made them accessible, indeed mine-able, by the general public. Yet, even when they remain hidden, and even if their quality remains highly variable, their existence has revolutionised research into dissemination pathways and provided the intrepid reprint hunter with two novel modes of inquiry.
The first is for the historian to select a set of articles for which there is good reason to believe a reprint exists, or for which he or she has already identified a number of reprints in the past. Having identified an appropriate text, the historian can then search for a selection of keyword phrases, or nGrams, in the relevant newspaper databases in order to obtain a reasonable number of hits.
This method of reprint analysis is hampered by many limitations. At best, the historian has a limited idea of where reprints, or indeed the original version, of any given text may appear. The proven commercial viability of newspaper digitisation, for genealogical and historical research, as well as the efforts of public or part-public projects, has led to an ever-growing number of online repositories. For any dissemination map to be considered robust, each of these must be searched with a consistent list of keyword strings, representing different portions of the article.
More importantly, mechanical limitations, such as variances in search interfaces or the quality of machine-readable transcriptions, often obscure the true reach of a given text. Even if a legitimate version of an article does exist within the database, these variances mean that there is no guarantee the researcher will ever find it without manually examining each individual page.
Finally, even supposing the historian is able to identify all versions of a given text within all current newspaper databases, this still represents only a tiny percentage of all possible prints. As with any preservation project, the costs associated with digitisation have led to the subjective selection of popular, representative or historically important titles from an already reduced catalogue of surviving hard-copy newspapers. Likewise, even if a newspaper has been selected for preservation, multiple editions and non-surviving issues mean that true certainty will always remain elusive, even with manual examination.
Another method for determining reprints is to retrieve machine-readable transcriptions en masse and analyse them for duplicated phrases or word groupings, a method currently employed by the Viral Texts project. This methodology has significant advantages over manual search-and-inspection research. First, the historian no longer needs to make an initial identification of an article for which there are likely reprints; instead, all articles can be compared with all other articles, highlighting new and perhaps wholly unexpected ‘viral texts’. Second, by using a computer processor, rather than the eyes and mind of a single historian, the time spent in research is vastly reduced, perhaps transforming a lifetime of work into a few dozen hours.
There are, of course, also disadvantages. Although seemingly more efficient than using a database’s proprietary search interface, this method requires full access to the raw OCR data, something provided by only a minority of databases. It also requires a highly specialised procedure for cleaning that data to a level at which no reprints will be excluded, and that procedure must, moreover, be refined to accommodate a range of dialects, typefaces and discourses. Finally, and worryingly, complete reliance on computer matching means that significant OCR errors, those that cannot be overcome through pre-designed replacement protocols, will forever obscure some reprints.
The irretrievability of a certain percentage of reprints is not, of course, a primary concern of the Viral Texts project, whose aim is to examine which ‘qualities—both textual and thematic—helped particular news stories, short fiction, and poetry “go viral” in nineteenth-century newspapers and magazines.’ Others, such as myself, who are primarily concerned with the path these texts took, and the practical mechanisms associated with their transmission, are seemingly left with the unsatisfying conclusion that no true map of the dissemination networks can ever be devised.
Yet, all hope is not lost.
Uncovering how pre-telegraphic newspapers obtained and distributed news must be a collaborative effort. However diligent they may be, and from whatever backgrounds they may hail, individual researchers suffer from the same restraints that have plagued the acquisition of human knowledge for millennia: a lack of time, money and resource. Moreover, these three factors are particularly harmful to the mapping of reprint networks because of the interdisciplinary nature of such a pursuit.
Before the 1840s, and indeed long afterwards, the transmission of news relied upon its physical movement across oceans and along roads, rails and rivers. The distribution of information, held statically in written material as well as mutably in rumour and conversation, relied upon physical social interaction. Whether employed as a courier or undertaking a personal transfer, information could not move without two persons physically meeting—save semaphore or a message in a bottle. Thus, to map the dissemination pathways is to map a social network.
This is not a new concept, of course. The mapping of communication networks is at the core of social network theory. What makes the mapping of 18th and 19th century newspapers particularly problematic is the fact that, unlike modern sociologists or computer scientists, historians have little hope of obtaining a statistically significant sample of network interactions, or, tragically, even determining what a statistically significant sample would comprise.
Leaving aside for the moment the ever-daunting reality of forever-lost material, let us concentrate, for example, on the existing corpus of nineteenth-century British periodicals. What comprises a statistically significant sample? The British Library’s Nineteenth-Century Newspaper Collection could, perhaps, be considered one. Careful consideration of regional, temporal and thematic breadth was undertaken when titles and date-ranges were selected for digitisation. Would mapping the dissemination pathways within this single corpus provide a representative sample?
That, of course, is not the right question. The correct question is ‘how do we map the dissemination pathways of this corpus in the first place?’
Discovering the secret history of a piece of text relies upon corroboration. No one method will suffice because we have no way of determining the repeatable accuracy of any particular methodology. Instead, we have to build up confidences; how likely is it that this scrap of text was obtained in this particular way at this particular time? Which clues have been left behind?
The Dateline – This tiny strip of text, a date and location indented at the start of an article, is our first clue. This snippet does not detail the time and location of the event described, as we would assume from modern journalistic practice. Instead, the dateline provides the writer’s source. A piece on the French capital might have the dateline “Paris”, but was just as likely to say Le Havre, Amsterdam or London, as it indicated the location of the newspaper’s informant, a source who may have only obtained the news second or third-hand themselves. Thus, you might be tempted to draw a connection, an edge, from a newspaper in Leeds or Edinburgh to one of these mighty ports and onward to the city of interest, but you would be treading on dangerous ground. Like any other word or phrase of a reprint, the dateline could be nothing more than a copy. Although a London daily might have a direct connection to Vienna, from whom it received Austrian updates, nothing demanded a Glaswegian editor change “Vienna” to “London” when the news was reprinted. Indeed, this would have been seen as lessening the authority of the information. So, while the dateline indicates a node on the map, the connection remains obscure.
The Section – As with the dateline, the source of the information can be gleaned from the heading above it. Those labelled London or France usually indicated that the material had arrived via post or courier from that location; yet others, such as New South Wales or West Indies might indicate origin, but might also only indicate the topic. Nonetheless, mapping the frequency of these sections across an entire run does add a layer of confidence that a particular newspaper did have some connection with these locations.
The Attribution – This beautiful rarity, lying majestically at the top or bottom of a text, declares boldly the source of the content. Despite the frustrating prevalence of A London Paper (or worse, An Evening Paper), these direct attributions offer the clearest and most unequivocal evidence of the path the news took. Yet, like the dateline, it at best proves only a node, not the previous node. At worst, it is a misattribution, leading the poor researcher down a blind alley.
The Reference – A bit more subtle than the attribution, the reference is a meandering nod to the source (at least, to the source of the writer of a particular text). ‘We find in the Examiner’ perhaps, or ‘An examination of the London dailies shows’. Provisos and prejudices still apply.
The Chronological Consistency – Explicit identifiers are helpful, but are fraught with dangers. Perhaps the most reliable way to track the spread of news is to seek out consistent copies and order them chronologically. If an identical piece appears first in the Morning Chronicle, and then the Scotsman and then the Berwick Advertiser, with sufficient time separating them to account for movement, it is perfectly logical to assume it took just that path. The chronological consistency model, however, does not account for splits and splinters. How are we to know, for example, that the Advertiser copied from the Scotsman and not the Chronicle directly?
The Change – Splits and splinters are better accounted for by changes and, in particular, errors: an omission or addition of an adjective; the misspelling of a key name; the reordering of the text. Any change, great or small, lends a clue to the evolutionary branching of a given news item.
The Exchange – But perhaps the best clues are not in the text at all. Business records and personal correspondence between newspapers can provide crucial information. Many newspapers maintained subscriptions, formal or informal, to other newspapers. These were sometimes even declared in the first issue of a new regional title. Victorian newspapers, especially, maintained independent exchange editors, whose duty it was to scour incoming publications for the best and most tantalising snippets. If a formal exchange between papers existed, it adds another layer of confidence that a particular pathway is correct.
The Family – Newspapers (and their editors) bred. Regional and colonial printers were often the former journeymen of older, more established papers, and children and siblings often joined in the family business. These young men or women often maintained friendly if not symbiotic relationships with their former employers and guardians. The uncreatively named Sydney Herald, for one, was very helpful in supplying Antipodean news to its Glaswegian namesake. With a bit of genealogical elbow-grease, we add another layer.
The Well-Worn Path – Once these layers are built up, once we know the postal roads and sea-routes best travelled, or the pages to which the editor’s shears most naturally move, we can see the paths of least resistance. An 1810 snippet on the Swan River settlement with no date, no location, no section, no attribution, not even a passing mention of its meandering path, might still find a place in the network. If, before 1815, every identifiable bit of Australian news that appeared in the Caledonian Mercury came from the Morning Chronicle, why not this one too?
With these manifold layers in mind, how best do we approach our statistically significant sample?
Having identified the many types of information that can help us understand these historical social networks, which is the best method for bringing them together? As with textual and contextual clues, the best approach for understanding wider dissemination networks is multi-layered, or, as the title of this series suggests, multimodal.
The first and most obvious modal divide is between the printed and the digital. This distinction is, however, slightly misleading, as a great deal of newspaper content is currently only available in microfilm format or through digitized images that, containing no machine-readable metadata, must be treated in a similar way as microfilm (see, for example, the 18th century holdings of Google News Archive). As both these formats must be examined manually, that is without the aid of keyword or full-text search functionality, we shall treat them alongside manual examination of loose or bound originals.
Manual examination, wherein a researcher flips, winds, or clicks through a chronological series of issues within a single newspaper title, has several advantages and limitations when mapping dissemination pathways. First, it allows, even encourages, the mapping of an entire title, rather than a few serendipitously chosen articles within it. Moving methodically from page to page, and issue to issue, the researcher has two, perhaps overlapping, options: manual cataloguing and manual transcription.
The first, in essence, is the creation of a highly specialized index for the title. Using a spreadsheet or database (or indeed pen and paper), the researcher can catalogue each individual article in a run, detailing its title, author, dateline, topic, and any references to its source or origin. The level of detail available from purely textual material is, of course, limited, and may result in wild fluctuations in the accuracy of any given network cluster. It does, however, have the advantage of giving a broad overview of the title, and how its different content types (shipping news, parliamentary news, foreign news, colonial news, miscellany) relate to each other in form and origin. Are different topics sourced from different, interlocking networks or does the paper as a whole follow a consistent pattern of news-gathering?
Moreover, by limiting the scope of the project to cataloguing textual clues, a researcher can theoretically move quickly from page to page, issue to issue, year to year, creating a tremendously useful resource in a relatively limited period of time—time being a particularly precious commodity when using material housed within a library at a distance from the researcher’s base. Even if the catalogue is limited to a particular type of content, or topic, it can provide a detailed network map against which the researcher, or a successor, can contextualize other material.
The second option is to catalogue metadata, such as the section or page number, alongside complete, manual transcriptions of articles, creating a corpus of machine-readable texts. Although these are seemingly less complete than fully digitized versions of the material, markup-languages such as XML can provide typesetting and other spatial information. These transcriptions can then be used alongside those obtained via optical character recognition to map changes and continuity in the dissemination of individual articles across many different titles. The costs and technical eccentricities associated with digitization and OCR projects currently limit our ability to undertake large-scale analyses of periodical networks. However, the creation and dissemination of machine-readable transcriptions for even a sub-section of this un-digitized material, collected to inform cognate projects, could revolutionise computer-aided analyses of newspaper networks.
Machine-readable digital content, of course, can and must be treated differently from printed or microform versions. Providers, such as the Library of Congress and Trove, that offer API access to digitized newspaper text offer researchers a tantalizing opportunity; they can bypass the messy, time-consuming process of manual transcription and delve into computer-aided analysis of the text itself – a process that I will address at greater length in my final post. However, as has been noted by David Smith, Ryan Cordell, Elizabeth Dillon and Charles Upchurch, this material, however kindly provided, is not always suitable for immediate analysis, owing to errors in the transliteration of images into machine-readable code. Nonetheless, reasonable corrections can be made through the use of dictionaries and replacement protocols – checking transcriptions against an appropriate dictionary, selecting unrecognized words, replacing commonly mis-transcribed characters and then checking the new word against the dictionary once more. Once sanitized to a reasonable degree, these texts can be stacked, or grouped, by textual similarity – that is, by the percentage of common nGrams – and then checked for consistency and change to determine likely pathways of dissemination.
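The dictionary-and-replacement routine described above might be sketched roughly as follows. This is only an illustration under stated assumptions: the tiny word list, the table of character confusions and the function names are all hypothetical, and a real project would need a period-appropriate dictionary and a confusion table tuned to its typefaces. For simplicity the sketch lower-cases everything and ignores punctuation.

```python
# Hypothetical word list; a real project would load a period-appropriate dictionary.
DICTIONARY = {"the", "house", "of", "commons", "sat", "last", "night"}

# Hypothetical confusion table: what the OCR produced -> what it may stand for
# (for example, the long s read as "f", or "m" read as "rn").
CONFUSIONS = {"f": ["s"], "1": ["l", "i"], "0": ["o"], "rn": ["m"]}


def candidates(word):
    """Yield variants of `word` with one confusable sequence replaced at one position."""
    for wrong, rights in CONFUSIONS.items():
        start = word.find(wrong)
        while start != -1:
            for right in rights:
                yield word[:start] + right + word[start + len(wrong):]
            start = word.find(wrong, start + 1)


def correct_word(word):
    """Return a dictionary word if the word, or a single replacement, matches one."""
    lower = word.lower()
    if lower in DICTIONARY:
        return lower              # already recognised, leave alone
    for variant in candidates(lower):
        if variant in DICTIONARY:
            return variant        # replacement checked against the dictionary
    return lower                  # still unrecognised, keep the OCR reading


def correct_text(text):
    """Apply the word-level correction to a whitespace-separated text."""
    return " ".join(correct_word(w) for w in text.split())


print(correct_text("The Houfe of Comrnons fat laft night"))
```

Only one replacement per word is attempted here, which keeps the routine conservative; words needing several corrections are left untouched rather than guessed at.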
Where machine-readable content exists, but is not readily available to researchers (as is the case with most commercial newspaper archives), alternative tactics must be used. One of the most straightforward, and immediately profitable, is that used by media historian Bob Nicholson in his delightful (honestly, go and read it) article on the dissemination of American jokes in British periodicals. Using full-text searching, a researcher can hunt for, rather than gather, relevant reprints by crafting a series of relevant search phrases – and crossing their fingers. Again, those working on very particular topics, a key event or literary text, can develop case-study networks that, when combined with title-wide maps or a plethora of other small-scale clusters, can yield impressive results. Likewise, commonly used textual indicators, such as ‘from the London Examiner’, can themselves be searched for and catalogued.
Thus, by combining the results of meticulous, chronological examinations, surface-level catalogues, and digitised nGram hunts, a community of researchers could, in theory, develop a multi-layer, multi-modal network diagram.
But how do we even begin to gather these far-flung resources together, and how do we knit them together once they appear?
Having examined a variety of textual, contextual and intertextual clues about newspaper networks, and the multimodal approaches historians can use to obtain them, we now turn our attention to bringing that data together in meaningful ways.
Creating an un-weighted, directional network diagram of the late-Georgian press should be a relatively straightforward task. Once a single connection is established, via attributions or through comparing texts, a researcher can manually input this information into a matrix. From this, he or she can create network visualizations with programmes such as Gephi or UCINET to provide a map of possible network connections.
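As a rough illustration of the matrix mentioned above, the sketch below turns a hand-entered list of established connections into an un-weighted, directional adjacency matrix. The function name and the example links are hypothetical; the resulting table could then be saved in whatever form a visualization programme expects.

```python
def build_matrix(links):
    """Build a directed, un-weighted adjacency matrix from (source, copier) pairs.

    Rows are sources and columns are the titles that copied from them.
    """
    titles = sorted({title for link in links for title in link})
    index = {title: i for i, title in enumerate(titles)}
    matrix = [[0] * len(titles) for _ in titles]
    for source, copier in links:
        matrix[index[source]][index[copier]] = 1
    return titles, matrix


# Hypothetical connections, entered by hand after examining attributions.
links = [
    ("Morning Chronicle", "Scotsman"),
    ("Scotsman", "Berwick Advertiser"),
]

titles, matrix = build_matrix(links)
for title, row in zip(titles, matrix):
    print(f"{title:20}", row)
```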
Yet, when dealing with a large, multimodal corpus, establishing a connection is fraught with difficulty. Data collected from disparate sources and confirming different types of links—institutional and geographical—need to be knitted together and weighted appropriately.
What needs to occur, in brief, is a translation of the implicit experience-based hunches historians use to determine dissemination order into a list of clearly delimited criteria that can be understood by a computer and translated into specific actions for that computer to take. The computer is not magically determining connections that are impossible for the human researcher; indeed, a computer cannot (yet) undertake an action it has not been specifically programmed to do. Instead, using a computer to map dissemination pathways is merely forcing us to explain our criteria with exceptional clarity and then allowing a computer to undertake the sorting process at a rate that a human researcher would be unable to match.
The most useful task that can be delegated to a computer is clearly the examination of article text for consistency and change. Because of the manner in which news content was transmitted, literally cut from one periodical and re-set for another, the means of comparison can be fairly simple. Unlike analyses for determining authorship, which must define characteristics of authorial voice such as sentence structure and word choice, we must only look for the most basic indicators of plagiarism, the duplication of individual phrases or larger sections of text. When undertaking this task manually, as any marker of undergraduate coursework will know, we can usually deduce the likelihood something is plagiarized by not only comparing the text word for word, but also by discounting common phrases or standard means of describing well-known events or persons.
Knowing this, how do we translate this gut-reaction to a computer?
Commercial and open-source plagiarism tools, such as TurnItIn and WCopyFind, offer researchers a quick and user-friendly experience in identifying copies. Both include options for making a search more or less fuzzy by looking for larger or shorter nGrams, ignoring or including numerals and punctuation, and allowing for slight or egregious errors in the copying of text – that is, the changing of one or two words in an otherwise lifted statement. When run over a corpus, they can identify possible matches between two texts (in the case of WCopyFind) or indicate possible sources for individual phrases (in the case of TurnItIn). While both methods are useful for small scale projects, wherein the researcher can use these indicators to inform which sources he or she should manually examine, neither allows for a fully automated ranking of which source was most likely used and certainly cannot easily be translated into a branched map of dissemination.
For automation, we must instead build up confidences. How likely, based on different criteria, is one version of a text the immediate predecessor of another? The first step is to create a group, or sub-corpus, of articles that share at least some text. For the best results, the comparisons should be multi-layered, comparing
- Differently sized nGrams, to weed out common phrases but account for textual collages or very short snippets
- Individual sentences, based on punctuation
- Individual paragraphs, delimited by XML tags
Unlike common plagiarism detection programmes, this should create a much larger selection of related texts, requiring further examination to determine their place in the wider network. Once a sub-corpus of matches has been created, they must be ordered. The simplest and most straightforward method is chronological. Using metadata, texts can be ordered by date of publication. In addition to creating a simple timeline of appearances, this information, combined with geographic detail, can help determine the likelihood of a direct connection. For example, one text cannot be the immediate source of another if they were
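A minimal sketch of such a layered comparison, assuming plain-text transcriptions, might look like the following. The function names, the nGram sizes and the choice to measure overlap against the smaller of the two texts are all illustrative assumptions; the paragraph layer is left out because it depends on the markup scheme of a given corpus.

```python
import re


def ngrams(text, n):
    """Return the set of n-word sequences in a text, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def sentences(text):
    """Return the set of normalised sentences, split on ., ! and ?."""
    found = set()
    for part in re.split(r"[.!?]+", text.lower()):
        normalised = " ".join(re.findall(r"[a-z0-9]+", part))
        if normalised:
            found.add(normalised)
    return found


def overlap(a, b):
    """Share of the smaller set that also appears in the other (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def compare(text_a, text_b, sizes=(3, 5)):
    """Score two texts on several layers: differently sized nGrams and whole sentences."""
    scores = {f"{n}-grams": overlap(ngrams(text_a, n), ngrams(text_b, n)) for n in sizes}
    scores["sentences"] = overlap(sentences(text_a), sentences(text_b))
    return scores


a = "The fleet sailed on Tuesday. The weather was fair."
b = "The fleet sailed on Tuesday. Nothing further is known."
print(compare(a, b))
```

Any pair scoring above a chosen floor on at least one layer could be placed in the same sub-corpus; where that floor should sit is a judgement for the individual researcher.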
- Published at the same time, such as both being morning editions on the same day
- Published in sequence but at an impossible distance, such as a paper appearing in New York on one day and London the next, before the advent of telegraphic communication
Other date discrepancies, such as three days between publication in London and Edinburgh, allow for postal transmission, selection and typesetting and are therefore reasonable and, indeed, common. Once a basic timeline is created, and obvious impossibilities removed, the programme can go about comparing an article with its predecessors to determine the most likely source. Using nGrams, sentences and paragraphs, we can give a raw percentage representing the degree to which this version resembles each previous incarnation.
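The removal of obvious impossibilities could be sketched as below. The records, field names and transit times are hypothetical: the three-day London to Edinburgh figure echoes the example above, while the transatlantic figure is a placeholder rather than a researched value. The sketch works with dates only, so it cannot distinguish morning from evening editions and simply treats same-day publication as ruling out a direct link.

```python
from datetime import date

# Hypothetical minimum gaps in days between publication in two places.
MIN_TRANSIT_DAYS = {
    frozenset({"London", "Edinburgh"}): 3,
    frozenset({"London", "New York"}): 20,  # placeholder, not a researched value
}


def could_be_source(earlier, later):
    """True if `earlier` appeared long enough before `later` to have been copied."""
    gap = (later["date"] - earlier["date"]).days
    pair = frozenset({earlier["place"], later["place"]})
    needed = MIN_TRANSIT_DAYS.get(pair, 1)  # assumption: at least one day otherwise
    return gap >= needed


def feasible_sources(article, corpus):
    """Return the possible immediate sources of `article`, in chronological order."""
    ordered = sorted(corpus, key=lambda a: a["date"])
    return [a for a in ordered if a is not article and could_be_source(a, article)]


chronicle = {"title": "Morning Chronicle", "place": "London", "date": date(1810, 3, 1)}
courier = {"title": "Courier", "place": "London", "date": date(1810, 3, 3)}
mercury = {"title": "Caledonian Mercury", "place": "Edinburgh", "date": date(1810, 3, 4)}

for source in feasible_sources(mercury, [mercury, courier, chronicle]):
    print(source["title"])
```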
The analysis can also examine the addition or omission of text from previous versions, each creating a new branch. If omitted text reemerges at a later date, the later version must have copied from the original, rather than its predecessor. If supplementary text is later omitted, it is more likely, but not guaranteed, that this later version used the original as well. The same method can also be used to identify collages. If the Caledonian Mercury contained articles from three different London periodicals, and the Aberdeen Journal contained a single article with information from these three different articles, it is more likely that it used the Caledonian Mercury as a source than obtaining the information independently from London. Finally, very simple changes between otherwise faithful reproductions can be telling. Retained mistakes, such as misspellings, incorrectly identified individuals, and transposed characters, as well as the use of abbreviations or the replacement of words with numerals, all act as markers.
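One of these tests, the re-emergence of omitted text, can be sketched as follows. This is a simplified illustration with hypothetical function names: it compares whole sentences only, and it assumes a single source, so a collage drawn from several papers would return no candidate at all.

```python
import re


def sentence_set(text):
    """Return the set of normalised sentences in a text."""
    found = set()
    for part in re.split(r"[.!?]+", text.lower()):
        normalised = " ".join(re.findall(r"[a-z0-9]+", part))
        if normalised:
            found.add(normalised)
    return found


def candidate_sources(later_text, earlier):
    """Return the earlier titles that contain every sentence the later text inherited.

    `earlier` maps a title to its text; all are assumed to predate `later_text`.
    A version that omitted a sentence cannot be the sole source of a later
    version in which that sentence reappears.
    """
    later = sentence_set(later_text)
    earlier_sets = {title: sentence_set(text) for title, text in earlier.items()}
    inherited = {s for s in later if any(s in sents for sents in earlier_sets.values())}
    if not inherited:
        return []
    return [title for title, sents in earlier_sets.items() if inherited <= sents]


earlier = {
    "Morning Chronicle": "The fleet sailed. The admiral was ill. Stores were low.",
    "Scotsman": "The fleet sailed. Stores were low.",
}
later = "The fleet sailed. The admiral was ill. Stores were low. We await further news."
print(candidate_sources(later, earlier))
```

Here the later text restores a sentence the Scotsman had dropped, so only the Morning Chronicle remains as a candidate; the new closing sentence is ignored because it appears in no earlier version.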
Once textual clues are weighted, we can look at the wider network and see how likely it is that this type of article, in this type of publication, came from this type of source:
- How likely was this newspaper to get its news from London?
- To get its Australian news from London?
- To get newspaper summaries from London?
- To get news from the Morning Chronicle?
- From the London Examiner?
- From the editor’s brother-in-law at Gravesend?
These likelihoods can be derived most accurately from manually entered metadata, as well as previous mappings of the network, but may also be scanned for automatically. By programming in the names of all known newspapers and periodicals, plus common phrases such as “From the Captain of the” or “From the Sloop” or “By the latest papers from”, the computer can search for references within the first or last twenty words of an article. The accuracy of this method, however, is questionable, and may provide more false positives than may be acceptable to a given researcher.
Once all possibilities are examined, and all likelihoods weighted, these can be combined and contrasted, and a most likely source(s) can be declared, alongside a percentage representing the confidence the available information allows. Accepting that the most likely source is correct, so long as it meets a basic confidence threshold, allows the researcher to create linkages between individual articles along a dissemination pathway, as well as a weighted network between individual newspaper titles.
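A rough sketch of such a scan is given below. The lists of titles and phrases are hypothetical stand-ins for a much longer gazetteer, and plain substring matching is used, which is precisely the kind of blunt instrument likely to produce false positives.

```python
# Hypothetical, deliberately short lists; a real scan would need far more entries.
KNOWN_TITLES = ["Morning Chronicle", "London Examiner", "Scotsman"]
SOURCE_PHRASES = ["from the captain of the", "from the sloop", "by the latest papers from"]


def find_references(text, count=20):
    """Look for known titles and source phrases in the first and last `count` words."""
    words = text.split()
    # The two windows are searched separately so that no phrase is
    # accidentally formed across the join between them.
    windows = [" ".join(words[:count]), " ".join(words[-count:])]
    hits = set()
    for window in windows:
        lowered = window.lower()
        for title in KNOWN_TITLES:
            if title.lower() in lowered:
                hits.add(("title", title))
        for phrase in SOURCE_PHRASES:
            if phrase in lowered:
                hits.add(("phrase", phrase))
    return sorted(hits)


opening = "By the latest papers from London we learn that the fleet has sailed."
middle = " ".join(["word"] * 30 + ["Scotsman"] + ["word"] * 30)
closing = "Copied from the Morning Chronicle."
print(find_references(f"{opening} {middle} {closing}"))
```

The mention of the Scotsman in the middle of the example is ignored, since it falls outside both twenty-word windows.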
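How the likelihoods are combined is left open above; one simple possibility, offered purely as an assumption, is a weighted average with a cut-off. The criteria names, weights, scores and threshold in this sketch are all hypothetical and would need to be set, and defended, by the individual researcher.

```python
def combine(scores, weights):
    """Weighted average of per-criterion likelihoods, each between 0 and 1."""
    total = sum(weights.values())
    return sum(scores.get(name, 0.0) * weight for name, weight in weights.items()) / total


def most_likely_source(candidates, weights, threshold=0.6):
    """Return (title, confidence) for the best candidate, or None if below the threshold."""
    ranked = sorted(
        ((combine(scores, weights), title) for title, scores in candidates.items()),
        reverse=True,
    )
    if not ranked:
        return None
    best_score, best_title = ranked[0]
    if best_score < threshold:
        return None
    return best_title, best_score


# Hypothetical weights: textual resemblance counts most, then habit, then timing.
weights = {"text": 0.5, "timing": 0.2, "habit": 0.3}
candidates = {
    "Morning Chronicle": {"text": 0.9, "timing": 1.0, "habit": 0.8},
    "Scotsman": {"text": 0.7, "timing": 1.0, "habit": 0.1},
}
print(most_likely_source(candidates, weights))
```

Returning nothing when no candidate clears the threshold keeps weak guesses out of the final network, at the cost of leaving some articles unplaced.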
While this series of questions and computations, at first, seems daunting, it can be reduced to a much simpler form. Envisioned as a series of flowcharts, each criterion is established independently through a series of simple yes-or-no questions; questions asked by most historians during traditional primary source analysis.
Moreover, while such a computer programme could be written in a variety of coding languages, one of the most useful may be Python. Python is a high-level computer programming language and, as such, it is very similar to standard written English and therefore has a relatively gentle learning curve. Although Python runs more slowly than languages such as Visual Basic or C++, it allows most historians, with a very small amount of autodidactic training, to develop and continually refine the code to match their own particular mapping criteria, rather than rely upon accurately expressing their implicit methods of historical deduction to a programmer with little or no historical training. Once these functions prove robust for the task at hand, they can be translated into lower-level languages for faster delivery.
Thus, thanks to the ever-growing range of digitized material, and the rapid improvement of computer processing power, historians are now able to map what was once un-mappable, using our own time-tested methods of historical deduction, sped up and made consistent through the use of digital tools that we, as historians, can and should engage with.