Programming Plagiarism: Final Thoughts on a Multimodal Analysis of Scissors-and-Paste Journalism (Part 4)

This is part 4 of a 4-part series. Parts 1, 2 and 3 of this essay are available here.

Having examined a variety of textual, contextual and intertextual clues about newspaper networks, and the multimodal approaches historians can use to obtain them, we now turn our attention to bringing that data together in meaningful ways.

Creating an unweighted, directed network diagram of the late-Georgian press should be a relatively straightforward task. Once a single connection is established, via attributions or through comparing texts, a researcher can manually input this information into a matrix. From this, he or she can create network visualizations with programmes such as Gephi or UCINET to provide a map of possible network connections.
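By way of illustration, a handful of manually confirmed connections could be recorded as a simple directed edge list and exported as a CSV file that Gephi will read; the titles and reprint counts below are entirely hypothetical.

```python
import csv

# Hypothetical, manually confirmed connections:
# (source title, reprinting title, number of attested reprints)
connections = [
    ("Morning Chronicle", "Caledonian Mercury", 12),
    ("Caledonian Mercury", "Aberdeen Journal", 7),
    ("London Examiner", "Aberdeen Journal", 3),
]

# Gephi reads a plain edge list with Source, Target and (optionally) Weight columns.
with open("press_network_edges.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target", "Weight"])
    for source, target, weight in connections:
        writer.writerow([source, target, weight])
```

The same file, minus the Weight column, serves for a purely unweighted diagram.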

Yet, when dealing with a large, multimodal corpus, establishing a connection is fraught with difficulty. Data collected from disparate sources and confirming different types of links—institutional and geographical—need to be knitted together and weighted appropriately.

What needs to occur, in brief, is a translation of the implicit, experience-based hunches historians use to determine dissemination order into a list of clearly delimited criteria that can be understood by a computer and translated into specific actions for that computer to take. The computer is not magically determining connections that are impossible for the human researcher; indeed, a computer cannot (yet) undertake an action it has not been specifically programmed to do. Instead, using a computer to map dissemination pathways merely forces us to explain our criteria with exceptional clarity and then allows the computer to undertake the sorting process at a rate that a human researcher would be unable to match.

The most useful task that can be delegated to a computer is clearly the examination of article text for consistency and change. Because of the manner in which news content was transmitted, literally cut from one periodical and re-set for another, the means of comparison can be fairly simple. Unlike analyses for determining authorship, which must define characteristics of authorial voice such as sentence structure and word choice, we need only look for the most basic indicators of plagiarism: the duplication of individual phrases or larger sections of text. When undertaking this task manually, as any marker of undergraduate coursework will know, we can usually deduce the likelihood that something is plagiarized not only by comparing the text word for word, but also by discounting common phrases or standard means of describing well-known events or persons.

Knowing this, how do we translate this gut reaction into instructions a computer can follow?

Commercial and open-source plagiarism tools, such as TurnItIn and WCopyFind, offer researchers a quick and user-friendly means of identifying copies. Both include options for making a search more or less fuzzy by looking for longer or shorter nGrams, ignoring or including numerals and punctuation, and allowing for slight or egregious errors in the copying of text – that is, the changing of one or two words in an otherwise lifted statement. When run over a corpus, they can identify possible matches between two texts (in the case of WCopyFind) or indicate possible sources for individual phrases (in the case of TurnItIn). While both are useful for small-scale projects, wherein the researcher can use these indicators to inform which sources he or she should manually examine, neither allows for a fully automated ranking of which source was most likely used, and neither can easily be translated into a branched map of dissemination.
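To give a rough sense of what those fuzziness settings amount to, the sketch below normalises two plain-text articles and measures how many of one text's nGrams reappear in the other. It is an illustration of the general technique, not the code of either tool, and the normalisation choices are simply the "knobs" described above.

```python
import re

def normalise(text, ignore_numerals=True):
    """Lower-case the text and, optionally, strip numerals and punctuation."""
    text = text.lower()
    if ignore_numerals:
        text = re.sub(r"\d+", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)   # drop punctuation
    return text.split()

def ngram_overlap(text_a, text_b, n=5):
    """Proportion of n-grams in text_a that also appear in text_b.

    A smaller n makes the search fuzzier; a larger n demands longer verbatim runs.
    """
    tokens_a, tokens_b = normalise(text_a), normalise(text_b)
    grams_a = {tuple(tokens_a[i:i + n]) for i in range(len(tokens_a) - n + 1)}
    grams_b = {tuple(tokens_b[i:i + n]) for i in range(len(tokens_b) - n + 1)}
    return len(grams_a & grams_b) / len(grams_a) if grams_a else 0.0
```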

For automation, we must instead build up confidences. How likely, based on different criteria, is one version of a text to be the immediate predecessor of another? The first step is to create a group, or sub-corpus, of articles that share at least some text (a rough sketch of such a grouping pass follows the list below). For the best results, the comparisons should be multi-layered, comparing

  • Differently sized nGrams, to weed out common phrases but account for textual collages or very short snippets
  • Individual sentences, based on punctuation
  • Individual paragraphs, delimited by XML tags
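A grouping pass along these lines might, as a first approximation, cluster together any articles that share more than a handful of nGrams; the article ids and threshold here are invented for the sake of the sketch.

```python
from itertools import combinations

def shared_ngrams(tokens_a, tokens_b, n=5):
    """Count the n-grams two tokenised articles have in common."""
    grams_a = {tuple(tokens_a[i:i + n]) for i in range(len(tokens_a) - n + 1)}
    grams_b = {tuple(tokens_b[i:i + n]) for i in range(len(tokens_b) - n + 1)}
    return len(grams_a & grams_b)

def group_articles(articles, n=5, min_shared=3):
    """Cluster article ids into sub-corpora wherever two articles share text.

    `articles` maps an article id to its list of word tokens.
    """
    groups = {article_id: {article_id} for article_id in articles}
    for id_a, id_b in combinations(articles, 2):
        if shared_ngrams(articles[id_a], articles[id_b], n) >= min_shared:
            merged = groups[id_a] | groups[id_b]
            for member in merged:      # point every member at the merged group
                groups[member] = merged
    return {frozenset(members) for members in groups.values()}
```

In practice the sentence- and paragraph-level comparisons in the list above would feed the same grouping, and a pairwise pass of this kind becomes expensive on a large corpus, but the principle is the same.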

Unlike common plagiarism-detection programmes, this approach should create a much larger selection of related texts, requiring further examination to determine their place in the wider network. Once a sub-corpus of matches has been created, they must be ordered. The simplest and most straightforward method is chronological. Using metadata, texts can be ordered by date of publication. In addition to creating a simple timeline of appearances, this information, combined with geographic detail, can help determine the likelihood of a direct connection; a simple plausibility filter along these lines is sketched after the list below. For example, one text cannot be the immediate source of another if the two were

  • Published at the same time, such as both being morning editions on the same day
  • Published in sequence but at an impossible distance, such as a paper appearing in New York one day and in London the next, before the advent of telegraphic communication.
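A plausibility filter of this kind might look something like the following; the transit times, dates and place names are placeholders, and real figures would be drawn from postal and shipping records.

```python
from datetime import date

# Placeholder minimum transit times, in days, before the telegraph.
MIN_TRANSIT_DAYS = {
    ("London", "Edinburgh"): 2,
    ("London", "New York"): 21,
}

def could_precede(candidate, article):
    """Could `candidate` plausibly be the immediate source of `article`?

    Each argument is a dict with a 'date' (datetime.date) and a 'place'.
    Edition times are ignored here, so same-day pairs are simply excluded.
    """
    gap = (article["date"] - candidate["date"]).days
    if gap <= 0:                       # same day or later: ruled out
        return False
    route = (candidate["place"], article["place"])
    minimum = MIN_TRANSIT_DAYS.get(route, MIN_TRANSIT_DAYS.get(route[::-1], 0))
    return gap >= minimum              # too quick a journey is also ruled out

chronicle = {"date": date(1818, 3, 2), "place": "London"}
mercury = {"date": date(1818, 3, 5), "place": "Edinburgh"}
print(could_precede(chronicle, mercury))   # True: a three-day gap is plausible
```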

Other date discrepancies, such as three days between publication in London and Edinburgh, allow time for postal transmission, selection and typesetting, and are therefore reasonable and, indeed, common. Once a basic timeline is created, and obvious impossibilities removed, the programme can go about comparing an article with its predecessors to determine the most likely source. Using nGrams, sentences and paragraphs, we can give a raw percentage representing the degree to which this version resembles each previous incarnation.
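That raw percentage might be approximated by averaging the overlap at each of the three levels; the field names and the equal weighting below are placeholder choices for the researcher to adjust.

```python
def resemblance(later, earlier, n=5):
    """Rough percentage of `later` that reappears in `earlier`,
    averaged over n-grams, sentences and paragraphs.

    Both arguments are dicts with 'tokens', 'sentences' and 'paragraphs' lists.
    """
    def share(units_a, units_b):
        units_a, units_b = set(units_a), set(units_b)
        return len(units_a & units_b) / len(units_a) if units_a else 0.0

    grams_later = [tuple(later["tokens"][i:i + n])
                   for i in range(len(later["tokens"]) - n + 1)]
    grams_earlier = [tuple(earlier["tokens"][i:i + n])
                     for i in range(len(earlier["tokens"]) - n + 1)]

    layers = [
        share(grams_later, grams_earlier),
        share(later["sentences"], earlier["sentences"]),
        share(later["paragraphs"], earlier["paragraphs"]),
    ]
    return 100 * sum(layers) / len(layers)
```

Ranking every chronologically plausible predecessor by this figure gives a first ordering of likely sources.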

The analysis can also examine the addition or omission of text from previous versions, each change creating a new branch. If omitted text re-emerges at a later date, the later version must have copied from the original, rather than from its predecessor. If supplementary text is later omitted, it is more likely, though not guaranteed, that this later version also used the original. The same method can be used to identify collages. If the Caledonian Mercury contained articles from three different London periodicals, and the Aberdeen Journal contained a single article drawing on these three articles, it is more likely that the Aberdeen Journal used the Caledonian Mercury as its source than that it obtained the information independently from London. Finally, very simple changes between otherwise faithful reproductions can be telling. Retained mistakes, such as misspellings, incorrectly identified individuals and transposed characters, as well as the use of abbreviations or the replacement of words with numerals, all act as markers.
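The re-emergence test can be stated very compactly: count how many passages dropped by an intermediate copy nonetheless reappear in a later one. The sketch below works on sets of nGrams and is, again, only an approximation of that reasoning.

```python
def restored_passages(original, intermediate, later, n=5):
    """Count n-grams present in `original`, missing from `intermediate`,
    yet reappearing in `later` - evidence that `later` copied the original.

    Each argument is a list of word tokens.
    """
    def grams(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    omitted = grams(original) - grams(intermediate)   # text the intermediate dropped
    return len(omitted & grams(later))                # ...that the later version restores
```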

Once textual clues are weighted, we can look at the wider network and see how likely it is that this type of article, in this type of publication, came from this type of source:

  • How likely was this newspaper to get its news from London?
  • To get its Australian news from London?
  • To get newspaper summaries from London?
  • To get news from the Morning Chronicle?
  • From the London Examiner?
  • From the editor’s brother-in-law at Gravesend?

These likelihoods can be derived most accurately from manually entered metadata, as well as from previous mappings of the network, but they may also be scanned for automatically. By programming in the names of all known newspapers and periodicals, plus common phrases such as “Captain of the”, “From the Sloop” or “By the latest papers from”, the computer can search for references within the first or last twenty words of an article. The accuracy of this method, however, is questionable, and it may provide more false positives than are acceptable to a given researcher.
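Such a scan might be sketched as follows; the titles and stock phrases listed here are only the handful mentioned in this essay and would need to be extended considerably.

```python
KNOWN_TITLES = ["Morning Chronicle", "London Examiner", "Caledonian Mercury"]
ATTRIBUTION_PHRASES = ["captain of the", "from the sloop", "by the latest papers from"]

def find_attributions(text, window=20):
    """Look for known titles or stock attribution phrases in the first
    or last `window` words of an article."""
    words = text.split()
    head = " ".join(words[:window]).lower()
    tail = " ".join(words[-window:]).lower()
    hits = []
    for marker in KNOWN_TITLES + ATTRIBUTION_PHRASES:
        if marker.lower() in head or marker.lower() in tail:
            hits.append(marker)
    return hits
```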

Once all possibilities are examined, and all likelihoods weighted, these can be combined and contrasted, and the most likely source or sources declared, alongside a percentage representing the confidence the available information allows. Accepting that the most likely source is correct, so long as it meets a basic confidence threshold, allows the researcher to create linkages between individual articles along a dissemination pathway, as well as a weighted network between individual newspaper titles.
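Bringing these strands together might look something like the following, in which each criterion contributes a score between 0 and 1 and the weights and threshold are placeholders for the researcher's own judgement.

```python
def most_likely_source(candidates, weights=None, threshold=0.6):
    """Combine per-criterion scores into a single confidence and return the
    best candidate, or None if nothing clears the threshold.

    `candidates` maps each candidate source to scores between 0 and 1, e.g.
    {'Morning Chronicle': {'text': 0.9, 'chronology': 1.0, 'network': 0.4}}.
    """
    weights = weights or {"text": 0.5, "chronology": 0.2, "network": 0.3}
    combined = {
        source: sum(weights.get(criterion, 0) * score
                    for criterion, score in scores.items())
        for source, scores in candidates.items()
    }
    best = max(combined, key=combined.get)
    return (best, combined[best]) if combined[best] >= threshold else (None, combined[best])
```

Each accepted pairing then becomes one edge, with its confidence as the weight, in the network diagram described at the outset.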

While this series of questions and computations seems daunting at first, it can be reduced to a much simpler form. Envisioned as a series of flowcharts, each criterion can be established independently through a series of simple yes-or-no questions: the same questions most historians ask during traditional primary-source analysis.

Moreover, while such a computer programme could be written in a variety of programming languages, one of the most useful may be Python. Python is a high-level programming language and, as such, its syntax is relatively close to standard written English, giving it a comparatively gentle learning curve. Although Python runs more slowly than languages such as Visual Basic or C++, it allows most historians, with a very small amount of autodidactic training, to develop and continually refine the code to match their own particular mapping criteria, rather than rely upon accurately expressing their implicit methods of historical deduction to a programmer with little or no historical training. Once these functions prove robust for the task at hand, they can be translated into lower-level languages for faster execution.

Thus, thanks to the ever-growing range of digitized material, and the rapid improvement of computer processing power, historians are now able to map what was once un-mappable, using our own time-tested methods of historical deduction, sped up and made consistent through the use of digital tools that we, as historians, can and should engage with.

Image courtesy of Krystian Majewski
