After nearly a year of cleaning, processing, refining, reprocessing and analysing data, I felt cautiously optimistic about my results. Working from the British Library 19th Century Newspapers collection, I had compiled a list of reprinting (perhaps better characterised as duplication) between 1800 and 1865. After initial filtering, I was the happy owner of 780 individual monthly manifests of significant, time-sensitive instances of non-self-duplication; in other words, all matches of at least 200 words that were not within the same publication but were within 15 days; in other, other words, scissors-and-paste news.
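The filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the project: the `Match` record and its field names are hypothetical, and "within 15 days" is read here as an absolute difference of 15 days or fewer between the two issue dates.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Match:
    # Hypothetical record layout; the real manifests may be structured differently.
    source_title: str
    source_date: date
    target_title: str
    target_date: date
    word_count: int


def is_scissors_and_paste(m, min_words=200, max_days=15):
    """Return True if a match passes the three filters named in the text."""
    long_enough = m.word_count >= min_words            # at least 200 words
    different_title = m.source_title != m.target_title  # not self-duplication
    # Assumption: "within 15 days" means 15 days or fewer, in either direction.
    close_in_time = abs((m.target_date - m.source_date).days) <= max_days
    return long_enough and different_title and close_in_time


def filter_matches(matches, min_words=200, max_days=15):
    """Keep only the matches that pass every filter."""
    return [m for m in matches if is_scissors_and_paste(m, min_words, max_days)]
```

Keeping the thresholds as parameters makes it easy to rerun the same filter with a different minimum length.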
Random sampling suggested a very low incidence of non-news content (lottery and patent medicine adverts being the prime culprits) and I set about drawing initial conclusions for a working-papers symposium hosted by Will Slauter of the University of Paris. There, I laid before colleagues three premises:
- The average size of a match was roughly 300-600 words across the period (1822-1862)
- The average number of pages per issue with some duplicated material rose dramatically in 1837 and 1855
- The relative percentage of each issue containing duplicated material rose throughout the period
All three of these hypotheses will be expanded and refined over the coming months, but one, perhaps off-handed, request for clarification struck me as particularly interesting: How long was a normal newspaper article in this period? I had a rough answer, but as I spoke I realised that the level of variation across a given newspaper, across different genres of newspaper content, was so great that the matter required more rigorous documentation. I thus began to sample newspapers across the period. I began with June—a month randomly selected—in years ending in 0 and 5.
Mapping an Issue
Unfortunately, such analyses could not be done computationally with my existing data. Although the corpus does contain an article-level variant, the division between news stories is at a resolution that is neither sufficient nor consistent: the paucity of typographical, rather than semantic, divisions between stories makes accurate division nearly impossible. Instead, I used the page-level OCR transcriptions, dividing them manually at story breaks, that is, points where the topic or voice clearly changes based on close reading.
I calculated the word count of each article, advert and snippet using these OCR transcriptions. Comparing a random sample against manual word counts from the facsimile images showed the OCR transcriptions to be largely accurate (within 5%), and none of the pages suffered a level of OCR error that would have made manual counting a necessity. These data were then entered into Excel (a tutorial on achieving this with non-proprietary software would be much welcomed), each row an article title and each column a column of the original page. Visualising these as a column graph, with each set representing 100% of the column, resulted in the following four images.
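For readers who would rather avoid a spreadsheet, the same tabulation can be sketched in plain Python. The sketch below is not the workflow used for the images here. It assumes each manually divided item is recorded as a hypothetical `(column, title, word_count)` tuple, counts words by splitting on whitespace, and expresses every item as a percentage of its column so that each column sums to 100%.

```python
from collections import defaultdict


def word_count(text):
    """Count whitespace-separated tokens in an OCR transcription."""
    return len(text.split())


def column_shares(items):
    """Express each item as a percentage of the words in its page column.

    items: iterable of (column_number, title, word_count) tuples,
           a hypothetical layout chosen for this sketch.
    Returns a dict mapping column_number to a list of (title, percent).
    """
    items = list(items)  # allow two passes over the data

    # First pass: total words per column.
    totals = defaultdict(int)
    for column, _title, words in items:
        totals[column] += words

    # Second pass: each item's share of its column, in page order.
    shares = defaultdict(list)
    for column, title, words in items:
        total = totals[column]
        percent = 100.0 * words / total if total else 0.0
        shares[column].append((title, percent))
    return dict(shares)
```

The resulting percentages could then be passed to any plotting library as a stacked column chart, one bar per page column.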
In contrast to its original or facsimile representation, this re-visualised version of the Caledonian Mercury makes immediately clear the variety of article lengths present within a single issue. The first page, composed entirely of advertisements, contained three principal types: shipping, real estate and branded notices. The first column contains mostly shipping notices, around 75-100 words in length. The second and fourth contain larger national or regional adverts for the Kelso Races and Sun Fire Office, over 600 words in length. The remainder of the advertising space was allocated to medium-sized notices for real estate and printed works, all roughly 200 words long. Thus, taken as a whole, the advertisements within the Caledonian Mercury almost always fell between 150 and 250 words, with branded advertisements the notable exception.
Moving on to the second page, we find that almost half the copy is dedicated to parliamentary debates, suggested in other issues to be obtained from private correspondents rather than through scissors and paste. At 4304 words, it is undoubtedly the longest single ‘article’ within the issue. The third column is completed by five separate notices from Buenos Ayres, dated ten to twelve weeks earlier, ranging in size from 41 to 487 words. As a bundle, whether compiled by the Mercury or an unknown third party, this ran 1045 words long. The next column contained reprints from London and (ostensibly) Parisian newspapers, though computational matching suggests both sets were derived from publications in the British metropolis. The first two articles, at roughly 500 words, are the largest domestic news snippets in the issue, followed by snippets of about 75-100 words, the seeming standard for domestic news. The final column begins with a series of celebrity and sporting news items, again from London and measuring just over 100 words each, followed by another round of even shorter snippets, all around 60 words, and a selection of market prices, for which word counts are an unreliable measure; it concludes with two short editorial commentaries on previous news items. In sum, the second page is also entirely comprised of London-derived content, including lengthy parliamentary reports, relatively short royal, celebrity and sporting gossip, and very short London snippets. The only exceptions are the medium-length accounts from Buenos Ayres in the third column, whose immediate origins remain unclear.
News, Foreign and Domestic
Page three is both more and less consistent, containing an almost even division of column inches between very long (700+ words) articles and short (<150 words) snippets. The longest, again, is a further account of parliamentary debate (1364), followed by a lengthy letter to the editor (1038), book extract (736), and account of the Edinburgh Races (809). The remainder are a collection of private intelligences on foreign markets and industries, commentaries on political debates, and local Scottish news. Whereas nearly all the second page can be identified as either a reprint or significant paraphrase of London newspaper content, this page has only three likely reprints, all of which are longer-than-average and fully attributed. Whether the very short Scottish snippets (<100) are reprints or retellings is yet undetermined.
The final page of the issue is generally home to several longer pieces, including lengthy discussions of dissenting churches in Scotland, 18th-century women’s fashion and the navigation laws, from Mercury, Chamber’s and Times, respectively. Excluding the final two columns, filled with local and London prices, the remainder of the page is comprised of short-to-medium (100-600 word) pieces of miscellany. The shorter, more specific pieces were generally obtained from other Scottish or northern English newspapers and given either implicit or direct attribution. The longer texts fall under the heading of time-insensitive miscellany, or urban legend, such as advice on “How to kill a rattlesnake” or the relative benefit of using cotton bandages on burned skin.
Scissors and Paste
Excluding those dedicated to advertising, every page of the issue had some form of scissors-and-paste material. Much of this was implicitly signalled through the geographic heading “London”, even when attributions suggested an original foreign origin. Instances of significant duplication outside the “London” subsection, conversely, were usually directly attributed, with most Scottish and local news reduced to unattributed snippets of fewer than 100 words. Although variation clearly existed, the average article size, like the average advertisement length, was roughly 200 words.
On the one hand, the average word count for this issue of the Mercury was below that calculated computationally for the period, even if that period is reduced to the 1820s or 1825 alone; in both cases, most computationally derived matches are between 300 and 600 words long. However, if matches below 200 words are excluded, as they had been in the computational analysis, the average rises to 634, matching the computational analysis but excluding nearly 75% of articles from consideration.
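The effect of the length filter on the average can be checked with a short Python sketch. It assumes only a plain list of per-article word counts; the function name `threshold_summary` and the keys of its result are invented for illustration. It reports the mean over all articles, the mean over those at or above the threshold, and the share of articles the threshold removes.

```python
from statistics import mean


def threshold_summary(word_counts, min_words=200):
    """Compare the average article length with and without a minimum length.

    word_counts: list of per-article word counts (must not be empty).
    min_words:   articles shorter than this are excluded, as in the filter.
    """
    if not word_counts:
        raise ValueError("word_counts must contain at least one value")

    kept = [n for n in word_counts if n >= min_words]
    return {
        "mean_all": mean(word_counts),
        # None signals that no article survived the threshold.
        "mean_kept": mean(kept) if kept else None,
        "share_excluded": 1 - len(kept) / len(word_counts),
    }
```

Running this over the same counts with `min_words=80` and then `min_words=200` shows directly how much of an issue each setting discards.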
In the first iteration of this project, the minimum register-able match had been 80 words, a setting that would have included 60% of those examined here. After processing several decades, I had concluded that such a fine resolution was unnecessary to identify instances of scissors and paste; such small matches rarely if ever appeared in my initial results and were almost always false positives. Nonetheless, nearly all the attributed duplications in this issue were under 150 words, along with many of the geographically implied reprints.
This has led me to the uneasy conclusion that while the 200-word filter was sufficient to capture nearly all duplications in the early years of this corpus, it will inevitably fail as the corpus is expanded to include more titles. Moreover, my analysis of duplication size was perhaps unfairly predicated on preconceptions—based on soundings and samplings early in my research—that larger articles represented ‘real’ scissors-and-paste journalism, and smaller snippets were merely paraphrased recounting.
Over the next few weeks, I will continue to develop a catalogue of newspaper anatomies in the hope of answering, once and for all, what was the average length of a 19th-century newspaper article?