The Kubhist corpus of Swedish newspapers

Submitted by Dana Dannélls on 2019-09-15

Among the flurry of Språkbanken’s historical resources we find the Kubhist corpus – a diachronic collection of historical newspaper texts – in two versions: Kubhist 1 spanning the time period of 1750–1950, and Kubhist 2 spanning the time period of 1645–1926. Historical corpora of this kind, especially when available in searchable format, are valuable sources of information for learning about our history, language and culture. These are especially appealing for researchers coming from the digital humanities who study history, literature, linguistics, sociology and ethnology, just to name a few.

The first Kubhist corpus was born through a digitization effort at the national library of Sweden (KB) about a decade ago. The original editions of the newspaper texts were scanned and processed by an automatic transcription process called Optical Character Recognition (OCR) and were made available electronically on the Swedish newspaper archive. Shortly after the release of the material it was handed to Språkbanken. Thereafter, both the original pdf files and the OCR-processed and searchable text files, became available through Korp’s search interface. Unfortunately, access to this material is limited because the texts contain many word and character errors that were caused by the automatic transcription. For example, the texts contain words such as: mslag, indag, lnbg, Inbq instead of inslag; dk, bolk, lolk instead of folk; hcrre, hcrrc, bcrre, lcrre instead of herre; and ridning, dning, tidnlng, lidning instead of tidning. So when the user searches for the correct word in the Kubhist corpus, the returned search results will not include all the occurrences of that word.

The second Kubhist corpus with over 5.5 billion tokens is a result of a recent external digitization funding, where additional newspaper editions were scanned and re-processed through an enhanced OCR-process at KB. Although the quality of the automatic transcription process has improved significantly, the amount of erroneous transcriptions remains high. In a paper that we presented at the Digital Humanities in the Nordic Countries (DHN) conference we provide an overview of the different errors that were caused by the automatic transcription of the Kubhist 2 corpus.

In the beginning of this year we launched a research infrastructure project in collaboration with KB, where the primary aim is to improve the automatic transcription process and thereby increase the access to the material. If you are interested, you can read about the project in the paper that we also presented at DHN this year. Here I will only briefly summarize some of the steps the project involves.

The first part of the project that we have been focusing on so far, is selecting and preparing the images of the reference material. A reference material is a sample of error-free texts that is used to evaluate the amount of word and character errors in the automatically produced material. For our reference material we selected 400 digitized newspaper pages spread over 200 years. These pages were carefully chosen to reflect typical variations in layout and typography.

During the second part of the project an automatic document layout analysis was performed on each digitized page. During this process the structure of the page was identified on the grounds of the knowledge it contains, and each piece of knowledge was framed and numbered. An example of the result of the automatic layout analysis of one digitized page where 42 frames were identified is given below. For some pages the number of frames reached over 100. Altogether the sample contains 43.850 frames.

Inscannad tidning — A scanned newspaper page from December 1863 after automatic document layout analysis.

Last month, in August 2019, we send this material to a transcription company who specializes in double-keying. Once we have the manually transcribed texts for all pages, we will evaluate them against the automatically transcribed texts. We will apply two general approaches to evaluate the material: 1) quantitative analysis, performed automatically by machines, and 2) qualitative analysis, performed manually by human annotators. The results that we will obtain from the evaluation with respect to the most common transcription errors and the pages where these errors occur will help us to improve the automatic transcription process accordingly.

The challenges in correcting historical material are many. For some newspapers the quality of the paper and the print are so low that it becomes impossible to transcribe the text even for the human eye, as the example below illustrates. Notwithstanding, it would not be realistic to expect a complete error free material.

Inscannad text — Example of low quality of paper and print.

We do however anticipate an improvement of the majority of erroneous transcriptions in the Kubhist corpus. As a result, we will provide more accurate texts that are not only of interest to researchers but also to a wide range of individuals who are interested in knowing more about the Swedish history, language, and culture.

The Kubhist corpus of Swedish newspapers

Labels