Heritage institutions are widely expected to make their collections available in digital format. Data-driven research is becoming a key concept within the humanities and social sciences, and technologies for converting images to machine-readable text, such as Optical Character Recognition (OCR), play a fundamental part in making these resources available. As the increased reliance on digital resources is accompanied by new applications of digital technology, there is a clear need for an infrastructure for the production and dissemination of reliable text data.
The purpose of this project is to fine-tune and evaluate a test platform for OCR production (henceforth referred to as the OCR-module) that was developed by Kungliga biblioteket (KB, the National Library of Sweden) in cooperation with the Norwegian software company Zissor in 2017. The module is designed to enable adjustment and control of some key parameters of the post-capture stage of OCR production, such as dictionaries and linguistic algorithms. These parameters can be matched to typical features of the newspaper as a printed product: layout, typography, and language conventions, all of which have changed over time.
The National Library of Sweden (KB)
- Newspapers from 1818-1870 including images, manual transcriptions and OCRed data.
- Newspapers from 1871-1906 including images, manual transcriptions and OCRed data.
- Document- and segment-level annotations of newspapers from 1818-2018, produced by two annotators, together with the annotation instructions (in Swedish).
- Line-level segmentation and annotation are available for newspapers from 1818-1848.
- Word lists used to process the material. See the Readme file for an overview.
- Models trained with the Calamari OCR software, available together with the test and training data. Described in Skelbye and Dannélls (2021).
- A blog post about the material and how it was analyzed, published on Språkbankensbloggen.
- Digital Humanities in the Nordic Countries (DHN), 2019, University of Copenhagen. Evaluation and refinement of an enhanced OCR process for mass digitisation.
- Språkbanken Text Spring Workshop, 2020, online event: Reference data for evaluation of OCR.
- CLARIN Annual Conference 2020, online event: Evaluation of a Two-OCR engine Method: First Results on Digitized Swedish Newspapers Spanning over nearly 200 Years.
- Språkbanken Text Autumn Workshop, 2021, online event: En förbättrad OCR-process för KB:s massdigitalisering av dagstidningar (An improved OCR process for KB's mass digitisation of daily newspapers).