Evaluation and refinement of an enhanced OCR-process for mass digitisation

Background

Great expectations are placed on the capacity of heritage institutions to make their collections available in digital format. Data-driven research is becoming a key concept within the humanities and social sciences. Technologies for converting images to machine-readable text, such as Optical Character Recognition (OCR), play a fundamental part in making these resources available. As the increased reliance on digital resources is accompanied by new applications of digital technology, there is a clear need for an infrastructure for the production and dissemination of reliable text data.

Project description

The purpose of this project is to fine-tune and evaluate a test platform for OCR production (henceforth referred to as the OCR-module) that was developed by Kungliga biblioteket (KB) in cooperation with the Norwegian software company Zissor in 2017. The module is designed to enable adjustment and control of some key parameters of the post-capture stage of OCR production (e.g. dictionaries and linguistic algorithms) to match typical features of the newspaper as a printed product. Seen in a historical perspective, these characteristics, such as layout, typography, and language conventions, change over time.

Institutes/organizations

The National Library of Sweden (KB)
Zissor

Resources

Newspapers from 1818-1870 including images, manual transcriptions and OCRed data.
Newspapers from 1871-1906 including images, manual transcriptions and OCRed data.
Document- and segment-level annotations of newspapers from 1818-2018, made by two annotators, with annotation instructions (in Swedish).
Line-level segmentation and annotation are available for newspapers from 1818-1848.
Word lists used to process the material. See the Readme file for an overview.
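
The manual transcriptions and OCRed data listed above can be compared to estimate OCR quality. The following is a minimal sketch, assuming character error rate (CER) as the metric. This page does not state which metric the project used, and the function names and sample strings are hypothetical.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Return the edit distance (insertions, deletions, substitutions)."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion from the reference
                current[j - 1] + 1,      # insertion into the reference
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical sample strings, not taken from the project data.
    manual_transcription = "Kungliga biblioteket"
    ocr_output = "Kungl1ga bibliotekct"
    print(character_error_rate(manual_transcription, ocr_output))
```

A lower value means the OCR output is closer to the manual transcription. Word-level error rates can be computed the same way by passing lists of tokens instead of strings to a list-based variant of the distance function.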

Models

Trained with the Calamari OCR software. Available together with the test and training data. Described in Skelbye and Dannélls (2021).

Other

A blog post about the material and how it was analyzed, at Språkbankensbloggen
Earlier project on OCR: A free cloud service for OCR

Presentations

Digital Humanities in the Nordic Countries (DHN), 2019, University of Copenhagen. Evaluation and refinement of an enhanced OCR process for mass digitisation.
Språkbanken Text Spring Workshop, 2020, online event: Reference data for evaluation of OCR.
CLARIN Annual Conference 2020, online event: Evaluation of a Two-OCR engine Method: First Results on Digitized Swedish Newspapers Spanning over nearly 200 Years.
Språkbanken Text Autumn Workshop 2021, online event: En förbättrad OCR-process för KB:s massdigitalisering av dagstidningar [An improved OCR process for KB's mass digitisation of daily newspapers].

Publications

2021

Dana Dannélls, Lars Björk, Ove Dirdal, Torsten Johansson (2021): A Two-OCR Engine Method for Digitized Swedish Newspapers, in Selected Papers from the CLARIN Annual Conference 2020, Linköping Electronic Conference Proceedings 180
Molly Skelbye, Dana Dannélls (2021): OCR Processing of Swedish Historical Newspapers Using Deep Hybrid CNN–LSTM Networks, in Proceedings of the International Conference on Recent Advances in Natural Language Processing, 1–3 September, 2021 / edited by Galia Angelova, Maria Kunilovskaya, Ruslan Mitkov, Ivelina Nikolova-Koleva
Dana Dannélls, Shafqat Virk (2021): A Supervised Machine Learning Approach for Post-OCR Error Detection for Historical Text, in Linköping Electronic Press Workshop and Conference Collection. Selected contributions from the Eighth Swedish Language Technology Conference (SLTC-2020), 25-27 November, 2020

2020

Dana Dannélls, Lars Björk, Ove Dirdal, Torsten Johansson (2020): Evaluation of a Two-OCR Engine Method: First Results on Digitized Swedish Newspapers Spanning over nearly 200 Years, in CLARIN Annual Conference 2020, (Virtual Event), 5-7 October, 2020. Book of Abstracts
Dana Dannélls, Simon Persson (2020): Supervised OCR Post-Correction of Historical Swedish Texts: What Role Does the OCR System Play?, in Proceedings of the Digital Humanities in the Nordic Countries, 5th Conference, Riga, Latvia, October 21-23, 2020 / edited by Sanita Reinsone, Inguna Skadiņa, Anda Baklāne, Jānis Daugavietis

2019

Dana Dannélls, Torsten Johansson, Lars Björk (2019): Evaluation and refinement of an enhanced OCR process for mass digitisation, in Proceedings of the Digital Humanities in the Nordic Countries 4th Conference (DHN 2019), Copenhagen, Denmark, March 5-8, 2019 / edited by Costanza Navarretta, Manex Agirrezabal, Bente Maegaard
Brigitte Alfter (2019): Cross-border collaborative journalism