
Evaluation and refinement of an enhanced OCR-process for mass digitisation


There are great expectations placed on the capacity of heritage institutions to
make their collections available in digital format. Data-driven research is
becoming a key concept within the humanities and social sciences. Technologies
for converting images to machine-readable text (such as Optical Character
Recognition – OCR) play a fundamental part in making these resources available.
As the increased reliance on digital resources is accompanied by new
applications of digital technology, there is a clear need for an infrastructure
for the production and dissemination of reliable text data.

Project description

The purpose of this project is to fine-tune and evaluate a test
platform for OCR-production (henceforth referred to as the OCR-module) that
was developed by Kungliga biblioteket (KB) in cooperation with the Norwegian
software company Zissor in 2017. The module is designed to enable
adjustment and control of some key parameters of the post-capture stage of the
OCR-production – e.g. dictionaries and linguistic algorithms – to match typical
features of the newspaper as a printed product, characteristics that in a historical
perspective change over time, such as layout, typography, and language.


The National Library of Sweden (KB)


Newspapers from 1818-1870, including images, manual transcriptions and OCRed data, are freely available here

Read a related description of the material and how it was analyzed on the Språkbanken blog




  • Dana Dannélls
  • Lars Björk (PI)
  • Torsten Johansson



  • OCR
  • digital humanities
  • historical material
  • cultural heritage
  • language technology


  • Research infrastructure project
  • Externally funded