
Evaluation and refinement of an enhanced OCR-process for mass digitisation

Background

Heritage institutions are widely expected to make their collections available in digital format, and data-driven research is becoming a key concept within the humanities and social sciences. Technologies for converting images to machine-readable text, such as Optical Character Recognition (OCR), play a fundamental part in making these resources available. As the increased reliance on digital resources is accompanied by new applications of digital technology, there is a clear need for an infrastructure for the production and dissemination of reliable text data.

Project description

The purpose of this project is to fine-tune and evaluate a test platform for OCR-production (henceforth referred to as the OCR-module) that was developed by Kungliga biblioteket (KB) in cooperation with the Norwegian software company Zissor in 2017. The module is designed to enable adjustment and control of some key parameters of the post-capture stage of OCR-production, e.g. dictionaries and linguistic algorithms, so that they match typical features of the newspaper as a printed product. These features, such as layout, typography, and language conventions, have changed over time.

Institutes/organizations

The National Library of Sweden (KB)
Zissor

Resources

Models

Other

Presentations

Publications

2021

2020

2019

Project duration

Project members

  • Dana Dannélls
    dana.dannells@svenska.gu.se
  • Lars Björk (PI)
  • Torsten Johansson

Funding

Research topics

  • OCR
  • digital humanities
  • historical material
  • cultural heritage
  • language technology

Project type

  • Research infrastructure project
  • Externally funded