Evaluation and refinement of an enhanced OCR-process for mass digitisation



There are great expectations placed on the capacity of heritage institutions to
make their collections available in digital format. Data-driven research is
becoming a key concept within the humanities and social sciences. Technologies
for converting images to machine-readable text (such as Optical Character
Recognition, OCR) play a fundamental part in making these resources available.
As the increased reliance on digital resources is accompanied by new
applications of digital technology, there is a clear need for an infrastructure
for the production and dissemination of reliable text data.

Project description

The purpose of this project is to fine-tune and evaluate a test
platform for OCR-production (henceforth referred to as the OCR-module) that
was developed by Kungliga biblioteket (KB) in cooperation with the Norwegian
software company Zissor in 2017. The module is designed to enable
adjustment and control of some key parameters of the post-capture stage of
OCR-production, e.g. dictionaries and linguistic algorithms, to match typical
features of the newspaper as a printed product: characteristics that, from a
historical perspective, change over time, such as layout, typography, and language.


The National Library of Sweden (KB)


Newspapers from 1818-1870, including images, manual transcriptions and OCRed data, are freely available here

Read a related description of the material and how it was analyzed at Språkbankensbloggen



