
A free cloud service for OCR

Background

The project En fri molntjänst för OCR ('A free cloud service for OCR'), funded by the National Library of Sweden (51-KB709-2012), ran from 1 September 2013 until 31 August 2014. It was a collaboration between the University Library of the University of Gothenburg and Språkbanken at the Department of Swedish, Multilingualism, Language Technology of the University of Gothenburg.

Project description

The project aims to create a prototype Optical Character Recognition (OCR) web service for processing old Swedish texts printed in a blackletter (Fraktur) or roman typeface, using one of two open source OCR engines. Our ultimate goal is to provide a service that lets libraries, museums and archives upload any digitized document and retrieve high-quality OCRed text, regardless of the quality of the print.

In the project, we evaluated two open source OCR engines, OCRopus and Tesseract, and further developed one of them, OCRopus. Using OCRopus, we set up an open web service for OCR that can handle both Swedish blackletter and roman print. The pilot cloud service and web API can be found here. The material and tools developed in the project are freely available for download from this site under the CC-BY license.
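The page does not say which measure was used to compare the two engines. A common choice for OCR evaluation is the character error rate (CER): the edit distance between the OCR output and a hand-corrected reference transcription, divided by the length of the reference. The sketch below is only an illustration under that assumption. It is not the project's evaluation script, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length (assumed metric, see lead-in)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: one misrecognised character out of eight.
    print(character_error_rate("Göteborg", "Göteb0rg"))
```

With a measure of this kind, the output of each engine on the same page can be scored against the same reference, and a lower value indicates fewer character-level errors.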

Institutes/organisations

Språkbanken

Universitetsbiblioteket

Data resources, models and tools

Material

Evaluation script

Extensions to OCRopus

Trained OCRopus character models for Swedish

Trained Tesseract character models for Swedish

Post-processing

Publications

2016

Project duration

  • 1 September 2013 to 31 August 2014

Project members

Funding

  • Kungliga biblioteket (51-KB709-2012)

Research topics

  • OCR
  • historical material

Project type

  • Research project
  • Externally funded

Umbrella project

No