This reference material is part of the so-called Kubhist corpus of Swedish newspapers. It was prepared as part of the RJ-financed project "Evaluation and refinement of an enhanced OCR-process for mass digitisation" in cooperation with Kungliga biblioteket (KB) and the Norwegian software company Zissor.

The material was first digitized at KB. It was then processed automatically using advanced document layout analysis, in which each section of the digitized page was framed and numbered. Next, each section was processed with Abbyy FineReader version 11. Finally, the material was manually transcribed by a transcription company that specializes in double-keying.

This particular subset contains a selection of newspapers from 1871-1906, one newspaper for each year. For each newspaper, only the second and fourth pages were processed.

In total, the subset contains 37 newspapers, 45,445 segments and 337,635 words.

The folder "svenska-tidningar-1871-1906" contains seven sub-folders:
(1) images_jpg: The scanned digitized pages.   
(2) images_pdf: The results from the document layout analysis on the digitized pages.
(3) ocr_abbyy: OCR results produced with Abbyy FineReader version 11.
(4) ocr_tesseract: OCR results produced with Tesseract 4.0.
(5) ocr_abbyy-tesseract: OCR results produced with the combined Abbyy-Tesseract module.
(6) transcribed: Manual transcription of the digitized pages.
(7) data-annotations: The list of annotation tags.
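
After unpacking the material, a quick check that all the listed sub-folders are present can be useful. The following is only a minimal sketch in Python: the function name summarize_corpus is hypothetical, the location of the unpacked folder is an assumption, and nothing is assumed about the file names or layout inside each sub-folder beyond the names listed above.

```python
from pathlib import Path

# Sub-folder names as listed in this README.
EXPECTED_SUBFOLDERS = [
    "images_jpg",
    "images_pdf",
    "ocr_abbyy",
    "ocr_tesseract",
    "ocr_abbyy-tesseract",
    "transcribed",
    "data-annotations",
]


def summarize_corpus(root):
    """Map each expected sub-folder to its file count, or None if it is missing."""
    root = Path(root)
    summary = {}
    for name in EXPECTED_SUBFOLDERS:
        folder = root / name
        if folder.is_dir():
            # Count regular files at any depth; the internal layout is not assumed.
            summary[name] = sum(1 for p in folder.rglob("*") if p.is_file())
        else:
            summary[name] = None
    return summary


if __name__ == "__main__":
    # Assumption: the corpus was unpacked into the current working directory.
    for name, count in summarize_corpus("svenska-tidningar-1871-1906").items():
        status = "missing" if count is None else f"{count} files"
        print(f"{name}: {status}")
```

A sub-folder reported as "missing" most likely means the archive was unpacked somewhere else or only partially.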

References:

Dana Dannélls, Torsten Johansson, and Lars Björk (2019). Evaluation and refinement of an enhanced OCR process for mass digitisation. In Proceedings of the Digital Humanities in the Nordic Countries 4th Conference (DHN 2019), Copenhagen, Denmark, March 5-8, 2019. Edited by Costanza Navarretta, Manex Agirrezabal, and Bente Maegaard.

Dana Dannélls, Lars Björk, Ove Dirdal, and Torsten Johansson (2021). A two-OCR engine method for digitized Swedish newspapers. In Selected Papers from the CLARIN Annual Conference, pages 65–73. Linköping University Electronic Press.

---

Released 2021-12-07 by Dana Dannélls 
Updated 2022-05-03
