The Kubhist corpus collection is a result of the Digidaily project, which ran from 2010 to 2014.
The collection is split into parts, by source periodical and decade.
The data in Kubhist is enriched with metadata such as issue price, printing location, periodicity, political tendency and publication timespan. This information comes from the Nya Lundstedt dagstidningar database at the National Library of Sweden.
Text quality is highly variable, due in part to uneven printing and stains on the scanned originals. Many issues open with strange character sequences that result from the OCR attempting to interpret title ornaments as text. The digital text becomes readable once it reaches the part of the page where the articles start.
Joakim Lilljegren (2018): Introduktion till Språkbankens historiska material i Korp, pages 1–23