SIC2 - Stockholm Internet Corpus

The Stockholm Internet Corpus 2 (SIC2) contains Swedish blog posts, annotated with part of speech, morphological features, and named entities. Annotation was done by Robert Östling, Johan Sjons and Johannes Bjerva. Version 2 was created by Aleksandrs Berdicevskis by making minor changes in the annotation and the format (see below). The original version 1 can be found here. Version 2 uses an extended CoNLL-U format. See more in the readme. The corpus is distributed under the Creative Commons Attribution-ShareAlike 3.0 Unported license.

File            Description              Size       Modified    Licence
sic2.xml.bz2    corpus (XML)             262.36 KB  2020-11-25  CC BY 4.0
stats_sic2.csv  token frequencies (CSV)  177.44 KB  2021-08-12  CC BY 4.0
sic2.zip        corpus (XML)             -          -           CC BY 4.0
readme.txt      readme (txt)             2.18 KB    2020-11-17  CC BY 4.0
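
The corpus file sic2.xml.bz2 is bzip2-compressed XML, so it can be read without unpacking it first. The following is a minimal sketch in Python, not an official loader: it only counts how often each XML element name occurs, because the element and attribute names used in the file are not described on this page (the readme is the place to check them). The function name count_elements is a hypothetical helper introduced for illustration.

```python
import bz2
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(path):
    """Count how often each XML element name occurs in a bz2-compressed XML file.

    Makes no assumption about the corpus schema: it reports whatever
    element names it finds.
    """
    counts = Counter()
    # bz2.open decompresses on the fly; iterparse streams the document
    # instead of building the whole tree in memory.
    with bz2.open(path, "rb") as handle:
        for _event, element in ET.iterparse(handle, events=("end",)):
            counts[element.tag] += 1
            element.clear()  # release children of elements already counted
    return counts
```

Calling count_elements("sic2.xml.bz2") on the downloaded file (the local path is an assumption) returns a Counter whose most_common() listing gives a quick overview of the file's structure before writing a schema-specific parser.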

Type

  • Corpus
  • Training and evaluation data

Language

Swedish

Tokens

13,562

Sentences

892

Contact

Språkbanken
sb-info@svenska.gu.se