SIC2 - Stockholm Internet Corpus

The Stockholm Internet Corpus 2 (SIC2) contains Swedish blog posts annotated with part-of-speech tags, morphological features, and named entities. The annotation was done by Robert Östling, Johan Sjons and Johannes Bjerva. Version 2 was created by Aleksandrs Berdicevskis, who made minor changes to the annotation and the format (see below). The original version 1 can be found here. Version 2 uses an extended CoNLL-U format; see the readme for details. The corpus is distributed under the Creative Commons Attribution-ShareAlike 3.0 Unported license.
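
Because version 2 uses an extended CoNLL-U format, a file in that format might be read along the following lines. This is a minimal sketch, not a documented loader for the corpus: it assumes the ten standard tab-separated CoNLL-U columns come first and stores anything after them under a hypothetical `extra` key. The readme remains the authority on what the extension columns contain. The sample sentences are made up for illustration and are not taken from SIC2.

```python
# Standard CoNLL-U column names. The SIC2 extension columns are not
# described on this page, so anything beyond these ten is kept in "extra".
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text):
    """Split CoNLL-U-style text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Comment or metadata line; ignored in this sketch.
            continue
        columns = line.split("\t")
        token = dict(zip(CONLLU_FIELDS, columns))
        # Assumption: extra columns (if any) follow the standard ten.
        token["extra"] = columns[len(CONLLU_FIELDS):]
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


# Made-up sample (not from SIC2) with one hypothetical extra column.
SAMPLE = (
    "# sent_id = demo-1\n"
    "1\tJag\tjag\tPRON\tPN\t_\t_\t_\t_\t_\tO\n"
    "2\tbloggar\tblogga\tVERB\tVB\t_\t_\t_\t_\t_\tO\n"
    "\n"
    "1\tHej\thej\tINTJ\tIN\t_\t_\t_\t_\t_\tO\n"
)

sentences = parse_conllu(SAMPLE)
print(len(sentences), [tok["form"] for tok in sentences[0]])
```

Keeping the extension columns as an untyped list avoids guessing their meaning; once the readme has been consulted, they can be given proper names.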

File            Description            Size       Modified    License
sic2.xml.bz2    Corpus (XML)           262.36 KB  2020-11-25  CC BY 4.0
stats_sic2.csv  Word statistics (CSV)  177.44 KB  2021-08-12  CC BY 4.0
sic2.zip        Corpus (XML)           -          -           CC BY 4.0
readme.txt      Readme (txt)           2.18 KB    2020-11-17  CC BY 4.0
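
The XML download is bz2-compressed, so it can be streamed without unpacking it to disk first. The sketch below shows one way to see which element names such a file contains. It makes no assumption about the schema of `sic2.xml.bz2`: the `count_tags` helper is hypothetical, and the demo runs on a tiny made-up file whose tag names are illustrative only.

```python
import bz2
import os
import tempfile
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(path):
    """Stream a bz2-compressed XML file and count how often each tag occurs."""
    counts = Counter()
    with bz2.open(path, "rb") as handle:
        for _event, element in ET.iterparse(handle, events=("end",)):
            counts[element.tag] += 1
            element.clear()  # release content of elements already counted
    return counts


# Demo on a tiny made-up file. The tag names are hypothetical and say
# nothing about the real structure of sic2.xml.bz2.
demo_xml = (
    b"<corpus><sentence><token>Hej</token><token>!</token></sentence></corpus>"
)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.xml.bz2")
with bz2.open(demo_path, "wb") as out:
    out.write(demo_xml)

demo_counts = count_tags(demo_path)
print(dict(demo_counts))
```

Calling `count_tags("sic2.xml.bz2")` on the downloaded file would list the element names the export actually uses, which is a reasonable first step before writing a schema-specific reader.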

Type

  • Corpus
  • Training and evaluation data

Language

Swedish

Size

Sentences: 892
Tokens: 13,562

Contact

Språkbanken
sb-info@svenska.gu.se