Språkbanken Text is a department within Språkbanken.

SUCX 3.0

Citation Information

Språkbanken Text (2024). SUCX 3.0 (updated: 2024-06-03). [Data set]. Språkbanken Text. https://doi.org/10.23695/9c9f-6132
Stockholm-Umeå corpus 3.0 scrambled

The Stockholm-Umeå Corpus (SUC) is a collection of Swedish texts from the 1990s, consisting of one million words in total. The corpus is balanced, meaning that it contains various text types and stylistic levels. The texts are annotated with part-of-speech tags, morphological analysis and lemma (all of which can be considered gold-standard data), as well as some structural and functional information.

Version 1.0 was developed in co-operation between Gunnel Källgren at Stockholm University and Eva Ejerhed at Umeå University and was made available in 1997 by the department of linguistics at Stockholm University.

Version 2.0 was made available in 2006 by Sofia Gustafsson Capkova and Britt Hartmann at the department of linguistics at Stockholm University. It contains the same texts as SUC 1.0 but is extended with some annotation. Additionally, SUC 2.0 contains bonus materials. TigerSUC is SUC 2.0 converted to TIGER-XML by Martin Volk. StorSUC is additional SUC material of four million words.

Version 3.0 has been available since 2012. It contains improved annotations, as well as unannotated texts totalling seven million words. (For the TIGER-XML version, Suc2c, Suc2d, and the DTDs, we still refer to version 2.0.)

Additional information about the compilation and annotation of SUC can be found in the SUC 2.0 manual [PDF].

Språkbanken distributes SUC 2.0 and SUC 3.0 in two variants:

  • SUC 2.0 and SUC 3.0: freely available for research; require a signed licence
  • SUCX 2.0 and SUCX 3.0: sentences in scrambled order; enriched with automatic annotations; downloadable without restrictions

SUCX 3.0

SUCX can be downloaded directly below under the open licence CC BY-SA. The order of the sentences in this version has been scrambled, and extra annotation has been added automatically by Språkbanken's processing pipeline. The corpus is distributed in Språkbanken's default XML format.

The following annotation is taken from the official version:

  • Part of speech (pos attributes of word elements)
  • Morphology (msd attributes)
  • Lemma (lemma attributes)
  • Named entity (SUC 3.0 only; <name> tags, not the <ne> tags)

All other annotation, such as the linking to Saldo, the dependency parses, and the alternative named entity annotation (<ne> tags), was created automatically by Sparv.
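
The attributes listed above can be read straight from the compressed download with a streaming parser. The following is a minimal sketch in Python, not an official reader: the element names in TOKEN_TAGS and SENTENCE_TAG are assumptions about the XML layout and should be checked against the downloaded file, the sample data is made up for illustration, and the file path in the final comment is hypothetical.

```python
import bz2
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed element names; check them against the downloaded file.
TOKEN_TAGS = ("w", "token")
SENTENCE_TAG = "sentence"


def iter_tokens(stream, token_tags=TOKEN_TAGS, sentence_tag=SENTENCE_TAG):
    """Yield (form, pos, msd, lemma) for each token element in an XML stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag in token_tags:
            yield (elem.text or "", elem.get("pos"), elem.get("msd"), elem.get("lemma"))
        elif elem.tag == sentence_tag:
            # Drop the tokens of finished sentences to limit memory use.
            elem.clear()


def pos_counts(stream):
    """Count tokens per part-of-speech tag."""
    return Counter(pos for _form, pos, _msd, _lemma in iter_tokens(stream))


# Tiny made-up sample standing in for the real file (attribute values are illustrative).
SAMPLE = (
    '<corpus><text><sentence>'
    '<w pos="NN" msd="NN.UTR.SIN.IND.NOM" lemma="katt">katt</w>'
    '<w pos="VB" msd="VB.PRS.AKT" lemma="sova">sover</w>'
    '</sentence></text></corpus>'
).encode("utf-8")

# bz2.open accepts a file object, so the sample can be compressed in memory.
with bz2.open(io.BytesIO(bz2.compress(SAMPLE)), "rb") as stream:
    print(pos_counts(stream))

# For the real download (path is an assumption):
# with bz2.open("suc3.xml.bz2", "rb") as stream:
#     print(pos_counts(stream).most_common(10))
```

Streaming with iterparse avoids loading the whole decompressed corpus into memory at once. If the file uses different element names, pass them via the token_tags and sentence_tag parameters.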

SUCX can also be used in Korp.

File            Description                            Size      Modified    Licence
suc3.xml.bz2    Scrambled version of the corpus (XML)  84.44 MB  2024-06-03  CC BY-SA 4.0
stats_suc3.csv  Word statistics (CSV)                  7.7 MB    2024-03-28  CC BY 4.0

Type

  • Corpus

Language

Swedish

Size

Sentences: 74,245
Tokens: 1,166,593

Updated

2024-06-03

Contact

Språkbanken Text
sb-info@svenska.gu.se