Språkbanken Text is a department within Språkbanken.

TalbankenSBX

Citation Information

Språkbanken Text (2017). TalbankenSBX (updated: 2017-06-07). [Data set]. Språkbanken Text. https://doi.org/10.23695/6m9r-w377
Talbanken is a Swedish treebank. This is the Språkbanken Text version of Talbanken.

Talbanken is a widely used Swedish treebank; read more about its history and different versions here. This version originated as a copy of TalbankenSTB, but unlike the STB version, it is open to changes and corrections. It is also the version indexed by our search engine Korp. The changes we have made are listed in changelog.txt.

Annotation

The following layers of annotation were added (or corrected) manually and can be considered gold data: tokenization, sentence segmentation, POS, MSD, dependency syntax (deprel and dephead).

Tokenization, sentence segmentation, POS and MSD follow the SUC format. Syntactic annotation follows the Mamba-Dep format, a conversion to dependency grammar of the MAMBA format used in the original Talbanken76.

Read more about these annotation layers in the documentation for TalbankenSTB or at Joakim Nivre's page: tokenization and sentence segmentation, POS and MSD, dependency syntax.

Formats and splits

TalbankenSBX is provided in our standard XML format and in a (pseudo-)CONLLU format, in which UPOS is the POS tag in the SUC format, XPOS is POS+MSD, Feats is the MSD converted to the UD/CONLLU standard, and Deprel is a Mamba-Dep relation. There are currently no text and SpaceAfter attributes. You may convert our XML to this format yourself using the script in this repository.
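As an illustration of the column mapping described above, here is a minimal Python sketch that reads sentences from CONLLU-style text. It assumes the files follow the standard ten-column, tab-separated CoNLL-U layout (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The function name parse_conllu and the two-token sample are invented for this example; in particular, the way POS and MSD are joined in the XPOS column of the sample is a guess, not taken from the actual files.

```python
from typing import Dict, List

# Standard CoNLL-U column order (an assumption about the distributed files).
COLUMNS = ["id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Skip comment lines, if any are present.
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
        current.append(dict(zip(COLUMNS, fields)))
    if current:
        # Handle a final sentence that is not followed by a blank line.
        sentences.append(current)
    return sentences


# Hypothetical sample: SUC-style POS in UPOS, POS+MSD in XPOS,
# UD-style features in FEATS, Mamba-Dep-style labels in DEPREL.
SAMPLE = (
    "1\tHon\thon\tPN\tPN.UTR.SIN.DEF.SUB\t"
    "Case=Nom|Definite=Def|Gender=Com|Number=Sing\t2\tSS\t_\t_\n"
    "2\tsover\tsova\tVB\tVB.PRS.AKT\t"
    "Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act\t0\tROOT\t_\t_\n"
    "\n"
)

sentences = parse_conllu(SAMPLE)
for token in sentences[0]:
    print(token["form"], token["upos"], token["head"], token["deprel"])
```

Because UPOS holds SUC tags rather than UD tags, tools that validate the UPOS column against the UD tag inventory may reject these files; a plain reader like the one above avoids that check.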

We provide two splits of TalbankenSBX. MorphSplit is used for POS tagging: the treebank is divided into two parts with the same number of sentences (the split is completely random; no blocks are used). One part is the development set and the other is the test set (SUC3 is the training set). You may resplit Talbanken yourself using the script in this repository.
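The kind of split described above (completely random, sentence-level, two equal halves) can be sketched in a few lines of Python. This is not the script from the repository; the function name random_half_split and the seed value are assumptions made for this example, and the sentences are stood in for by integer indices.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def random_half_split(sentences: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle sentences individually (no blocks) and cut the list in half."""
    shuffled = list(sentences)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


# Stand-in data: one index per sentence, using the sentence count listed for the corpus.
dev, test = random_half_split(range(6160), seed=42)
print(len(dev), len(test))
```

With 6,160 sentences, each half contains 3,080 sentences. Fixing the seed makes a resplit repeatable, but a different seed (or a different shuffling routine) will not reproduce the published MorphSplit.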

SyntSplit is used for dependency parsing: the treebank is divided into training, development and test sets. The training set is the same as the one in TalbankenSTB, whereas the dev and test sets approximate dev and test in the UD version as closely as possible. SyntSplit is provided only in the CONLLU format.

Download

File                                         Size        Modified     Licence
(not listed)                                 1.54 MB     2017-06-07   CC BY 4.0
stats_TALBANKEN.txt (word statistics, CSV)   1.06 MB     2016-03-13   CC BY 4.0
changelog.txt                                316 bytes   2020-06-11   CC BY 4.0
TalbankenSBX_morphsplit20200610.zip          3.64 MB     2020-06-11   CC BY 4.0
TalbankenSBX_syntsplit20200610.zip           807.09 KB   2020-06-11   CC BY 4.0

Type

  • Corpus
  • Training and evaluation data

Language

Swedish

Size

Sentences: 6,160
Tokens: 96,346

Updated

2017-06-07

Contact

Språkbanken
sb-info@svenska.gu.se