
TalbankenSBX

Talbanken is a Swedish treebank. This is Språkbanken Text's version of Talbanken.

Talbanken is a widely used Swedish treebank; read more about its history and different versions here. This version originated as a copy of TalbankenSTB, but unlike the STB version, it is open to changes and corrections. It is also the version indexed by our search engine Korp. Our changes are recorded in changelog.txt.

Annotation

The following layers of annotation were added (or corrected) manually and can be considered gold data: tokenization, sentence segmentation, POS, MSD, dependency syntax (deprel and dephead).

Tokenization, sentence segmentation, POS and MSD follow the SUC format. Syntactic annotation follows the Mamba-Dep format, a conversion to dependency grammar of the MAMBA format used in the original Talbanken76.

Read more about these annotation layers in the documentation for TalbankenSTB or at Joakim Nivre's page: tokenization and sentence segmentation, POS and MSD, dependency syntax.

Formats and splits

TalbankenSBX is provided in our standard XML format and in a (pseudo-)CoNLL-U format, where UPOS is POS in the SUC format, XPOS is POS+MSD, FEATS is MSD converted to the UD/CoNLL-U standard, and DEPREL is a Mamba-Dep relation. The files currently contain no text or SpaceAfter attributes. You may convert our XML to this format yourself using the script in this repository.
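The column mapping above can be illustrated with a minimal reader sketch. It assumes the files use the ten standard tab-separated CoNLL-U columns with a blank line between sentences. The function name `read_sentences` and the two sample tokens (including the exact shape of the XPOS, FEATS and DEPREL values) are invented for illustration and are not taken from the treebank or from the conversion script.

```python
from typing import Dict, Iterable, Iterator, List

# The ten standard CoNLL-U columns, in order.
COLUMNS = ["id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc"]


def read_sentences(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts keyed by column name."""
    sentence: List[Dict[str, str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Skip comment lines, should any be present.
            continue
        else:
            fields = line.split("\t")
            if len(fields) != len(COLUMNS):
                raise ValueError(f"expected 10 columns, got {len(fields)}")
            sentence.append(dict(zip(COLUMNS, fields)))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Invented sample, NOT copied from TalbankenSBX: the tag values only
# illustrate which kind of information goes in which column.
SAMPLE = "\n".join([
    "\t".join(["1", "Hon", "hon", "PN", "PN.UTR.SIN.DEF.SUB",
               "Case=Nom|Definite=Def|Gender=Com|Number=Sing",
               "2", "SS", "_", "_"]),
    "\t".join(["2", "sover", "sova", "VB", "VB.PRS.AKT",
               "Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act",
               "0", "ROOT", "_", "_"]),
    "",
])

for sent in read_sentences(SAMPLE.splitlines()):
    for tok in sent:
        # upos: SUC POS, xpos: POS+MSD, deprel: Mamba-Dep relation
        print(tok["form"], tok["upos"], tok["xpos"], tok["deprel"])
```

Because UPOS here holds SUC tags rather than UD tags, tools that validate UPOS against the UD tag set may reject these files, which is why the format is described as pseudo-CoNLL-U.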

We provide two splits of TalbankenSBX. MorphSplit is used for POS-tagging purposes: the treebank is divided into two parts with the same number of sentences (the split is completely random, no blocks are used). One part is used as the development set, the other as the test set (SUC3 is the training set). You may re-split Talbanken yourself using the script in this repository.
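As a rough sketch of what a completely random, block-free split into two equal halves looks like, the snippet below shuffles sentence-level items and cuts the list in the middle. It is not the script from the repository: the function name `split_in_half` and the fixed `seed` are assumptions made here so that the result is reproducible.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_in_half(sentences: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle sentences individually (no blocks) and cut the list in two."""
    shuffled = list(sentences)
    # A dedicated Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


# Stand-in for the 6,160 sentences of the treebank: plain sentence indices.
dev, test = split_in_half(range(6160))
print(len(dev), len(test))
```

With 6,160 sentences, each half contains 3,080. Shuffling individual sentences rather than contiguous blocks means neighbouring sentences from the same text can end up in different sets.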

SyntSplit is used for dependency parsing: the treebank is divided into training, development and test sets. The training set is the same as the one in TalbankenSTB, whereas the development and test sets approximate those of the UD version as closely as possible. SyntSplit is provided only in the CoNLL-U format.

File                                                  Size       Modified    License
(unnamed)                                             1.54 MB    2017-06-07  CC BY 4.0
stats_TALBANKEN.txt (word statistics, CSV)            1.06 MB    2016-03-13  CC BY 4.0
changelog.txt (txt)                                   316 bytes  2020-06-11  CC BY 4.0
TalbankenSBX_morphsplit20200610.zip (zip)             3.64 MB    2020-06-11  CC BY 4.0
TalbankenSBX_syntsplit20200610.zip (zip)              807.09 KB  2020-06-11  CC BY 4.0

Type

  • Corpus
  • Training and evaluation data

Language

Swedish

Size

Sentences: 6,160
Tokens: 96,346

Contact

Språkbanken
sb-info@svenska.gu.se