Eukalyptus Treebank of Written Swedish

A treebank with written Swedish data, with parts-of-speech, TIGER-style syntax, multiword expressions and sense annotation
The Eukalyptus Treebank of Written Swedish is a 100 000 token manually annotated treebank, consisting of texts from five different genres: novels, Wikipedia, blogs, Europarl, and news and community information. The annotation consists of lemmas, word senses, parts-of-speech (in part based on the SUC tagset), and syntactic structures (mainly based on MAMBA and SAG), which were developed together for the treebank. The Eukalyptus treebank was developed for evaluation purposes as part of the project Koala - Korps lingvistiska annotationer, att utveckla en infrastruktur för text-baserad forskning med högkvalitativa annotationer [Koala – Korp's linguistic annotations, developing an infrastructure for text-based research with high quality annotations], funded by Riksbankens Jubileumsfond (2014–2017; nr In13-0320:1 to Yvonne Adesam et al.).

Annotation

Manually annotated with lemma, word senses, parts-of-speech, and syntactic structures.

Caveats

* Known problems and future improvements
- Sources are only provided in the form of "scrubbed" texts, not as the original html/pdf files.
- Annotation guidelines are currently missing, but see the Publications directory for discussion of the design of the treebank and its annotation layers.
- The treatment of "när" (when) is inconsistent.
- The lemmata sometimes indicate compound part boundaries using "|", but far from always. (Consider removing these before use.)
- Multiword sense annotation uses a system of indices to refer to the location of the other parts of the multiword. So we might have:
<s id="s1">
[8 tokens omitted...]
<t id="s1_t9" word="för" sense="för_ro_skull..1" .../>
<t id="s1_t10" word="ro" sense="för_ro_skull..1:9" .../>
<t id="s1_t11" word="skull" sense="för_ro_skull..1:9" .../>
[...]
</s>
where the notation "för_ro_skull..1:9" means that the Saldo sense identifier is för_ro_skull..1 and that this multiword starts at the ninth token in the sentence (the s element). These indices are mostly correct in the sense attribute, but in the sense_ann attribute (see Documentation/annotation.txt) they are likely to be wrong, due to retokenization after the sense annotation phase.
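
The two clean-up steps suggested above (dropping "|" from lemmata and splitting the index off a multiword sense value) can be sketched in Python as follows. This is a minimal sketch under assumptions, not part of the treebank distribution: the function names are hypothetical, the fragment is a hand-written reduction of the example above keeping only the id, word and sense attributes, and each sense value is assumed to hold a single Saldo identifier with at most one trailing ":N" index. The actual files may be organised differently.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional


class SenseRef(NamedTuple):
    sense_id: str          # Saldo sense identifier, e.g. "för_ro_skull..1"
    start: Optional[int]   # token position where the multiword starts, if given


def parse_sense(value: str) -> SenseRef:
    """Split a value such as 'för_ro_skull..1:9' into identifier and index.

    Values without a trailing ':N' are returned with start=None.
    """
    head, sep, tail = value.rpartition(":")
    if sep and tail.isdigit():
        return SenseRef(head, int(tail))
    return SenseRef(value, None)


def strip_compound_boundaries(lemma: str) -> str:
    """Remove the '|' markers that sometimes indicate compound part boundaries."""
    return lemma.replace("|", "")


# Hand-written fragment (assumption): only attributes shown in the example above.
fragment = """<s id="s1">
  <t id="s1_t9" word="för" sense="för_ro_skull..1"/>
  <t id="s1_t10" word="ro" sense="för_ro_skull..1:9"/>
  <t id="s1_t11" word="skull" sense="för_ro_skull..1:9"/>
</s>"""

sentence = ET.fromstring(fragment)
parsed = [(t.get("word"), parse_sense(t.get("sense"))) for t in sentence.iter("t")]
```

Here parsed pairs each word with its sense identifier and, where present, the start index. Because the indices in sense_ann are likely to be wrong, it is probably safer to apply parse_sense to the sense attribute only.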

File Size Modified Licence
4.58 MB 2024-01-25 CC BY-SA 4.0 (attribution)
3.66 MB 2024-01-25 Mixed (see license.txt)
3.8 MB 2024-01-25 Mixed (see license.txt)
4.19 MB 2024-01-25 Mixed (see license.txt)

Type

  • Corpus
  • Training and evaluation data

Language

Swedish

Size

Tokens: 99,913

Keywords

  • treebank
  • contemporary

Contact

Språkbanken
sb-info@svenska.gu.se