Eukalyptus Treebank of Written Swedish

Standard reference

Yvonne Adesam, Gerlof Bouma, and Richard Johansson. 2015. Defining the Eukalyptus forest – the Koala treebank of Swedish. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), pages 1–9, Vilnius, Lithuania. Linköping University Electronic Press, Sweden.

Data citation

Adesam, Yvonne, Bouma, Gerlof, & Johansson, Richard (2024). Eukalyptus Treebank of Written Swedish (updated: 2024-01-25). [Data set]. Enriched and distributed by Språkbanken. https://doi.org/10.23695/narz-e115

Additional ways to cite the dataset.

A treebank with written Swedish data, with parts-of-speech, TIGER-style syntax, multiword expressions and sense annotation

The Eukalyptus Treebank of Written Swedish is a manually annotated treebank of 100 000 tokens, consisting of texts from five genres: novels, Wikipedia, blogs, Europarl, and news and community information. The annotation consists of lemmas and word senses, as well as parts-of-speech (in part based on the SUC tagset) and syntactic structures (mainly based on MAMBA and SAG), the latter two developed together for the treebank. The Eukalyptus treebank was developed for evaluation purposes as part of the project Koala - Korps lingvistiska annotationer, att utveckla en infrastruktur för text-baserad forskning med högkvalitativa annotationer [Koala – Korp's linguistic annotations, developing an infrastructure for text-based research with high quality annotations], funded by Riksbankens Jubileumsfond (2014–2017; grant no. In13-0320:1 to Yvonne Adesam et al.).

Annotation

Manually annotated with lemma, word senses, parts-of-speech, and syntactic structures.

Caveats

Known problems and future improvements
- Sources are only provided in the form of "scrubbed" texts, not as the original html/pdf files.
- Annotation guidelines are currently missing, but see the Publications directory for discussion of the design of the treebank and its annotation layers.
- The treatment of "när" (when) is inconsistent.
- The lemmata sometimes indicate compound part boundaries using "|", but far from always. (Consider removing these before use.)
- Multiword sense annotation uses a system of indices to refer to the location of the other parts of the multiword. So we might have:
<s id="s1">
[8 tokens omitted...]
<t id="s1_t9" word="för" sense="för_ro_skull..1" .../>
<t id="s1_t10" word="ro" sense="för_ro_skull..1:9" .../>
<t id="s1_t11" word="skull" sense="för_ro_skull..1:9" .../>
[...]
</s>
where the notation "för_ro_skull..1:9" means that the Saldo sense identifier is för_ro_skull..1 and that this multiword starts at the ninth token in the s element. These indices are mostly correct in the sense attribute, but in the sense_ann attribute (see Documentation/annotation.txt) they are likely to be wrong, due to retokenization after the sense annotation phase.
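
The multiword index notation and the "|" compound marks described above can be handled with a few lines of code. The following is a minimal Python sketch, not part of the distribution. It assumes a simplified sentence that carries only the id, word and sense attributes shown in the example (the real files have more attributes), and it assumes that the numeric suffix of a token id (the 9 in "s1_t9") equals the token's position in the sentence, which matches the example above but should be checked against the data. The function names split_sense, strip_compound_marks and multiword_groups are hypothetical helpers introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified sample: only the attributes shown in the example above.
SAMPLE = """<s id="s1">
<t id="s1_t9" word="för" sense="för_ro_skull..1"/>
<t id="s1_t10" word="ro" sense="för_ro_skull..1:9"/>
<t id="s1_t11" word="skull" sense="för_ro_skull..1:9"/>
</s>"""


def split_sense(value):
    """Split 'för_ro_skull..1:9' into ('för_ro_skull..1', 9).

    Values without a trailing ':<number>' are returned with None as index.
    """
    sense_id, sep, start = value.rpartition(":")
    if sep and start.isdigit():
        return sense_id, int(start)
    return value, None


def strip_compound_marks(lemma):
    """Remove the '|' compound part boundaries from a lemma."""
    return lemma.replace("|", "")


def multiword_groups(sentence):
    """Collect the words of each multiword, keyed by (sense id, start index).

    Assumption: the number after '_t' in a token id is its position in s.
    """
    by_pos = {}
    for tok in sentence.iter("t"):
        by_pos[int(tok.get("id").rsplit("_t", 1)[1])] = tok

    groups = {}
    for pos in sorted(by_pos):
        sense_id, start = split_sense(by_pos[pos].get("sense", ""))
        if start is None:
            continue  # single word, or the first token of a multiword
        key = (sense_id, start)
        if key not in groups:
            # The first token carries the plain sense id without an index.
            first = by_pos.get(start)
            groups[key] = [first.get("word")] if first is not None else []
        groups[key].append(by_pos[pos].get("word"))
    return groups


if __name__ == "__main__":
    print(multiword_groups(ET.fromstring(SAMPLE)))
```

Because the indices in the sense_ann attribute are likely to be wrong, a sketch like this is best applied to the sense attribute only, and any group whose start token is missing or carries a different sense identifier should be treated with suspicion.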

References

Yvonne Adesam, Gerlof Bouma, Richard Johansson (2015): Defining the Eukalyptus forest – the Koala treebank of Swedish, in Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania. Edited by Beáta Megyesi, pages 1-9

Download

File	Size	Modified	Licence
Eukalyptus-1.0.0.zip	4.58 MB	2024-01-25	CC-BY-SA-4.0
Eukalyptus-0.1.0.zip beta (zip)	3.66 MB	2024-01-25	Other
Eukalyptus-0.1.1.zip beta (zip)	3.8 MB	2024-01-25	Other
Eukalyptus-0.2.0.zip beta (zip)	4.19 MB	2024-01-25	Other
