The Eukalyptus Treebank of Written Swedish is a 100 000 token manually annotated treebank, consisting of texts from five different genres: novels, Wikipedia, blogs, Europarl, and news and community information. The annotation consists of lemmas, word senses, and parts-of-speech (in part based on the SUC tagset) and syntactic structures (mainly based on MAMBA and SAG), which were developed together for the treebank. The Eukalyptus treebank was developed for evaluation purposes as part of the project Koala - Korps lingvistiska annotationer, att utveckla en infrastruktur för text-baserad forskning med högkvalitativa annotationer [Koala – Korp's linguistic annotations, developing an infrastructure for text-based research with high quality annotations], funded by Riksbankens Jubileumsfond (2014–2017; grant no. In13-0320:1 to Yvonne Adesam et al.).
Standard reference
Yvonne Adesam, Gerlof Bouma, and Richard Johansson. 2015. Defining the Eukalyptus forest – the Koala treebank of Swedish. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), pages 1–9, Vilnius, Lithuania. Linköping University Electronic Press, Sweden.
Citation
Språkbanken Text (2024). Eukalyptus Treebank of Written Swedish (updated: 2024-01-25). [Data set]. Språkbanken Text. https://doi.org/10.23695/narz-e115
A treebank with written Swedish data, with parts-of-speech, TIGER-style syntax, multiword expressions and sense annotation
Annotation
Manually annotated with lemma, word senses, parts-of-speech, and syntactic structures.
Caveats
- Known problems and future improvements
- Sources are only provided in the form of "scrubbed" texts, not as the original html/pdf files.
- Annotation guidelines are currently missing, but see the Publications directory for discussion of the design of the treebank and its annotation layers.
- The treatment of "när" (when) is inconsistent.
- The lemmata sometimes indicate compound part boundaries using "|", but far from always. (Consider removing these before use.)
- Multiword sense annotation uses a system of indices to refer to the location of the other parts of the multiword. So we might have:
<s id="s1">
[8 tokens omitted...]
<t id="s1_t9" word="för" sense="för_ro_skull..1" .../>
<t id="s1_t10" word="ro" sense="för_ro_skull..1:9" .../>
<t id="s1_t11" word="skull" sense="för_ro_skull..1:9" .../>
[...]
</s>
where the notation "för_ro_skull..1:9" means that the Saldo sense identifier is för_ro_skull..1 and that this multiword starts at the ninth token in the sentence (the s element). These indices are mostly correct in the sense attribute, but in the sense_ann attribute (see Documentation/annotation.txt) they are likely to be wrong, due to retokenization after the sense annotation phase.
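The two caveats above about compound boundaries and multiword indices can be handled with small helpers when reading the data. The following Python sketch is illustrative only and is not part of the distribution: the names `SenseRef`, `parse_sense` and `strip_compound_marks` are hypothetical, the lemma "barn|bok" is an invented example, and the code assumes that a sense value carries at most one trailing ":N" index and that "|" in a lemma only ever marks a compound boundary.

```python
from typing import NamedTuple, Optional


class SenseRef(NamedTuple):
    """A sense value split into its parts (hypothetical helper type)."""
    saldo_id: str          # e.g. "för_ro_skull..1"
    start: Optional[int]   # token position where the multiword starts, if given


def parse_sense(value: str) -> SenseRef:
    """Split a value such as 'för_ro_skull..1:9' into identifier and index.

    Values without a trailing ':N' (as on the first token in the example
    above) are returned with start=None.
    """
    head, sep, tail = value.rpartition(":")
    if sep and tail.isdigit():
        return SenseRef(head, int(tail))
    return SenseRef(value, None)


def strip_compound_marks(lemma: str) -> str:
    """Remove '|' compound boundary marks from a lemma.

    Assumption: '|' has no other meaning in the lemma attribute.
    """
    return lemma.replace("|", "")


if __name__ == "__main__":
    print(parse_sense("för_ro_skull..1:9"))
    print(parse_sense("för_ro_skull..1"))
    print(strip_compound_marks("barn|bok"))  # invented example lemma
```

Since the indices in the sense_ann attribute are likely to be wrong, it may be safer to apply `parse_sense` only to the sense attribute, or to check that the token at the returned position actually carries the same Saldo identifier before relying on it.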
References
Yvonne Adesam, Gerlof Bouma, Richard Johansson (2015): Defining the Eukalyptus forest – the Koala treebank of Swedish, in Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania. Edited by Beáta Megyesi, pages 1-9
Download
File | Size | Modified | Licence |
---|---|---|---|
Eukalyptus-1.0.0.zip (other, zip) | 4.58 MB | 2024-01-25 | CC BY-SA 4.0 (attribution) |
Eukalyptus-0.1.0.zip (beta, zip) | 3.66 MB | 2024-01-25 | Mixed (see license.txt) |
Eukalyptus-0.1.1.zip (beta, zip) | 3.8 MB | 2024-01-25 | Mixed (see license.txt) |
Eukalyptus-0.2.0.zip (beta, zip) | 4.19 MB | 2024-01-25 | Mixed (see license.txt) |