Eukalyptus Treebank of Written Swedish

Standard reference

Yvonne Adesam, Gerlof Bouma, and Richard Johansson. 2015. Defining the Eukalyptus forest – the Koala treebank of Swedish. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), pages 1–9, Vilnius, Lithuania. Linköping University Electronic Press, Sweden.

Data citation

Adesam, Yvonne, Bouma, Gerlof, & Johansson, Richard (2024). Eukalyptus skriven svenska (updated: 2024-01-25). [Data set]. Processed and distributed by Språkbanken. https://doi.org/10.23695/narz-e115

Further ways to cite the dataset.

A treebank of written Swedish, annotated with parts of speech, syntax in the style of the TIGER treebank, multiword units, and word senses.

The Eukalyptus Treebank of Written Swedish is a manually annotated treebank of 100,000 tokens, consisting of texts from five different genres: novels, Wikipedia, blogs, Europarl, and news and community information. The annotation consists of lemmas, word senses, and parts-of-speech (in part based on the SUC tagset) and syntactic structures (mainly based on MAMBA and SAG) which were developed together for the treebank. The Eukalyptus treebank was developed for evaluation purposes as part of the project Koala - Korps lingvistiska annotationer, att utveckla en infrastruktur för text-baserad forskning med högkvalitativa annotationer [Koala – Korp's linguistic annotations, developing an infrastructure for text-based research with high quality annotations], funded by Riksbankens Jubileumsfond (2014–2017; grant no. In13-0320:1 to Yvonne Adesam et al.).

Annotation

Manually annotated with lemmas, word senses, parts-of-speech, and syntactic structures.

Caveats

Known problems and future improvements
- Sources are only provided in the form of "scrubbed" texts, not as the original html/pdf files.
- Annotation guidelines are currently missing, but see the Publications directory for discussion of the design of the treebank and its annotation layers.
- The treatment of "när" (when) is inconsistent.
- The lemmata sometimes indicate compound part boundaries using "|", but far from always. (Consider removing these before use.)
- Multiword sense annotation uses a system of indices to refer to the location of the other parts of the multiword. So we might have:
<s id="s1">
[8 tokens omitted...]
<t id="s1_t9" word="för" sense="för_ro_skull..1" .../>
<t id="s1_t10" word="ro" sense="för_ro_skull..1:9" .../>
<t id="s1_t11" word="skull" sense="för_ro_skull..1:9" .../>
[...]
</s>
where the notation "för_ro_skull..1:9" means that the Saldo sense identifier is för_ro_skull..1 and that this multiword starts at the ninth token of the sentence (the s element). These indices are mostly correct in the sense attribute, but in the sense_ann attribute (see Documentation/annotation.txt) they are likely to be wrong, because the text was retokenized after the sense annotation phase.
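The two caveats above about lemmas and multiword senses can be handled with a little preprocessing. The following is a minimal Python sketch, not part of the Eukalyptus distribution: the helper names parse_sense and strip_compound_boundaries are hypothetical, the sample sentence is a simplified copy of the excerpt above with the elided attributes dropped, and "ord|klass" is a made-up lemma used only to show the "|" marker. It assumes that a sense value holds a single identifier, optionally followed by ":" and a token index.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional


class SenseRef(NamedTuple):
    sense_id: str               # Saldo sense identifier, e.g. "för_ro_skull..1"
    start_index: Optional[int]  # position of the multiword's first token, if given


def parse_sense(value: str) -> SenseRef:
    """Split a value such as 'för_ro_skull..1:9' into identifier and index."""
    head, sep, tail = value.rpartition(":")
    if sep and tail.isdigit():
        return SenseRef(head, int(tail))
    # No ":<number>" suffix: treat the whole value as the identifier.
    return SenseRef(value, None)


def strip_compound_boundaries(lemma: str) -> str:
    """Remove the optional '|' compound-part markers from a lemma."""
    return lemma.replace("|", "")


# Simplified sample modelled on the excerpt above (assumption: real files
# carry more attributes and more tokens per sentence).
SAMPLE = """<s id="s1">
<t id="s1_t9" word="för" sense="för_ro_skull..1"/>
<t id="s1_t10" word="ro" sense="för_ro_skull..1:9"/>
<t id="s1_t11" word="skull" sense="för_ro_skull..1:9"/>
</s>"""

sentence = ET.fromstring(SAMPLE)
refs = [(t.get("word"), parse_sense(t.get("sense", ""))) for t in sentence.iter("t")]

for word, ref in refs:
    print(word, ref.sense_id, ref.start_index)

print(strip_compound_boundaries("ord|klass"))
```

Because the indices in the sense_ann attribute are likely to be off after retokenization, a sketch like this is best applied to the sense attribute only.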

References

Yvonne Adesam, Gerlof Bouma, Richard Johansson (2015): Defining the Eukalyptus forest – the Koala treebank of Swedish, in Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania. Edited by Beáta Megyesi, pages 1-9

Download

File	Size	Modified	License
Eukalyptus-1.0.0.zip	4.58 MB	2024-01-25	CC-BY-SA-4.0
Eukalyptus-0.1.0.zip beta (zip)	3.66 MB	2024-01-25	Other
Eukalyptus-0.1.1.zip beta (zip)	3.8 MB	2024-01-25	Other
Eukalyptus-0.2.0.zip beta (zip)	4.19 MB	2024-01-25	Other
