Språkbanken Text is a division of Språkbanken.

Eukalyptus skriven svenska (Eukalyptus, written Swedish)

Standard reference

Yvonne Adesam, Gerlof Bouma, and Richard Johansson. 2015. Defining the Eukalyptus forest – the Koala treebank of Swedish. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), pages 1–9, Vilnius, Lithuania. Linköping University Electronic Press, Sweden.

Citation

Språkbanken Text (2024). Eukalyptus skriven svenska (updated: 2024-01-25). [Data set]. Språkbanken Text. https://doi.org/10.23695/narz-e115

A treebank of written Swedish, annotated with parts of speech, syntax in the style of the TIGER treebank, multiword units, and word senses.

The Eukalyptus Treebank of Written Swedish is a manually annotated treebank of about 100,000 tokens, consisting of texts from five genres: novels, Wikipedia, blogs, Europarl, and news and community information. The annotation consists of lemmas, word senses, parts of speech (in part based on the SUC tagset), and syntactic structures (mainly based on MAMBA and SAG), which were developed together for the treebank. The Eukalyptus treebank was developed for evaluation purposes as part of the project Koala - Korps lingvistiska annotationer, att utveckla en infrastruktur för text-baserad forskning med högkvalitativa annotationer [Koala – Korp's linguistic annotations, developing an infrastructure for text-based research with high quality annotations], funded by Riksbankens Jubileumsfond (2014–2017; grant no. In13-0320:1 to Yvonne Adesam et al.).

Annotation

Manually annotated with lemmas, word senses, parts of speech, and syntactic structures.

Caveats

  • Known problems and future improvements
    - Sources are provided only in the form of "scrubbed" texts; the original html/pdf files are not included.
    - Annotation guidelines are currently missing, but see the Publications directory for discussion of the design of the treebank and its annotation layers.
    - The treatment of "när" (when) is inconsistent.
    - The lemmata sometimes indicate compound part boundaries using "|", but far from always. (Consider removing these before use.)
    - Multiword sense annotation uses a system of indices to refer to the location of the other parts of the multiword. So we might have:
    <s id="s1">
    [8 tokens omitted...]
    <t id="s1_t9" word="för" sense="för_ro_skull..1" .../>
    <t id="s1_t10" word="ro" sense="för_ro_skull..1:9" .../>
    <t id="s1_t11" word="skull" sense="för_ro_skull..1:9" .../>
    [...]
    </s>
    where the notation "för_ro_skull..1:9" means that the Saldo sense identifier is för_ro_skull..1 and that this multiword starts at the ninth token of the sentence (the s element). These indices are mostly correct in the sense attribute, but in the sense_ann attribute (see Documentation/annotation.txt) they are likely to be wrong, because the text was retokenized after the sense annotation phase.
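
The two notational points above (the "|" compound marks in lemmata and the ":N" suffix on multiword senses) can be handled with a few lines of string processing. The following is a minimal sketch in Python, not part of the treebank distribution. The names SenseRef, parse_sense, and strip_compound_marks are hypothetical, and the sketch assumes that each sense value holds a single Saldo identifier of the form name..N, optionally followed by ":" and a token position. The lemma "hus|båt" is an invented illustration, not taken from the data.

```python
import re
from typing import NamedTuple, Optional


class SenseRef(NamedTuple):
    """A sense value split into its two parts (hypothetical helper type)."""
    sense_id: str          # Saldo sense identifier, e.g. "för_ro_skull..1"
    start: Optional[int]   # 1-based position of the multiword's first token, if given


# Assumed shape: <name>..<number>, optionally followed by :<token position>.
_SENSE_RE = re.compile(r"^(?P<sense>.+?\.\.\d+)(?::(?P<start>\d+))?$")


def parse_sense(value: str) -> SenseRef:
    """Split a sense attribute value into identifier and optional start index."""
    match = _SENSE_RE.match(value)
    if match is None:
        # Value does not follow the assumed pattern; return it unchanged.
        return SenseRef(value, None)
    start = match.group("start")
    return SenseRef(match.group("sense"), int(start) if start else None)


def strip_compound_marks(lemma: str) -> str:
    """Remove the "|" compound part boundaries from a lemma."""
    return lemma.replace("|", "")


if __name__ == "__main__":
    print(parse_sense("för_ro_skull..1:9"))
    print(parse_sense("för_ro_skull..1"))
    print(strip_compound_marks("hus|båt"))
```

Because the indices in the sense_ann attribute are likely to be wrong, a start position parsed from that attribute should be treated as a hint and checked against the tokens of the sentence before it is used.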

Download

File    Size       Modified      License
        4.58 MB    2024-01-25    CC BY-SA 4.0 (attribution)
        3.66 MB    2024-01-25    Mixed (see license.txt)
        3.8 MB     2024-01-25    Mixed (see license.txt)
        4.19 MB    2024-01-25    Mixed (see license.txt)

Type

  • Corpus
  • Training and evaluation data

Language

Swedish

Size

Tokens: 99,913

Keywords

  • treebank
  • contemporary

Updated

2024-01-25

Contact

Språkbanken
sb-info@svenska.gu.se