Språkbanken

1 December, 2008 | Lars Borin

SUC PS tree

The Swedish treebank, now being made available in a preview version for evaluation, is the product of work by researchers at the universities of Uppsala (Computational Linguistics, Department of Linguistics and Philology) and Växjö (the Language Technology research group in the School of Mathematics and Systems Engineering). The treebank results from harmonizing the linguistic information in two existing Swedish language resources:

 

  1. Talbanken, a corpus of written and transcribed spoken Swedish from the 1970s, manually annotated with syntactic information according to a traditional Scandinavian approach to syntactic analysis
  2. SUC (Stockholm Umeå Corpus), a morphosyntactically annotated (all corpus words are tagged with part of speech and lemma), balanced corpus of published Swedish written language from the 1990s

In brief, the harmonization proceeded as follows: Talbanken was annotated with the morphosyntactic tags used in SUC in a semiautomatic process, and both Talbanken and SUC were automatically annotated with a phrase structure version of Talbanken's original syntactic analysis. Because this syntactic annotation is automatic, errors are to be expected, particularly in SUC. A preliminary evaluation of the annotation, presented at a post-conference workshop at SLTC 2008, shows that the syntactic annotation is nevertheless very useful in corpus-linguistic investigations.

1 December, 2008 | Lars Borin

Format

The Swedish treebank is distributed in the TIGER-XML format, so that it can be used with the freely available TIGERSearch tool. TIGERSearch can be downloaded from the Institut für Maschinelle Sprachverarbeitung at the University of Stuttgart.
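
For readers who want to inspect TIGER-XML files outside TIGERSearch, the following is a minimal sketch in Python that reads one sentence graph and prints it as a bracketed tree. It makes several assumptions: the sample sentence is invented for illustration and is not taken from the treebank, and the feature names `word`, `pos` and `cat` follow common TIGER-XML examples. The names actually used in the Swedish treebank are declared in the header of the distributed files and may differ.

```python
import xml.etree.ElementTree as ET

# Invented two-word sentence in TIGER-XML layout (illustration only).
# The feature names "word", "pos" and "cat" are assumptions; check the
# corpus header for the names used in the actual files.
SAMPLE = """<corpus id="demo">
  <body>
    <s id="s1">
      <graph root="s1_500">
        <terminals>
          <t id="s1_1" word="Hon" pos="PN"/>
          <t id="s1_2" word="sover" pos="VB"/>
        </terminals>
        <nonterminals>
          <nt id="s1_500" cat="S">
            <edge label="SS" idref="s1_1"/>
            <edge label="FV" idref="s1_2"/>
          </nt>
        </nonterminals>
      </graph>
    </s>
  </body>
</corpus>"""


def bracket(node_id, nodes):
    """Return a bracketed string for the subtree rooted at node_id."""
    node = nodes[node_id]
    if node.tag == "t":
        # Terminal node: part of speech plus word form.
        return "({} {})".format(node.get("pos"), node.get("word"))
    # Nonterminal node: follow each primary edge to its child.
    children = " ".join(
        bracket(edge.get("idref"), nodes) for edge in node.findall("edge")
    )
    return "({} {})".format(node.get("cat"), children)


def sentences(xml_text):
    """Map each sentence id to a bracketed rendering of its graph."""
    root = ET.fromstring(xml_text)
    result = {}
    for s in root.iter("s"):
        graph = s.find("graph")
        # Index terminals and nonterminals by their id attribute.
        nodes = {n.get("id"): n for n in graph.iter() if n.tag in ("t", "nt")}
        result[s.get("id")] = bracket(graph.get("root"), nodes)
    return result


print(sentences(SAMPLE))
```

The sketch ignores edge labels and any secondary edges, and it loads the whole document into memory. For a full corpus file, an incremental parser such as `xml.etree.ElementTree.iterparse` would probably be a better fit.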

License

The treebank part of the Swedish treebank (i.e., the added syntactic annotations) is free, under an open source license.

Talbanken is freely available for research and education purposes.

SUC requires each user to sign an individual license agreement with the Department of Linguistics, Stockholm University. As of 1 December 2008, licensing of SUC is entrusted to Språkbanken, University of Gothenburg. The license text can be downloaded in PDF format here.

Distribution

The Swedish treebank is distributed by Språkbanken, University of Gothenburg.

If you already have a SUC license, you will receive download instructions and a password from us. Others will first need to sign a SUC license agreement (see above).