Swedish Treebank
The Swedish Treebank is a syntactically annotated corpus of Swedish, created by merging, harmonizing
and partially reannotating two existing corpora, Talbanken [1, 2]
and the Stockholm-Umeå Corpus (SUC) [3,4]. The Swedish Treebank
has been created through a collaboration between the Department of Linguistics and Philology at
Uppsala University and the School of Mathematics and Systems Engineering at Växjö University.
The treebank is distributed by Språkbanken at the
University of Gothenburg and is freely available for research and education but requires the user
to have a license for SUC 2.0.
Below we begin by describing the overall process of merging, harmonizing and reannotating the two source
corpora, and the way in which this process has determined properties of the synthesized treebank. We then
go on to describe the following aspects of the treebank and its annotation:
- Tokenization and sentence segmentation
- Morphological annotation (parts of speech and morphological features)
- Syntactic annotation (phrase structure and grammatical functions)
- Encoding format (TIGER-XML)
We conclude with acknowledgments and references.
Synthesizing the Swedish Treebank from Talbanken and SUC
Talbanken: Talbanken is a syntactically annotated corpus, containing both written and
spoken Swedish, produced in the 1970s at the Department of Scandinavian Languages,
Lund University, by a group led by Ulf Teleman. In total, the corpus contains about
350,000 tokens, divided into 200,000 tokens of written text (professional prose and
high school essays) and 150,000 tokens of spoken language (interviews, debates, and
informal conversations). The original annotation consists
of two layers: a lexical layer, with parts of speech and morphological features,
and a syntactic layer, with a relatively flat phrase structure and grammatical
functions.
SUC: SUC is a balanced corpus of written Swedish, modeled after the Brown
Corpus and similar corpora for English, developed at Stockholm University
and Umeå University in a project led by Gunnel Källgren and Eva Ejerhed.
The corpus consists of 1.2 million tokens of text from a variety of
genres. The corpus encoding follows the guidelines of the
Text Encoding Initiative (TEI), and the annotation includes lemmatization,
parts of speech, morphological features, and named entities.
In order to merge and harmonize these two corpora into the
Swedish Treebank, we have adopted the following overall strategy:
- Harmonize tokenization and sentence segmentation by making Talbanken
conform to the principles of SUC.
- Replace the lexical annotation layer in Talbanken with a morphological
annotation according to the SUC guidelines.
- Convert the syntactic annotation layer in Talbanken to a more modern format
and annotate SUC according to the (converted) Talbanken guidelines.
The overall guiding principle has been to modify SUC as little as possible (given
that it is the larger corpus and also a de facto standard for Swedish) and to make
Talbanken conform to SUC instead of the other way round. The one exception is
the syntactic annotation layer, which is missing in SUC and therefore follows Talbanken.
Version 1.0 of the Swedish Treebank includes all of SUC but only the professional
prose section of Talbanken. The annotation is limited to morphology (parts of speech +
morphological features) and syntax (phrase structure + grammatical functions).
The status of harmonization and manual revision is as follows:
- Tokenization and sentence segmentation in Talbanken have been modified to fit the principles of SUC.
- Morphological annotation has been manually checked and revised in both Talbanken and SUC.
- Syntactic annotation has been partially checked in Talbanken (after automatic conversion from
the old format) but not in SUC, where the syntactic annotation has been produced automatically by
a parser trained on the converted Talbanken.
In the following three sections, we give a brief description of the guidelines for tokenization
and sentence segmentation, morphological annotation, and syntactic annotation, respectively.
Tokenization and Sentence Segmentation
Tokenization follows the principles of SUC. Words separated by
whitespace or punctuation in the original text are considered separate tokens, as are
punctuation marks. An exception is made for abbreviations containing punctuation and/or
whitespace, which are kept together as one token with whitespace replaced by an underscore,
e.g., "t.ex." and "t_ex".
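The tokenization rule above can be illustrated with a minimal Python sketch. It is not the tokenizer used for SUC or the Swedish Treebank: the function name tokenize and the short abbreviation list are assumptions made for illustration, and the real guidelines cover many more cases than this sketch handles.

```python
import re

# Hypothetical abbreviation list for illustration only; the inventory
# actually used in SUC is much larger.
ABBREVIATIONS = ["t.ex.", "t ex", "bl.a.", "bl a"]

def tokenize(text, abbreviations=ABBREVIATIONS):
    """Split text into word and punctuation tokens, keeping known
    abbreviations together with internal whitespace replaced by '_'."""
    # Try longer abbreviations first so that they win over shorter ones.
    alternatives = sorted(abbreviations, key=len, reverse=True)
    abbr = "|".join(re.escape(a) for a in alternatives)
    # 1) a known abbreviation that is not part of a longer word,
    # 2) a run of word characters, or
    # 3) a single punctuation character.
    pattern = re.compile(r"(?<!\w)(?:" + abbr + r")(?!\w)|\w+|[^\w\s]")
    return [re.sub(r"\s+", "_", m.group(0)) for m in pattern.finditer(text)]

print(tokenize("Vi äter frukt, t ex äpplen."))
```

In this sketch the abbreviation written with a space comes out as the single token t_ex, while the comma and the final period become separate tokens.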
Sentences are segmented according to the principles of SUC, where a sentence is treated as
the longest sequence of tokens between two major delimiters, defined as one of the punctuation
marks ".", "?", "!" and ":", or combinations thereof. In addition, list items are treated
as separate sentences.
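As a rough illustration of the segmentation principle, the following Python sketch groups a list of tokens into sentences at runs of major delimiters. The function name split_sentences is hypothetical, the input is assumed to be already tokenized, and the treatment of list items is left out.

```python
# The four major delimiters named in the text above.
MAJOR_DELIMITERS = {".", "?", "!", ":"}

def split_sentences(tokens):
    """Group tokens into sentences, ending each sentence after a maximal
    run of major delimiter tokens."""
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        # Close the sentence only at the end of a delimiter run, so that
        # combinations such as "?" followed by "!" stay in one sentence.
        if token in MAJOR_DELIMITERS and next_token not in MAJOR_DELIMITERS:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without a final delimiter
        sentences.append(current)
    return sentences

print(split_sentences(["Vad", "?", "!", "Ja", ".", "Nej"]))
```

Because abbreviations such as "t.ex." are single tokens, their internal periods do not trigger a sentence break in this sketch.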
Morphological Annotation
The morphological annotation consists of part-of-speech categories and morphological features, following
the principles of SUC. Guidelines for these categories can be found in [6], except for
the PL category (verb particle), which was not part of the original system but is used in both releases
of the corpus. In this case, we have relied on the actual annotation in SUC 2.0 and on internal documentation
from the SUC project.
For further information about the morphological annotation, we refer to [6,7].
Syntactic Annotation
The syntactic annotation of each sentence takes the form of a constituent structure, where
constituents are labeled with structural categories (phrase types),
while edges connecting constituents are labeled with functional categories
(grammatical functions) indicating the role of the lower constituent within
the higher. The set of structural categories used is a small set of conventional
phrase types, such as S for sentence/clause, NP for noun phrase, VP for verb
phrase, etc. The set of functional categories is inherited from the MAMBA
annotation scheme with a small extension for structures that were not annotated
in the original version of Talbanken.
For a detailed description of the functional categories inherited from MAMBA,
we refer to [5].
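One way to picture this kind of annotation is as a tree in which nodes carry structural categories and edges carry functional categories. The Python sketch below is only an illustration under our own assumptions: the class names, the example sentence, and the edge labels SS and HD are placeholders and should be checked against the actual guidelines.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

@dataclass
class Token:
    word: str
    pos: str  # part-of-speech tag

@dataclass
class Constituent:
    category: str  # structural category, e.g. "S", "NP", "VP"
    # Each child is paired with the functional label of the edge that
    # connects it to this constituent.
    children: List[Tuple[str, Union["Constituent", Token]]] = field(default_factory=list)

def words(node):
    """Return the words covered by a node, in left-to-right order."""
    if isinstance(node, Token):
        return [node.word]
    return [w for _, child in node.children for w in words(child)]

# Illustrative analysis of "Hon sover" ("She sleeps"); labels are placeholders.
tree = Constituent("S", [
    ("SS", Constituent("NP", [("HD", Token("Hon", "PN"))])),
    ("HD", Token("sover", "VB")),
])
print(words(tree))
```

Storing the functional label on the edge rather than on the node mirrors the description above, where the label indicates the role of the lower constituent within the higher one.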
Encoding Format
Version 1.0 of the Swedish Treebank is encoded in TIGER-XML and is limited to two layers of annotation, the morphological and
syntactic layers, as described above. Future releases of the treebank are likely to use standoff annotation
and include additional annotation layers from the source treebanks. One advantage of the TIGER-XML format
is that it supports easy browsing using TIGERSearch, a GUI-based tool with advanced search facilities.
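For readers who prefer scripting to a GUI, a TIGER-XML file can also be read with a general XML library. The Python sketch below assumes the usual TIGER-XML layout (s, graph, terminals, t, nonterminals, nt, and edge elements); the sample sentence, the identifiers, the feature attributes word and pos, and the edge labels are assumptions for illustration and may differ from the actual release.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the general TIGER-XML layout; not taken from the treebank.
SAMPLE = """<corpus id="sample">
  <body>
    <s id="s1">
      <graph root="s1_500">
        <terminals>
          <t id="s1_1" word="Hon" pos="PN"/>
          <t id="s1_2" word="sover" pos="VB"/>
        </terminals>
        <nonterminals>
          <nt id="s1_500" cat="S">
            <edge label="SS" idref="s1_1"/>
            <edge label="HD" idref="s1_2"/>
          </nt>
        </nonterminals>
      </graph>
    </s>
  </body>
</corpus>"""

def read_sentences(xml_text):
    """Yield (sentence id, terminals, edges) for each <s> element."""
    root = ET.fromstring(xml_text)
    for s in root.iter("s"):
        # Terminals carry the morphological layer (here: word form and tag).
        terminals = [(t.get("word"), t.get("pos")) for t in s.iter("t")]
        # Each edge links a nonterminal (by category) to a child node id.
        edges = [
            (nt.get("cat"), edge.get("label"), edge.get("idref"))
            for nt in s.iter("nt")
            for edge in nt.findall("edge")
        ]
        yield s.get("id"), terminals, edges

for sentence in read_sentences(SAMPLE):
    print(sentence)
```

For a full corpus file, an incremental parser such as ET.iterparse would likely be a better fit than loading the whole document into memory.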
Acknowledgments
We gratefully acknowledge the work done by the original creators of Talbanken at Lund University
[1, 2, 5, 8, 9, 10, 11] and of SUC at Stockholm University and Umeå University [3, 4, 6, 7],
without which the Swedish Treebank
would not exist. The work on synthesizing the treebank has been carried out by
Joakim Nivre, Beáta Megyesi, Sofia Gustafson-Capková, Filip Salomonsson, Bengt Dahlqvist,
and Anna Sågvall Hein at Uppsala University and by Johan Hall and Jens Nilsson at Växjö
University. Finally, we want to thank Lars Borin and his team at Språkbanken for their help in
distributing the Swedish Treebank.
References
- [1] Einarsson, Jan. 1976. Talbankens skriftspråkskonkordans. Lund University:
Department of Scandinavian Languages.
- [2] Einarsson, Jan. 1976. Talbankens talspråkskonkordans. Lund University:
Department of Scandinavian Languages.
- [3] Stockholm-Umeå Corpus SUC 1.0. 1996. Stockholm University:
Department of Linguistics, and University of Umeå: Department of Linguistics.
- [4] Stockholm-Umeå Corpus SUC 2.0. 2006. Stockholm University:
Department of Linguistics.
- [5] Teleman, Ulf. 1974. Manual för grammatisk beskrivning av talad och
skriven svenska. Studentlitteratur.
- [6] Ejerhed, Eva; Källgren, Gunnel; Wennstedt, Ola and Åström, Magnus. 1992. The Linguistic Annotation System of the Stockholm-Umeå Corpus Project. Report No 33. University of Umeå: Department of Linguistics.
- [7] Källgren, Gunnel. 2006. Documentation of the Stockholm-Umeå Corpus. In: Manual of the Stockholm Umeå Corpus version 2.0. Sofia Gustafson-Capková and Britt Hartmann (eds). Stockholm University: Department of Linguistics.
- [8] Westman, Margareta. 1974. Bruksprosa. Liber.
- [9] Jörgensen, Nils. 1976. Meningsbyggnaden i talad svenska. Studentlitteratur.
- [10] Hultman, Tor G. and Westman, Margareta. 1977. Gymnasistsvenska. Liber.
- [11] Einarsson, Jan. 1978. Talad och skriven svenska. Lund University:
Department of Scandinavian Languages.