The five lives of Talbanken

Posted by Aleksandrs (Sasha) Berdicevskis 2020-06-09

This post is about Talbanken, one of the most important and widely used Swedish corpora. There are at least five versions of this treebank, and the name "Talbanken" alone does not say which one is meant, which sometimes leads to confusion. The purpose of this post is to reduce that ambiguity: I am going to list the five versions, explain the basic differences between them and suggest unambiguous version names.

1. Talbanken76: this is the original version, constructed at Lund University in the 1970s by Jan Einarsson et al. (and stored initially on punch cards). The treebank had around 300K tokens and was annotated in a format called Mamba, an eclectic combination of phrase-structure grammar, Diderichsen's sentence schema and a traditional dependency-like approach.

2. Talbanken05: this is a modernization of the original version, created around 2005 at Växjö University by Joakim Nivre, Jens Nilsson and Johan Hall. It is available in four different formats: the original Mamba (with some corrections, all of which are documented), two phrase-structure formats and a dependency format. The newest subversion is 1.1. Talbanken05 can be downloaded from here: https://jnivre.github.io/files/talbanken05/index.html.

Note that the two versions listed above (76 and 05) consist of four sections: professional prose (P), high-school students' essays (G), interviews (IB) and conversation and debates (SD). The three versions listed below (STB, SBX, UD) consist of only one section: P.

3. TalbankenSTB, where STB stands for The Swedish Treebank: this version, created by Joakim Nivre et al. at Uppsala University, is a subcorpus of Talbanken05. It contains the professional prose (P) section, around 95K tokens. It is annotated using a dependency format that I label here Mamba-Dep, a conversion of the original Mamba. In addition, tokenization and sentence segmentation have been modified to fit the principles of the Stockholm-Umeå Corpus (SUC), and morphological annotation in the SUC format (the de facto standard format for Swedish) has been added. TalbankenSTB is now officially distributed by us, Språkbanken Text. It is also available here: https://cl.lingfil.uu.se/~nivre/swedish_treebank/. The newest subversion is 1.2. There is a train-test split.

4. TalbankenSBX, where SBX stands for Språkbanken Text: this version was initially a copy of TalbankenSTB. The difference is that we preserve TalbankenSTB exactly as it was when Språkbanken received it (for reproducibility's sake), while in TalbankenSBX we make changes (correct errors etc.) when we deem them necessary. The newest version of TalbankenSBX can be downloaded here: https://spraakbanken.gu.se/resurser/talbanken. There is currently no version numbering, but we will log the changes. This is also the version that is included in our search engine Korp. There are two splits of this corpus:

MorphSplit, created for the purposes of POS tagging, where we split TalbankenSBX into dev and test sets (the training set is SUC3).
SyntSplit, created for the purposes of dependency parsing, where we split TalbankenSBX into train, dev and test. Train is the same as train in TalbankenSTB, whereas dev and test approximate dev and test in the UD version (see below) as much as possible.

5. TalbankenUD, where UD stands for Universal Dependencies: this version was created by Joakim Nivre et al. at Uppsala University by converting the TalbankenSTB subcorpus into the UD format. Many errors and omissions were manually corrected. Note that TalbankenUD is not fully isomorphic to STB and SBX: the sentence segmentation is different, and so is the number of tokens (since some of the tokens omitted in STB/SBX were restored). The UD version is constantly being updated. All the subversions can be downloaded either via the UD main page (https://universaldependencies.org/) or here: https://github.com/UniversalDependencies/UD_Swedish-Talbanken. There is a train-dev-test split.
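
Since the UD version differs from STB and SBX in sentence segmentation and token count, it can be useful to check the numbers for whichever release you have at hand. The following is a minimal sketch, not an official tool: it assumes the input is in the standard CoNLL-U format used by UD (tab-separated columns, "#" comment lines, a blank line between sentences), and the function name count_conllu and the toy sample are my own inventions for illustration, not data from the treebank.

```python
def count_conllu(lines):
    """Count sentences and syntactic words in an iterable of CoNLL-U lines."""
    sentences = 0
    words = 0
    in_sentence = False
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if one is open.
            if in_sentence:
                sentences += 1
                in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comment (sent_id, text, ...), not a token.
            continue
        in_sentence = True
        token_id = line.split("\t", 1)[0]
        # Plain integer IDs are syntactic words; multiword-token ranges
        # ("1-2") and empty nodes ("1.1") are not counted here.
        if token_id.isdigit():
            words += 1
    if in_sentence:
        # The input did not end with a blank line.
        sentences += 1
    return sentences, words


def row(token_id, form):
    """Build a toy 10-column CoNLL-U row (hypothetical helper for the demo)."""
    return "\t".join([token_id, form] + ["_"] * 8)


# Toy sample, invented for illustration only.
SAMPLE = "\n".join([
    "# sent_id = toy-1",
    row("1", "Hej"),
    row("2", "!"),
    "",
    "# sent_id = toy-2",
    row("1", "Tack"),
    row("2", "."),
    "",
])

if __name__ == "__main__":
    n_sentences, n_words = count_conllu(SAMPLE.splitlines())
    print(n_sentences, "sentences,", n_words, "words")
```

To run it on a real file, pass an open file object (opened with encoding="utf-8") to count_conllu instead of the toy sample. Note that this counts syntactic words only, so the result may differ from token counts that include multiword-token ranges.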
