Models
The most accurate results for Swedish POS-tagging can be obtained with Flair. Stanza, however, yields very similar results and has the advantage of being a pipeline that is also capable of dependency parsing. We provide two POS-tagging models for Stanza.
stanza_eval is trained on SUC3 with Talbanken_SBX_dev as the dev set. The advantage of this model is that it can be evaluated on Talbanken_SBX_test or SIC2. The evaluation results are reported in the table below.
Test set | Exact match | POS | MSD |
---|---|---|---|
Talbanken_SBX_test | 0.973 | 0.983 | 0.988 |
SIC2 | 0.918 | 0.932 | 0.957 |
Read more about the evaluation here.
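As a rough illustration of how three scores of this kind can be computed, here is a minimal sketch. It assumes that each token carries a (POS, MSD) pair, that "POS" and "MSD" are per-token accuracies on each part separately, and that "Exact match" requires both parts to be correct. These definitions are an assumption, not taken from the evaluation description linked above, which remains authoritative. The function name `tagging_scores` is hypothetical.

```python
from typing import Dict, List, Tuple

Tag = Tuple[str, str]  # (POS, MSD), e.g. ("NN", "UTR.SIN.IND.NOM")


def tagging_scores(gold: List[Tag], predicted: List[Tag]) -> Dict[str, float]:
    """Per-token accuracy for POS, MSD and the full tag (assumed definitions)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty token list")

    n = len(gold)
    pairs = list(zip(gold, predicted))
    pos_ok = sum(g[0] == p[0] for g, p in pairs)   # POS part matches
    msd_ok = sum(g[1] == p[1] for g, p in pairs)   # MSD part matches
    exact_ok = sum(g == p for g, p in pairs)       # both parts match
    return {
        "exact_match": exact_ok / n,
        "pos": pos_ok / n,
        "msd": msd_ok / n,
    }


if __name__ == "__main__":
    gold = [("NN", "UTR.SIN.IND.NOM"), ("VB", "PRS.AKT"), ("PP", "")]
    pred = [("NN", "UTR.SIN.DEF.NOM"), ("VB", "PRS.AKT"), ("AB", "")]
    print(tagging_scores(gold, pred))
```

Under these assumed definitions the exact-match score can never exceed either of the other two, which is consistent with the ordering of the figures in the table.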
stanza_full is trained on SUC3 + Talbanken_SBX_test + SIC2 with Talbanken_SBX_dev as the dev set. Since both test sets are used for training, we cannot evaluate the performance of this model, but we expect it to perform better than stanza_eval, or at least no worse.
Tagging and training
Clone Stanza and install the necessary dependencies. We improved some of the shell scripts that are used to launch Stanza, and we strongly recommend that you download them from here and put them in stanza/scripts (replacing the original scripts if necessary).
Stanza was created primarily for working with UD treebanks, and it assumes that corpus names follow the UD conventions (even if the corpora do not follow the UD annotation scheme). For this reason, your files have to be placed in the folder stanza/corpora/UD_Language-Treebank, where Language is the language name and Treebank is the treebank name (e.g. UD_Swedish-Talbanken). The files have to be named lang_treebank-ud-set.conllu, where lang is the two-letter language code (sv), treebank is the treebank name in lowercase, and set is train, dev or test (e.g. sv_talbanken-ud-train.conllu).
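The naming convention can be sketched as a small helper that builds the expected path for a given split. This is only an illustration: the function name `corpus_file` and its parameters are hypothetical, and the lowercasing of the treebank name is inferred from the example sv_talbanken-ud-train.conllu.

```python
from pathlib import Path

VALID_SPLITS = ("train", "dev", "test")


def corpus_file(language: str, treebank: str, lang_code: str, split: str,
                root: str = "stanza/corpora") -> Path:
    """Build the path where Stanza is assumed to look for a CoNLL-U file."""
    if split not in VALID_SPLITS:
        raise ValueError(f"split must be one of {VALID_SPLITS}, got {split!r}")
    folder = f"UD_{language}-{treebank}"
    # The treebank name is lowercased in the file name (inferred from the example).
    filename = f"{lang_code}_{treebank.lower()}-ud-{split}.conllu"
    return Path(root) / folder / filename


if __name__ == "__main__":
    for split in VALID_SPLITS:
        print(corpus_file("Swedish", "Talbanken", "sv", split).as_posix())
```

Checking that each of these paths exists before launching a script is a cheap way to catch naming mistakes early.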
Use a Linux-like environment. A GPU is strongly recommended.
Tagging
Unzip the model you want to use and the "pretrain" file (which contains word2vec embeddings encoded in the format required by Stanza). Place the two .pt files in stanza/saved_models/pos. Run bash scripts/pos.sh UD_Swedish-Talbanken to tag a test set using the pretrained model. The output file will be created in the stanza/corpora folder. If you use a treebank name other than UD_Swedish-Talbanken, you will have to rename the model files accordingly.
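The file placement described above could be automated along the following lines. This is a sketch under assumptions: the helper names `install_models` and `tagging_command` are hypothetical, the unzipped .pt files are assumed to sit in a single folder, and the script is assumed to be launched from the stanza directory. It does not rename model files, so it only covers the UD_Swedish-Talbanken case as is.

```python
import shutil
from pathlib import Path
from typing import List


def install_models(unzipped_dir, stanza_root) -> List[Path]:
    """Copy every .pt file from unzipped_dir into <stanza_root>/saved_models/pos."""
    target = Path(stanza_root) / "saved_models" / "pos"
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for pt_file in sorted(Path(unzipped_dir).glob("*.pt")):
        destination = target / pt_file.name
        shutil.copy2(pt_file, destination)
        copied.append(destination)
    return copied


def tagging_command(treebank: str = "UD_Swedish-Talbanken") -> List[str]:
    """Return the tagging command from this section as an argument list."""
    return ["bash", "scripts/pos.sh", treebank]


# Possible usage (assumes the current directory contains the cloned stanza folder):
#   import subprocess
#   install_models("unzipped_models", "stanza")
#   subprocess.run(tagging_command(), cwd="stanza", check=True)
```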
Training your own models
Run bash scripts/run_pos.sh UD_Swedish-Talbanken. The instructions for using pretrained embeddings are provided here.