We provide a model that enables lemmatization of Swedish text following the SUC3 standard. Note that SUC3 lemmatization does not exactly match the SALDO standard that is used in our Korp resources.
SUC3 was randomly split into training, validation and test sets (80:10:10). The model was trained for 30 epochs using the default Stanza settings. The accuracy on the test set is 99.18%.
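A random 80:10:10 split of a CoNLL-U corpus can be sketched as follows. This is only an illustration and not the exact procedure used for the released model: we assume here that the split is made at the sentence level, and the function names, the seed and the rounding behaviour are all our own choices.

```python
import random


def read_conllu_sentences(text):
    """Split CoNLL-U text into sentence blocks (sentences are blank-line separated)."""
    blocks = [block.strip("\n") for block in text.split("\n\n")]
    return [block for block in blocks if block.strip()]


def split_sentences(sentences, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle the sentences and cut them into train/dev/test parts.

    The seed is a hypothetical default; fixing it only makes the split reproducible.
    The test part receives whatever is left after train and dev are cut off.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_dev = int(len(shuffled) * ratios[1])
    return {
        "train": shuffled[:n_train],
        "dev": shuffled[n_train:n_train + n_dev],
        "test": shuffled[n_train + n_dev:],
    }
```

Each part can then be written out by joining its sentence blocks with blank lines and ending the file with a blank line.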
Lemmatizing and training
Clone Stanza and install the necessary dependencies. We improved some of the shell scripts that are used to launch Stanza, and we strongly recommend that you download them from here and put them in stanza/scripts (replacing the original scripts if necessary).
Stanza was originally designed for parsing UD treebanks, and it assumes that corpus names follow the UD conventions (even if the corpora do not follow the UD annotation scheme). For this reason, your files have to be placed in the folder stanza/corpora/UD_Language-Treebank, where Language is the language name and Treebank is the treebank name (e.g. UD_Swedish-Suc). The files have to be named lang_treebank-ud-set.conllu, where lang is the two-letter language code (sv), treebank is the lowercased treebank name (suc), and set is train, dev or test (e.g. sv_suc-ud-train.conllu).
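The naming convention can be summarised in a small helper that builds the expected file path. This is a sketch for checking your own file names; corpus_path is a hypothetical function and not part of Stanza, and the lowercasing of the treebank name is inferred from the example above.

```python
from pathlib import Path


def corpus_path(language, treebank, lang_code, subset, root="stanza/corpora"):
    """Build the path where a CoNLL-U file is expected, following the UD naming scheme."""
    if subset not in ("train", "dev", "test"):
        raise ValueError("subset must be 'train', 'dev' or 'test'")
    folder = f"UD_{language}-{treebank}"
    # Assumption: the treebank name is lowercased in the file name (Suc -> suc).
    filename = f"{lang_code}_{treebank.lower()}-ud-{subset}.conllu"
    return Path(root) / folder / filename
```

For example, corpus_path("Swedish", "Suc", "sv", "train") gives stanza/corpora/UD_Swedish-Suc/sv_suc-ud-train.conllu.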
Use a Linux-like environment. A GPU is strongly recommended.
Unzip the model and place the .pt file in stanza/saved_models/lemma. To lemmatize a test set using the pretrained model, run
bash scripts/lemma.sh UD_Swedish-Suc
The output file will be created in the stanza/corpora folder.
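To sanity-check the output, the predicted lemmas can be compared with the gold lemmas token by token. The sketch below assumes that both files are in CoNLL-U format with identical tokenization, and relies only on the fact that LEMMA is the third tab-separated column. It is not the scorer that Stanza itself uses, so the numbers may differ slightly; the function names are our own.

```python
def lemmas(conllu_text):
    """Return the LEMMA column (third field) of every regular token line."""
    result = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        fields = line.split("\t")
        # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in fields[0] or "." in fields[0]:
            continue
        result.append(fields[2])
    return result


def lemma_accuracy(gold_text, pred_text):
    """Percentage of tokens whose predicted lemma equals the gold lemma."""
    gold, pred = lemmas(gold_text), lemmas(pred_text)
    if not gold:
        raise ValueError("no tokens found in the gold file")
    if len(gold) != len(pred):
        raise ValueError("gold and predicted files have different token counts")
    correct = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * correct / len(gold)
```

The two arguments are the contents of the gold test file and of the output file, read as UTF-8 text.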
Training your own models
To train your own model, run
bash scripts/run_lemma.sh UD_Swedish-Suc gold