Språkbanken Text is a part of Språkbanken.

Lemmatization model: Stanza

Citation Information

Språkbanken Text (2020). Lemmatization model: Stanza (updated: 2020-11-19). [Data set]. Språkbanken Text. https://doi.org/10.23695/681b-be74
Pretrained model for lemmatization.

Models

We provide a model that enables lemmatization of Swedish text following the SUC3 standard. Note that SUC3 lemmatization does not exactly match the SALDO standard that is used in our Korp resources.

SUC3 was randomly split into training, validation and test sets (80:10:10). The model was trained for 30 epochs using the default Stanza settings. Its accuracy on the test set is 99.18%.
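A random 80:10:10 split of this kind can be sketched as follows. This is only an illustration: the page does not say whether the split was made over sentences or larger units, nor which random seed was used, so the function name split_sentences, the sentence-level unit and the seed parameter are all assumptions.

```python
import random


def split_sentences(sentences, seed=0):
    """Shuffle the items and split them 80:10:10 into train, dev and test.

    The seed and the choice of sentences as the unit are assumptions,
    not details taken from the original experiment.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)

    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10

    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # whatever remains, roughly 10%
    return train, dev, test
```

Fixing the seed makes the split reproducible, which matters if the reported accuracy is to be compared with a retrained model.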

Lemmatizing and training

Clone Stanza and install the necessary dependencies. We improved some of the shell scripts that are used to launch Stanza, and we strongly recommend that you download them from here and put them in stanza/scripts (replacing the original scripts if necessary).

Stanza was originally created for parsing UD treebanks, and it assumes that corpus names follow the UD conventions (even if the corpora do not follow the UD annotation scheme). For this reason, your files have to be placed in the folder stanza/corpora/UD_Language-Treebank, where Language is the language name and Treebank is the treebank name (e.g. UD_Swedish-Suc). The files have to be named lang_treebank-ud-set.conllu, where lang is a two-letter language code (sv), treebank is the lower-case treebank name (suc), and set is train, dev or test (e.g. sv_suc-ud-train.conllu).

Use a Linux-like environment. A GPU is strongly recommended.
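The naming convention above can be made concrete with a small helper that builds the expected path for one file. This is a minimal sketch: the function corpus_file and its parameters are hypothetical and are not part of Stanza, and lower-casing the treebank name is inferred from the sv_suc example.

```python
from pathlib import Path


def corpus_file(stanza_root, language, treebank, lang_code, split):
    """Build the path where the launch scripts expect a CoNLL-U file.

    Hypothetical helper, not part of Stanza. The lower-casing of the
    treebank name is inferred from the example sv_suc-ud-train.conllu.
    """
    if split not in ("train", "dev", "test"):
        raise ValueError(f"split must be train, dev or test, got {split!r}")
    folder = f"UD_{language}-{treebank}"
    name = f"{lang_code}_{treebank.lower()}-ud-{split}.conllu"
    return Path(stanza_root) / "corpora" / folder / name


# Example: the Swedish SUC training file.
print(corpus_file("stanza", "Swedish", "Suc", "sv", "train").as_posix())
```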

Lemmatizing

Unzip the model and place the .pt file in stanza/saved_models/lemma. Run bash scripts/lemma.sh UD_Swedish-Suc to lemmatize a test set using a pretrained model. The output file will be created in the stanza/corpora folder.
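The unzip-and-place step can also be scripted. The sketch below uses only the Python standard library; the function install_lemma_model is a hypothetical helper, and the name of the .pt file inside the archive is not stated on this page, so the code simply copies every .pt file it finds.

```python
import zipfile
from pathlib import Path


def install_lemma_model(zip_path, stanza_root):
    """Extract every .pt file from the archive into saved_models/lemma.

    Hypothetical helper. Files are written flat into the target folder,
    ignoring any directory structure inside the archive.
    """
    target = Path(stanza_root) / "saved_models" / "lemma"
    target.mkdir(parents=True, exist_ok=True)

    installed = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if member.endswith(".pt"):
                destination = target / Path(member).name
                destination.write_bytes(archive.read(member))
                installed.append(destination)
    return installed
```

A call such as install_lemma_model("lem_stanza.zip", "stanza") would then be followed by the bash scripts/lemma.sh command above. Whether the script expects a particular model file name is not covered here.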

Training your own models

Run bash scripts/run_lemma.sh UD_Swedish-Suc gold.
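Before launching a long training run, it may be worth checking that the data files are in place. The following is a minimal sketch of such a check; the function missing_training_files is hypothetical, and the assumption that the script needs all three of the train, dev and test files is ours rather than something stated on this page.

```python
from pathlib import Path


def missing_training_files(corpus_dir, lang_code="sv", treebank="suc"):
    """Return the names of expected CoNLL-U files that are absent or empty.

    Hypothetical pre-flight check. It assumes training needs the train,
    dev and test files, named as lang_treebank-ud-set.conllu.
    """
    missing = []
    for split in ("train", "dev", "test"):
        path = Path(corpus_dir) / f"{lang_code}_{treebank}-ud-{split}.conllu"
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(path.name)
    return missing
```

An empty list from missing_training_files("stanza/corpora/UD_Swedish-Suc") means all three files were found and are non-empty; it says nothing about whether their contents are valid CoNLL-U.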

Download

File                  Size     Modified    Licence
lem_stanza.zip (zip)  3.74 MB  2020-11-19  CC BY 4.0 (attribution)

Type

  • Model

Language

Swedish

Size

Updated

2020-11-19

Contact

Språkbanken
sb-info@svenska.gu.se