Språkbanken Text is a department within Språkbanken.

Dependency parsing model: Stanza

Data citation

Språkbanken Text (2020). Dependensparsningsmodell: Stanza (updated: 2020-12-09). [Data set]. Språkbanken Text. https://doi.org/10.23695/wh3y-2y24
Pretrained models for dependency parsing.

Models

We provide two models that enable dependency parsing of Swedish (in the Mamba-Dep format, the format of TalbankenSBX).

stanza_eval is trained on Talbanken_SBX_train with Talbanken_SBX_dev as the dev set and evaluated on Talbanken_SBX_test. The evaluation results are reported in the table below. The LAS (labeled attachment score, when trained with gold POS and MSD tags) is 84.48. We used Word2Vec embeddings trained on the CoNLL17 corpus (Word2Vec embeddings trained on a Göteborgs-Posten corpus yield a very similar result of 84.43; see more about embeddings here).
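To make the reported metric concrete, here is a minimal sketch of how a labeled attachment score can be computed from aligned gold and predicted arcs. It is not the evaluation script used for the figures above; the function name, the toy sentence and its labels are invented for illustration, and official evaluation tools may differ in details such as punctuation handling.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency relation) for one token


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over aligned tokens.

    UAS counts tokens whose head is correct; LAS additionally
    requires the dependency relation to match.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty token list")
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_arcs = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * correct_heads / total, 100.0 * correct_arcs / total


# Toy example (illustrative labels): 4 tokens, one wrong label, one wrong head.
gold = [(2, "SS"), (0, "ROOT"), (4, "DT"), (2, "OO")]
pred = [(2, "SS"), (0, "ROOT"), (4, "AT"), (3, "OO")]
uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")
```

In the toy example three of four heads are correct and two of four arcs are fully correct, so LAS is always less than or equal to UAS.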

stanza_full is trained on Talbanken_SBX_train + Talbanken_SBX_dev with Talbanken_SBX_test as the dev set. Because no held-out test data remains, we cannot evaluate the performance of this model, but we expect it to perform at least as well as stanza_eval.

Parsing and training

Clone Stanza and install the necessary dependencies. We improved some of the shell scripts that are used to launch Stanza, and we strongly recommend that you download them from here and put them in stanza/scripts (replacing the original scripts if necessary).

Stanza was originally created for parsing UD treebanks, and it assumes that corpus names follow the UD conventions (even if the corpora do not follow the UD annotation scheme). For this reason, your files have to be placed in the folder stanza/corpora/UD_Language-Treebank, where Language is the language name and Treebank is the treebank name (e.g. UD_Swedish-Talbanken). The files have to be named lang_treebank-ud-set.conllu, where lang is a two-letter language code (e.g. sv), treebank is the treebank name, and set is train, dev or test (e.g. sv_talbanken-ud-train.conllu).

Use a Linux-like environment. A GPU is strongly recommended.
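The folder and file naming convention described above can be sketched as a small helper that builds the expected paths. The function name corpus_files and its parameters are hypothetical, and lowercasing the treebank name in the file name is an assumption inferred from the sv_talbanken example.

```python
from pathlib import PurePosixPath
from typing import Dict

SPLITS = ("train", "dev", "test")


def corpus_files(language: str, treebank: str, lang_code: str,
                 corpora_root: str = "stanza/corpora") -> Dict[str, str]:
    """Map each split name to the path where its .conllu file is expected."""
    folder = PurePosixPath(corpora_root) / f"UD_{language}-{treebank}"
    # Assumption: the treebank part of the file name is lowercased,
    # as in the example sv_talbanken-ud-train.conllu.
    prefix = f"{lang_code}_{treebank.lower()}"
    return {split: str(folder / f"{prefix}-ud-{split}.conllu") for split in SPLITS}


for split, path in corpus_files("Swedish", "Talbanken", "sv").items():
    print(f"{split}: {path}")
```

Printing the paths before copying your files is a cheap way to catch a misnamed folder or split.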

Parsing

Unzip the model you want to use and the "pretrain" file (which contains word2vec embeddings encoded in a format required by Stanza). Place the two .pt files in stanza/saved_models/depparse. Run bash scripts/parse.sh UD_Swedish-Talbanken to parse a test set using a pretrained model. The output file will be created in the stanza/corpora folder. If you use a treebank name other than UD_Swedish-Talbanken, you will have to rename the model files. The script assumes that the POS tags are already present in the test set.
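Since the parsing script expects POS tags to be present already, it can help to check the test file before parsing. The following is a minimal sketch that assumes standard tab-separated CoNLL-U with the POS tag in the fourth column (UPOS); the function name tokens_missing_pos is hypothetical, and whether the other tag columns (XPOS, FEATS) also need to be filled is not checked here.

```python
import io
from typing import Iterable, List


def tokens_missing_pos(lines: Iterable[str]) -> List[str]:
    """Return 'sentence:token' IDs of word lines whose UPOS column is empty."""
    missing = []
    sentence = 0
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            in_sentence = False  # a blank line ends the current sentence
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        if not in_sentence:
            sentence += 1
            in_sentence = True
        cols = line.split("\t")
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            continue  # multiword token ranges and empty nodes carry no POS tag
        if len(cols) < 4 or cols[3] in ("", "_"):
            missing.append(f"{sentence}:{token_id}")
    return missing


# Invented two-token sentence in which the second token has no POS tag.
sample = (
    "# text = Hon sover\n"
    "1\tHon\thon\tPRON\t_\t_\t2\tSS\t_\t_\n"
    "2\tsover\tsova\t_\t_\t_\t0\tROOT\t_\t_\n"
    "\n"
)
print(tokens_missing_pos(io.StringIO(sample)))
```

To check a real file, pass an open file handle instead of the in-memory sample, for example open(path, encoding="utf-8").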

Training your own models

Run bash scripts/run_depparse.sh UD_Swedish-Talbanken gold. Replace gold with predicted if you want to use predicted POS tags instead of the gold ones. This command assumes that a pretrained part-of-speech model is available; you can find one here. The instructions for using pretrained embeddings are provided here.
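If you launch training from a Python driver rather than by hand, the command above can be assembled and validated first. This is a sketch only: depparse_command is a hypothetical helper, the check on the treebank name simply mirrors the UD_Language-Treebank convention, and running the script from the root of the Stanza checkout is an assumption based on the relative path scripts/run_depparse.sh.

```python
import shlex
from typing import List

TAG_TYPES = ("gold", "predicted")


def depparse_command(treebank: str, tag_type: str = "gold",
                     script: str = "scripts/run_depparse.sh") -> List[str]:
    """Build the argument list for the training script described above."""
    if tag_type not in TAG_TYPES:
        raise ValueError(f"tag_type must be one of {TAG_TYPES}, got {tag_type!r}")
    if not treebank.startswith("UD_") or "-" not in treebank:
        raise ValueError("treebank must look like UD_Language-Treebank")
    return ["bash", script, treebank, tag_type]


cmd = depparse_command("UD_Swedish-Talbanken", "predicted")
print(shlex.join(cmd))

# To actually start training (assumes the current directory is the Stanza checkout):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting problems and rejects a mistyped tag type before a long training run starts.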

File                     Size         Modified     License
synt_stanza_eval.zip     99.05 MB     2020-12-09   CC BY 4.0
synt_stanza_full2.zip    99.17 MB     2020-12-09   CC BY 4.0
stanza_pretrain.zip      105.77 MB    2020-12-09   CC BY 4.0

Type

  • Model

Language

Swedish

Size

Updated

2020-12-09

Contact

Språkbanken
sb-info@svenska.gu.se