Språkbanken Text is a division of Språkbanken.

Part-of-speech tagging model: Marmot

Data citation

Språkbanken Text. (2020-06-29). Ordklasstaggningsmodell: Marmot [Data set]. Språkbanken Text. https://doi.org/10.23695/aryw-nh78
Pretrained models for part-of-speech tagging.

Model

Flair gives the most accurate results for Swedish POS tagging. Marmot yields very similar results, but runs much faster and does not require a GPU. We provide two models for Marmot.

marmot_eval is trained on SUC3 and Talbanken_SBX_dev, using Saldo as the dictionary. The advantage of this model is that it can be evaluated on Talbanken_SBX_test or SIC2. The evaluation results are reported in the table below.

Test set            Exact match  POS    MSD
Talbanken_SBX_test  0.973        0.982  0.988
SIC2                0.921        0.934  0.958

Read more about the evaluation here.
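
As a rough illustration of how three scores like those in the table above could be computed, here is a minimal Python sketch. The page does not define the metrics, so this assumes one plausible reading: "POS" counts tokens whose POS tag is correct, "MSD" counts tokens whose morphological description is correct, and "Exact match" counts tokens where both are correct. The function name tagging_accuracy and the toy data are hypothetical and are not taken from the evaluation scripts.

```python
from typing import Dict, List, Tuple

# Each token is represented as a (POS, MSD) pair, e.g. ("NN", "UTR.SIN.IND.NOM").
Tag = Tuple[str, str]


def tagging_accuracy(gold: List[Tag], pred: List[Tag]) -> Dict[str, float]:
    """Return per-token accuracy for exact match, POS only, and MSD only.

    Assumption: "exact" means both POS and MSD agree with the gold tag.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    n = len(gold)
    pos_ok = sum(g[0] == p[0] for g, p in zip(gold, pred))
    msd_ok = sum(g[1] == p[1] for g, p in zip(gold, pred))
    exact_ok = sum(g == p for g, p in zip(gold, pred))
    return {"exact": exact_ok / n, "pos": pos_ok / n, "msd": msd_ok / n}


# Toy data, invented for illustration only (not from any test set).
gold = [("NN", "UTR.SIN.IND.NOM"), ("VB", "PRS.AKT"), ("MAD", "_"), ("PP", "_")]
pred = [("NN", "UTR.SIN.DEF.NOM"), ("VB", "PRS.AKT"), ("MAD", "_"), ("AB", "_")]
scores = tagging_accuracy(gold, pred)
```

Under this reading the exact-match score can never exceed either of the other two, which is consistent with the ordering of the numbers in the table.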

marmot_full is trained on SUC3 + Talbanken_SBX_test + Talbanken_SBX_dev + SIC2 (with Saldo as dictionary). We cannot evaluate the performance of this model, but we expect it to perform better than marmot_eval, or at least not worse.

Tagging and training

Download Marmot and the necessary dependencies. Download SALDO (converted to the necessary format) here. Download our scripts from this repository. The scripts use a tab-separated three-column format (col3): token, POS (without MSD), MSD. To produce it from CONLL(U), first install Ruby 1.9+ and run ruby conllu_to_tab.rb 2 n to convert the corpus to an intermediate two-column format (col2), where n is the number of the column you want to use (if you are converting our CONLLU files, use 4). Then run ruby convert_col2_to_marmot.rb to convert the resulting col2 file to Marmot's col3.
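
To make the target format concrete, the following Python sketch shows what a conversion from CONLL(U) to the three-column format might look like. It is not a reimplementation of the Ruby scripts, whose exact behaviour is not described here. It assumes that the chosen column holds a full tag such as NN.UTR.SIN.DEF.NOM, that the POS is the part before the first ".", that tokens without an MSD get the placeholder "_", and that sentences are separated by blank lines. The function name conllu_to_col3 and the sample sentence are hypothetical.

```python
from typing import Iterable, Iterator


def conllu_to_col3(
    lines: Iterable[str],
    tag_column: int = 4,      # 1-based column holding the full tag
    separator: str = ".",     # assumed separator between POS and MSD
    empty_msd: str = "_",     # assumed placeholder when a tag has no MSD
) -> Iterator[str]:
    """Yield 'token<TAB>POS<TAB>MSD' lines; blank lines mark sentence ends."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("#"):
            continue  # skip CoNLL-U comment lines
        if not line.strip():
            yield ""  # keep the sentence boundary
            continue
        fields = line.split("\t")
        if "-" in fields[0] or "." in fields[0]:
            continue  # skip multiword-token ranges and empty nodes
        form = fields[1]
        full_tag = fields[tag_column - 1]
        pos, _, msd = full_tag.partition(separator)
        yield "\t".join((form, pos, msd or empty_msd))


# Invented sample input with the full tag in column 4.
sample = [
    "# text = Hunden sover.\n",
    "1\tHunden\thund\tNN.UTR.SIN.DEF.NOM\n",
    "2\tsover\tsova\tVB.PRS.AKT\n",
    "3\t.\t.\tMAD\n",
    "\n",
]
col3_lines = list(conllu_to_col3(sample))
```

For real data, the provided Ruby scripts remain the reference; the sketch only illustrates the shape of the intermediate and final files.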

Tagging

Use java -cp marmot.jar marmot.morph.cmd.Annotator --model-file model_name.marmot --test-file form-index=0,test_corpus.col1 --pred-file output_name.conll to tag a corpus using a pretrained model. The output corpus will be in a CONLL format with a somewhat unusual column order; use convert_marmot_to_conllu.rb to convert it to standard CONLLU.
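
If tagging is driven from a script, the command above can be assembled programmatically. The Python sketch below only builds the argument list, using exactly the flags shown in the command; the helper name annotator_command and its default values are hypothetical conveniences, and actually running the tagger requires Java and marmot.jar to be available.

```python
from typing import List


def annotator_command(
    model_file: str,
    test_file: str,
    pred_file: str,
    jar: str = "marmot.jar",
    form_index: int = 0,
) -> List[str]:
    """Build the argument list for Marmot's Annotator as shown in the text."""
    return [
        "java", "-cp", jar, "marmot.morph.cmd.Annotator",
        "--model-file", model_file,
        "--test-file", f"form-index={form_index},{test_file}",
        "--pred-file", pred_file,
    ]


cmd = annotator_command("marmot_full.marmot", "test_corpus.col1", "output_name.conll")

# To run the tagger (needs Java and marmot.jar in the working directory):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when file names contain spaces.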

Training your own models

Run Marmot: java -Xmx5G -cp marmot.jar marmot.morph.cmd.Trainer -train-file form-index=0,tag-index=1,morph-index=2,corpus.col3 -tag-morph true -model-file model_name.marmot -subtag-separator "." -type-dict saldo_marmot.txt,indexes=[2,3]
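
Because the training command addresses the columns by index (form-index=0, tag-index=1, morph-index=2), a malformed line in the col3 file can cause trouble. A small sanity check before training may help; the Python sketch below is one way to do it. It assumes that blank lines are sentence boundaries and that every other line must contain exactly three non-empty tab-separated fields. The function name check_col3 and the sample lines are hypothetical.

```python
from typing import Iterable, List, Tuple


def check_col3(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, message) pairs for lines that are not valid col3."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # assumed sentence boundary
        fields = line.split("\t")
        if len(fields) != 3:
            problems.append(
                (number, f"expected 3 tab-separated fields, found {len(fields)}")
            )
        elif any(not field for field in fields):
            problems.append((number, "empty field"))
    return problems


# Invented sample: line 2 lacks the MSD column, line 4 has an empty POS.
sample = [
    "Hunden\tNN\tUTR.SIN.DEF.NOM\n",
    "sover\tVB\n",
    "\n",
    ".\t\t_\n",
]
problems = check_col3(sample)
```

An empty result means every token line has the three columns that the -train-file indices refer to.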

File                Size       Modified    License
marmot_eval.marmot  108.59 MB  2020-06-29  CC BY 4.0
marmot_full.marmot  113.41 MB  2020-06-29  CC BY 4.0
saldo_marmot.txt    46.33 MB   2020-06-29  CC BY 4.0

Type

  • Model

Language

Swedish

Size

Updated

2020-06-29

Contact

Språkbanken
sb-info@svenska.gu.se