Available Analyses¶
This section provides an overview of some of the built-in analyses available within Sparv and the Sparv plugins
developed by Språkbanken Text. Note that this is not an exhaustive list of available annotations but rather a summary of
the linguistic analyses. Technical annotations (e.g., automatic assignment of IDs or calculation of whitespace
information) are not included here. For a complete list of analyses, refer to the output of the sparv modules
command.
Note

Annotations refer to the names of the annotations as they appear in the corpus config file under the export.annotations section (learn more in the corpus configuration section). Please note that the annotations usually have shorter names in the corpus exports.

Annotators are the names of the functions (including their module names) used to produce the annotations. These can be executed directly with the sparv run-rule [annotator] command. However, this is generally unnecessary, as they are automatically executed when running the sparv run command if their corresponding annotations are included in the corpus config file.
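For example, a corpus config could request a few of the annotations described in this section. The annotation and annotator names below are taken from the tables that follow, while the surrounding structure is only an illustrative sketch; see the corpus configuration section for the full config format.

```yaml
# Minimal illustrative excerpt of a corpus config (typically config.yaml).
# Only the export.annotations list is shown here.
export:
    annotations:
        - segment.sentence          # sentence segments
        - <token>:stanza.pos        # part-of-speech tags
        - <token>:saldo.baseform    # lemmas
```

With such a config, running sparv run triggers the corresponding annotators (segment:sentence, stanza:msdtag and saldo:annotate) automatically, so calling sparv run-rule on them individually is rarely needed.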
Analyses for contemporary Swedish¶
For analysing texts in contemporary Swedish we recommend using the annotation preset called SWE_DEFAULT.
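Presets can be used in the same annotations list instead of spelling out each annotation. A minimal sketch, assuming the preset name is referenced directly as described in the corpus configuration section:

```yaml
# Illustrative sketch: the SWE_DEFAULT preset stands in for the
# individual contemporary-Swedish annotations described below.
export:
    annotations:
        - SWE_DEFAULT
```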
Sentence segmentation with PunktSentenceTokenizer¶
Description | Texts are split into sentences. |
Model | punkt-nltk-svenska.pickle trained on StorSUC |
Method | The model is built with NLTK's PunktTrainer. The segmentation is done with NLTK's PunktSentenceTokenizer. |
Annotations | segment.sentence (sentence segments) |
Annotators | segment:sentence |
Tokenization¶
Description | Sentence segments are split into tokens. |
Model | - configuration file bettertokenizer.sv - word list bettertokenizer.sv.saldo-tokens built upon SALDO's morphology (built automatically by Sparv) |
Method | Tokenizer using regular expressions and lists of words containing special characters and common abbreviations. Sparv's version is custom-made for Swedish, but it is possible to configure it for other languages. |
Annotations | segment.token (token segments) |
Annotators | segment:tokenize |
POS-tagging with Stanza¶
Description | Sentence segments are analysed to enrich tokens with part-of-speech tags and morphosyntactic information. |
Tool | Stanza |
Model | https://spraakbanken.gu.se/resurser/stanzamorph |
Tagset | - SUC MSD tags - Universal features |
Annotations | - <token>:stanza.pos (part-of-speech tag) - <token>:stanza.msd (morphosyntactic tag) - <token>:stanza.ufeats (universal features) |
Annotators | stanza:msdtag |
Translation from SUC to UPOS¶
Description | SUC part-of-speech tags are translated to UPOS. Not used by default because the translations are not very reliable. |
Model | Method has no model. A translation table is used. |
Tagset | Universal POS tags |
Annotations | - <token>:misc.upos (universal part-of-speech tag) |
Annotators | misc:upostag |
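Because this translation is not enabled by default, the annotation has to be requested explicitly in the corpus config. A hedged sketch, assuming a preset is combined with the extra annotation:

```yaml
# Illustrative sketch: adding the UPOS translation on top of the
# default contemporary-Swedish annotations.
export:
    annotations:
        - SWE_DEFAULT
        - <token>:misc.upos   # produced by the misc:upostag annotator
```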
POS-tagging with Hunpos¶
Description | Sentence segments are analysed to enrich tokens with part-of-speech tags and morphosyntactic information. No longer used by default because Stanza's POS-tagging yields better results. |
Tool | Hunpos |
Model | suc3_suc-tags_default-setting_utf8.model trained on SUC 3.0 |
Tagset | SUC MSD tags |
Annotations | - <token>:hunpos.msd (morphosyntactic tag) - <token>:hunpos.pos (part-of-speech tag) |
Annotators | - hunpos:msdtag - hunpos:postag |
Dependency parsing with Stanza¶
Description | Sentence segments are analysed to enrich tokens with dependency information. |
Tool | Stanza |
Model | https://spraakbanken.gu.se/resurser/stanzasynt |
Tagset | Mamba-Dep |
Annotations | - <token>:stanza.ref (the token position within the sentence) - <token>:stanza.dephead_ref (dependency head, the ref of the word which the current word modifies or is dependent on) - <token>:stanza.deprel (dependency relation, the relation of the current word to its dependency head) |
Annotators | - stanza:dep_parse - stanza:make_ref |
Dependency parsing with MaltParser¶
Description | Sentence segments are analysed to enrich tokens with dependency information. No longer used by default because Stanza's dependency parsing yields better results. |
Tool | MaltParser |
Model | swemalt trained on Svensk trädbank |
Tagset | Mamba-Dep |
Annotations | - <token>:malt.ref (the token position within the sentence) - <token>:malt.dephead_ref (dependency head, the ref of the word which the current word modifies or is dependent on) - <token>:malt.deprel (dependency relation, the relation of the current word to its dependency head) |
Annotators | - malt:annotate - malt:make_ref |
Installation | See the Installation and Setup section for more information. |
Phrase structure parsing¶
Description | Mamba-Dep dependencies produced by the dependency analysis are converted to phrase structures. |
Model | Method has no model. |
Annotations | - phrase_structure.phrase (phrase segments) - phrase_structure.phrase:phrase_structure.name (name of the phrase segment) - phrase_structure.phrase:phrase_structure.func (function of the phrase segment) |
Annotators | phrase_structure:annotate |
Lexical SALDO-based analyses¶
Description | Tokens and their POS tags are looked up in the SALDO lexicon in order to enrich them with more information. |
Model | SALDO morphology |
Tagset | SALDO tags for lemgrams |
Annotations | - <token>:saldo.baseform (lemma) - <token>:saldo.lemgram (lemgrams, identifying the inflectional table) - <token>:saldo.sense (identifies senses in SALDO) |
Annotators | saldo:annotate |
Lemmatization with Stanza¶
Description | Sentence segments are analysed to enrich tokens with lemmas. |
Tool | Stanza |
Model | https://spraakbanken.gu.se/resurser/stanzasynt |
Annotations | - <token>:stanza.baseform (lemma) |
Annotators | stanza:annotate_swe |
Sense disambiguation¶
Description | SALDO IDs from the <token>:saldo.sense attribute are enriched with likelihoods. |
Tool | Sparv wsd |
Documentation | Running the Koala word sense disambiguators |
Model | - ALL_512_128_w10_A2_140403_ctx1.bin - lem_cbow0_s512_w10_NEW2_ctx.bin |
Annotations | - <token>:wsd.sense (identifies senses in SALDO along with their likelihoods) |
Annotators | wsd:annotate |
Installation | See the Installation and Setup section for more information. |
Compound analysis with SALDO¶
Description | Tokens and their POS tags are looked up in the SALDO lexicon in order to enrich them with compound information. More information (in Swedish) is found in the Språkbanken Text FAQ ("Hur fungerar Sparvs sammansättningsanalys?"). Lemmas are also enriched as part of this analysis (the <token>:saldo.baseform2 annotation). |
Model | - SALDO morphology - NST pronunciation lexicon for Swedish - word frequency statistics from Korp |
Annotations | - <token>:saldo.complemgram (compound lemgrams including a comparison score) - <token>:saldo.compwf (compound word forms) - <token>:saldo.baseform2 (lemma) |
Annotators | saldo:compound |
Sentiment analysis with SenSALDO¶
Description | Tokens and their SALDO IDs are looked up in SenSALDO in order to enrich them with sentiments. |
Model | SenSALDO |
Annotations | - <token>:sensaldo.sentiment_label (sentiment) - <token>:sensaldo.sentiment_score (sentiment value) |
Annotators | sensaldo:annotate |
Named entity recognition with HFST-SweNER¶
Description | Sentence segments are analysed and enriched with named entities. |
Tool | hfst-SweNER |
Model | included in the tool |
References | - HFST-SweNER – A New NER Resource for Swedish - Reducing the effect of name explosion |
Tagset | HFST-SweNER tags |
Annotations | - swener.ne (named entity segment) - swener.ne:swener.name (text in the entire named entity segment) - swener.ne:swener.ex (named entity; name expression, numerical expression or time expression) - swener.ne:swener.type (named entity type) - swener.ne:swener.subtype (named entity subtype) |
Annotators | swener:annotate |
Installation | See the Installation and Setup section for more information. |
Readability metrics¶
Description | Documents are analysed in order to enrich them with readability metrics. |
Model | Method has no model. |
Annotations | - <text>:readability.lix (the Swedish readability metric LIX, läsbarhetsindex) - <text>:readability.ovix (the Swedish readability metric OVIX, ordvariationsindex) - <text>:readability.nk (the Swedish readability metric nominalkvot (noun ratio)) |
Annotators | - readability:lix - readability:ovix - readability:nominal_ratio |
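For reference, LIX is conventionally defined as below, where W is the number of words, S the number of sentences and L the number of words longer than six characters; Sparv's implementation may differ in tokenization details. OVIX and the noun ratio are computed analogously from word-type counts and part-of-speech counts, respectively.

```latex
% Conventional definition of LIX (läsbarhetsindex); Sparv's exact
% implementation may differ in minor details.
\mathrm{LIX} = \frac{W}{S} + 100 \cdot \frac{L}{W}
```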
Lexical classes¶
Description | Tokens are looked up in Blingbring and SweFN in order to enrich them with information about their lexical classes. Documents are then enriched with information about lexical classes based on which classes are common for the tokens within them. |
Model | - Blingbring - Swedish FrameNet (SweFN) |
Annotations | - <token>:lexical_classes.blingbring (lexical class from the Blingbring resource per token) - <token>:lexical_classes.swefn (frames from Swedish FrameNet (SweFN) per token) - <text>:lexical_classes.blingbring (lexical class from the Blingbring resource per document) - <text>:lexical_classes.swefn (frames from Swedish FrameNet (SweFN) per document) |
Annotators | - lexical_classes:blingbring_words - lexical_classes:swefn_words - lexical_classes:blingbring_text - lexical_classes:swefn_text |
Geotagging¶
Description | Sentences (and paragraphs, if present) are enriched with place names (and their geographic coordinates) occurring within them. This is based on the place names found by the named entity tagger. Geographical coordinates are looked up in the GeoNames database. |
Model | GeoNames |
Annotations | - <sentence>:geo.geo_context (places and their coordinates occurring within the sentence) - <paragraph>:geo.geo_context (places and their coordinates occurring within the paragraph) |
Annotators | geo:contextual |
Analyses for Swedish from the 1800s¶
We recommend using the annotation preset called SWE_1800. All analyses available for contemporary Swedish can also be applied to 19th-century Swedish. Additionally, some analyses have been specifically adapted for this language variety:
POS-tagging with Hunpos (adapted for 19th-century Swedish)¶
Description | Sentence segments are analysed to enrich tokens with part-of-speech tags and morphosyntactic information. |
Tool | Hunpos |
Model | - suc3_suc-tags_default-setting_utf8.model trained on SUC 3.0 - a word list along with the words' morphosyntactic information generated from the Dalin morphology and the Swedberg morphology |
Tagset | SUC MSD tags |
Annotations | - <token>:hunpos.msd (morphosyntactic tag) - <token>:hunpos.pos (part-of-speech tag) |
Annotators | - hunpos:msdtag_hist - hunpos:postag |
Installation | See the Installation and Setup section for more information. |
Lexicon-based analyses¶
Description | Tokens and their POS tags are looked up in different lexicons in order to enrich them with more information. |
Model | - SALDO morphology - Dalin morphology - Swedberg morphology - Diachronic pivot |
Tagset | SALDO tags (for lemgrams) |
Annotations | - <token>:hist.baseform (lemma) - <token>:hist.sense (identifies senses in SALDO) - <token>:hist.lemgram (lemgrams, identifying the inflectional table) - <token>:hist.diapivot (SALDO lemgrams from the diapivot model) - <token>:hist.combined_lemgrams (SALDO lemgram, combined from SALDO, Dalin, Swedberg and the diapivot model) |
Annotators | - hist:annotate_saldo - hist:diapivot_annotate - hist:combine_lemgrams |
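A hedged config sketch for a corpus of 19th-century Swedish, assuming the SWE_1800 preset is referenced directly in the annotations list; since the contemporary analyses also apply, one of them is added alongside the preset. The metadata settings that select this language variety are described in the corpus configuration section and are omitted here.

```yaml
# Illustrative sketch: the SWE_1800 preset for the adapted analyses,
# combined with a contemporary-Swedish annotation.
export:
    annotations:
        - SWE_1800
        - <token>:sensaldo.sentiment_label
```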
Analyses for Old Swedish¶
We recommend using the annotation preset called SWE_FSV. All analyses for contemporary Swedish are available for this language variety. However, we do not recommend using them, because the spelling often differs too much from contemporary Swedish to give satisfactory results. At Språkbanken Text we use the following analyses for texts written in Old Swedish:
Sentence segmentation¶
Same analysis as for contemporary Swedish.
Tokenization¶
Same analysis as for contemporary Swedish.
Spelling variations¶
Description | Tokens are looked up in a model to get common spelling variations. |
Model | model for Old Swedish spelling variations |
Annotations | <token>:hist.spelling_variants (possible spelling variations for the token) |
Annotators | hist:spelling_variants |
Lexicon-based analyses¶
Description | Tokens and their POS tags are looked up in different lexicons in order to enrich them with more information. |
Model | - Fornsvensk morphology from Söderwall and Schlyter - SALDO morphology - Diachronic pivot |
Tagset | SALDO tags for lemgrams |
Annotations | - <token>:hist.baseform (lemma) - <token>:hist.lemgram (lemgrams, identifying the inflectional table) - <token>:hist.diapivot (SALDO lemgrams from the diapivot model) - <token>:hist.combined_lemgrams (SALDO lemgram, combined from SALDO, Dalin, Swedberg and the diapivot model) |
Annotators | - hist:annotate_saldo_fsv - hist:diapivot_annotate - hist:combine_lemgrams |
Homograph sets¶
Description | A set of possible POS tags is extracted from the lemgram annotation. |
Model | Method has no model. |
Tagset | POS tags from the SUC MSD tag set |
Annotations | <token>:hist.homograph_set (possible part-of-speech tags for the token) |
Annotators | hist:extract_pos |
Analyses for languages other than Swedish¶
Sparv supports analyses for a number of different languages. A list of which languages are supported and what analysis tools are available can be found in the Installation and Setup section.
Analyses from TreeTagger¶
We recommend using the annotation preset called TREETAGGER.
Description | Tokenised sentence segments are analysed to enrich tokens with more information. |
Tool | TreeTagger |
Model | Different language-dependent parameter files are used. Please check the TreeTagger web site for more information. |
Tagset | - Different language-dependent POS tag sets are used. Please check the TreeTagger web page for more information. - Universal POS tags |
Annotations | - <token>:treetagger.baseform (lemma) - <token>:treetagger.pos (part-of-speech tag, may include morphosyntactic information) - <token>:treetagger.upos (universal part-of-speech tags, translated from <token>:treetagger.pos ) |
Annotators | treetagger:annotate |
Installation | See the Installation and Setup section for more information. |
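A hedged sketch of a config for a non-Swedish corpus annotated with TreeTagger. The language code (German, deu) is used purely as an example, and the metadata layout is an assumption; check the corpus configuration and Installation and Setup sections for the exact values.

```yaml
# Illustrative sketch: a corpus in another language annotated with
# the TREETAGGER preset. The language code below is an example only.
metadata:
    language: deu
export:
    annotations:
        - TREETAGGER
```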
Analyses from FreeLing¶
We recommend using the annotation preset called SBX_FREELING or SBX_FREELING_FULL (for languages supporting named entity recognition).
Description | Entire documents are analysed with FreeLing for sentence segmentation, tokenization and enrichment with other information. Note that FreeLing is distributed under a less permissive licence than Sparv, and installation of the Sparv FreeLing plugin is necessary. |
Tool | FreeLing |
Model | Models for different languages are included in the tool. |
Tagset | - Different language-dependent POS tagsets (often EAGLES). Please check the FreeLing documentation for more information. - Universal POS tags |
Annotations | - freeling.sentence (sentence segments from FreeLing) - freeling.token (token segments from FreeLing) - freeling.token:freeling.baseform (lemma) - freeling.token:freeling.pos (part-of-speech tag, often including some morphosyntactic information) - freeling.token:freeling.upos (universal part-of-speech tags) - freeling.token:freeling.ne_type (named entity type; only available for some languages) |
Annotators | freeling:annotate or freeling:annotate_full (depending on the language) |
Installation | See the Installation and Setup section for more information. |
Analyses from Stanza (for English)¶
We recommend using the annotation preset called STANZA.
Description | Entire documents are analysed with Stanza for sentence segmentation, tokenization and enrichment with other information. |
Tool | Stanza |
Model | included in the tool |
Tagset | - Universal POS tags - Universal features |
Annotations | - stanza.sentence (sentence segments from Stanza) - stanza.ne (named entity segments from Stanza) - stanza.ne:stanza.ne_type (named entity type) - stanza.token (token segments from Stanza) - <token>:stanza.baseform (lemma) - <token>:stanza.pos (part-of-speech tag) - <token>:stanza.upos (universal part-of-speech tags) - <token>:stanza.ufeats (universal features) - <token>:stanza.ref (the token position within the sentence) - <token>:stanza.dephead_ref (dependency head, the ref of the word which the current word modifies or is dependent on) - <token>:stanza.deprel (dependency relation, the relation of the current word to its dependency head) |
Annotators | - stanza:annotate - stanza:make_ref |
Analyses from Stanford Parser (for English)¶
We recommend using the annotation preset called STANFORD.
Description | Entire documents are analysed with Stanford Parser for sentence segmentation, tokenization and enrichment with other information. |
Tool | Stanford Parser |
Model | included in the tool |
Tagset | - Penn Treebank tagset - Universal POS tags |
Annotations | - stanford.sentence (sentence segments from Stanford Parser) - stanford.token (token segments from Stanford Parser) - stanford.token:stanford.baseform (lemma) - stanford.token:stanford.pos (part-of-speech tag) - stanford.token:stanford.upos (universal part-of-speech tags) - stanford.token:stanford.ne_type (named entity type) - stanford.token:stanford.ref (the token position within the sentence) - stanford.token:stanford.dephead_ref (dependency head, the ref of the word which the current word modifies or is dependent on) - stanford.token:stanford.deprel (dependency relation, the relation of the current word to its dependency head) |
Annotators | - stanford:annotate - stanford:make_ref |
Installation | See the Installation and Setup section for more information. |