Corpus Configuration
To process a corpus with Sparv, you need to create a corpus configuration file. This file, written in YAML, provides essential information about your corpus and instructs Sparv on how to process it. The corpus config wizard can assist you in generating this file. Alternatively, you can refer to the example corpora for inspiration.
A minimal config file contains a list of automatic annotations you want to be included in the output. Below is an example of a small config file:
metadata:
    # Language of the source files
    language: swe
export:
    # Automatic annotations to include in the export
    annotations:
        - <sentence>:misc.id
        - <token>:saldo.baseform
        - <token>:hunpos.pos
        - <token>:sensaldo.sentiment_label
Note
In Sparv and throughout this documentation, configuration keys are often written in dot notation (e.g., export.annotations). This notation indicates that annotations is a key nested within the export section, as seen in the example above.
Note
Most processors (like annotators or importers) in Sparv have adjustable options that can be fine-tuned within the configuration file. Each module has its own dedicated section, like metadata and export in the example above. To view all available configuration keys and their descriptions, use the sparv modules command.
Config Schema
Running sparv schema generates a JSON schema that many text editors can use to validate your config file as you write it. In some editors, this schema also enables autocompletion for configuration options. The schema is printed to the console, but you can save it to a file by running sparv schema > schema.json.
Corpus Config Wizard
The corpus config wizard is a tool that helps you create a config file by guiding you through questions about your corpus and desired annotations. Run sparv wizard to start the tool. If a config file already exists in the directory, the wizard will read it and set default values based on the current configuration.
The wizard is a helpful starting point for creating a config file but does not cover all of Sparv's advanced features. However, you can of course manually edit a wizard-generated config file to add advanced configurations, such as custom annotations or headers.
Default Values
Some configuration variables, like metadata, classes, import, export, and custom_annotations, are general and used across multiple Sparv modules. Others are specific to individual annotation modules (e.g., hunpos.binary specifies the binary used by the Hunpos module to perform part-of-speech tagging). These module-specific options typically have default values defined within each module.
When you run Sparv, it reads your corpus configuration file and combines it with both Sparv's default configuration file (config/config_default.yaml in the Sparv data directory) and the module-specific default values. To view the final, combined configuration, you can use the command sparv config. This command also allows you to check individual config variables or sections, such as sparv config metadata.language or sparv config metadata. All default values can be overridden in your corpus configuration file.
Certain configuration options must be defined, either by Sparv's default config file or your corpus config:
- metadata.language (default: swe)
- import.importer (default: xml_import:parse)
- export.annotations
- classes.token (default: segment.token)
- classes.sentence (default: segment.sentence)
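As an illustration, the sketch below spells out all of the required options in a corpus config. Every value except import.importer repeats the default listed above, so in practice only the lines that differ from the defaults need to be written; the choice of importer and annotation is just an example.

```yaml
metadata:
    # Same as the default
    language: swe
import:
    # Overrides the default xml_import:parse
    importer: text_import:parse
classes:
    # Same as the defaults
    token: segment.token
    sentence: segment.sentence
export:
    # Has no default in the list above, so it is set here
    annotations:
        - <sentence>:misc.id
```

Running sparv config afterwards shows the combined result of these values and the remaining defaults.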
Metadata Options
The metadata section in your corpus configuration file provides metadata about your corpus, which may be utilized by various Sparv modules.
- metadata.id: Specifies a machine-readable name for the corpus, required by some export modules. This string may contain ASCII letters, digits, and dashes.
- metadata.name: An optional human-readable name for the corpus. This field is divided into one or more subsections to specify the name in different languages. Use ISO 639-3 language codes to indicate the language, e.g., metadata.name.eng for the English name of the corpus.
- metadata.language: Indicates the language of the source files in the corpus, using an ISO 639-3 code. If omitted, it defaults to swe. Run sparv languages to see a list of supported languages and their corresponding codes.
- metadata.variety: An optional field to specify the language variety of the source files, if applicable. Run sparv languages to view supported language varieties.
- metadata.description: An optional description of the corpus, which may span multiple lines. Like metadata.name, this field is divided into language-specific subsections using ISO 639-3 language codes.
- metadata.short_description: An optional short description of the corpus, spanning one line. Like metadata.name, this field is divided into language-specific subsections using ISO 639-3 language codes.
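A metadata section using the options above might look like the following sketch. The corpus ID, names, and descriptions are made-up values for illustration; only the keys come from the list above.

```yaml
metadata:
    # Machine-readable ID: ASCII letters, digits, and dashes
    id: animals-foxes
    # Human-readable names, one per ISO 639-3 language code
    name:
        eng: "Animal corpus: Foxes"
        swe: "Djurkorpus: Rävar"
    # Language of the source files
    language: swe
    short_description:
        eng: Texts about foxes.
    # The block scalar (|) lets the description span several lines
    description:
        eng: |
            A longer description of the corpus,
            which may span several lines.
```

The names are quoted because they contain a colon followed by a space, which YAML would otherwise read as a key-value separator.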
Import Options
The import section of your corpus config provides Sparv with details about your source files (i.e., your corpus).
- import.source_dir: Specifies the location of your source files and defaults to source. Sparv will search this directory recursively for valid files to process.
- import.importer: Defines which importer Sparv should use when processing your files. Which importer you should use depends on the format of your source files. For XML files, use xml_import:parse (the default setting), and for plain text files, use text_import:parse. Run sparv modules to see what other importers are available.
- import.text_annotation: Identifies the existing annotation representing one text. Any automatic text-level annotations will be attached to this annotation. For XML files, this corresponds to an XML element in your source files; for unstructured import formats, like plain text and PDF, a default text root annotation is created automatically, so no further configuration is needed.
  Note
  This setting automatically sets the text class. If you want to use an automatic annotation as the text annotation, you should not use this setting, and instead set the text class directly.
- import.encoding: Specifies the source file encoding, with a default of UTF-8.
- import.normalize: Normalizes Unicode symbols in the input, using any of the following forms: NFC, NFKC, NFD, or NFKD. Defaults to NFC.
- import.keep_control_chars: Set to true to retain control characters in the text. Generally, this should remain disabled (the default setting) to avoid potential issues.
Each importer may have additional options. Use sparv modules --importers to view these. For instance, the XML importer offers options to skip the content of specific elements and provides fine-grained control over XML header imports. For more details about XML-specific options, run sparv modules xml_import. See also the section on headers and XML namespaces further down.
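As an illustration of the general import options described in this section, here is a minimal sketch for a corpus of plain text files. Apart from the importer, every value shown is the documented default and is written out only to make the structure visible.

```yaml
import:
    # Directory searched recursively for source files (default)
    source_dir: source
    # Plain text importer instead of the default xml_import:parse
    importer: text_import:parse
    # Encoding of the source files (default)
    encoding: UTF-8
    # Unicode normalization form: NFC, NFKC, NFD, or NFKD (NFC is the default)
    normalize: NFC
    # Leave control characters out of the text (default)
    keep_control_chars: false
```

No import.text_annotation is set here, since a default text root annotation is created automatically for plain text input.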
Export Options
The export section of your corpus config specifies how the output data should be structured and what it should contain.
Annotations from the Source Data
The export.source_annotations option allows you to select which existing annotations in your source files should be included in the output (only applicable if your input is in a format that supports structured data, such as XML). By default, all existing annotations, such as XML elements and attributes, are included in the output.
- When specifying elements using this option, ensure the root element (i.e., the element that encompasses all other elements and text content) is included. Omitting the root element will result in invalid XML output, preventing Sparv from generating XML files.
- To exclude all source annotations, set this option to []. Note that this may cause XML export errors, as a root element is required.
You can rename elements and attributes present in your input data. For example, if your files contain elements like <article name="Scandinavian Furniture" date="2020-09-28"> and you want them to appear as <text title="Scandinavian Furniture" date="2020-09-28"> in the output, you can use the following syntax:
export:
    source_annotations:
        - article as text
        - article:name as title
        - ...
The ellipsis (...) in the example above represents all other elements and attributes in your input data that are not explicitly listed. Omitting the ellipsis would result in the "date" attribute being excluded from the output.
If you want to keep most of the markup of your input data but exclude some elements or attributes, you can use the not keyword:
export:
    source_annotations:
        - not date
Applied to the example element above, this configuration would produce the following output: <article name="Scandinavian Furniture">.
If source_annotations contains only annotations prefixed with not, all other annotations are automatically included by default, without needing to use the ellipsis described above.
Automatic Annotations Generated by Sparv
The export.annotations option specifies the list of automatic annotations that Sparv should generate and include in the output. To see which annotations are available, run sparv modules --annotators. Some annotations use curly brackets, such as {annotation}:misc.id, indicating a wildcard that must be replaced with a specific value in the export.annotations list (e.g., <sentence>:misc.id). You may also use annotation presets.
If you need to produce multiple output formats with different annotations, you can override the export.source_annotations and export.annotations options for specific exporter modules. For instance, use xml_export.source_annotations and xml_export.annotations for XML export, and csv_export.source_annotations and csv_export.annotations for CSV export. Generally, the values from the export section are used by default unless overridden at the exporter module level.
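A minimal sketch of such an override is shown below. The annotations are taken from the examples earlier on this page, and which ones to keep for the CSV export is an arbitrary choice made for illustration.

```yaml
export:
    # Used by every exporter that does not define its own list
    annotations:
        - <sentence>:misc.id
        - <token>:saldo.baseform
        - <token>:hunpos.pos
csv_export:
    # Overrides export.annotations for the CSV export only
    annotations:
        - <token>:hunpos.pos
```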
Tip
If two or more sections of your configuration are identical, such as the list of annotations to include in different export formats, instead of copying and pasting, you can use YAML anchors.
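For readers unfamiliar with YAML anchors, the sketch below shows the mechanism: an anchor (&name) labels a value where it is first written, and an alias (*name) reuses it elsewhere. The anchor name shared-annotations is arbitrary.

```yaml
xml_export:
    # Define the list once and label it with an anchor
    annotations: &shared-annotations
        - <sentence>:misc.id
        - <token>:saldo.baseform
        - <token>:hunpos.pos
csv_export:
    # Reuse the exact same list through an alias
    annotations: *shared-annotations
```

An alias always refers to the complete anchored value, so this approach fits sections that are meant to stay identical.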
Tip
You can convert a structural attribute to a token attribute, which is useful for representing structural information (like named entities or phrase structures) in non-structured formats (e.g., CSV export). Use the annotation <token>:misc.from_struct_{struct}_{attr}, replacing {struct} and {attr} with the structural annotation and attribute names respectively (e.g., <token>:misc.from_struct_swener.ne_swener.type).
Default Export Formats
The export.default option specifies the list of export formats produced when running sparv run without format arguments. By default, this list includes only xml_export:pretty, which is the formatted XML export with one token per line. Use sparv run --list to see all available export formats.
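For example, the following sketch makes sparv run produce two formats by default. Both format names appear on this page; any other format listed by sparv run --list could be added in the same way.

```yaml
export:
    default:
        # Formatted XML with one token per line (the built-in default)
        - xml_export:pretty
        # XML that preserves the original corpus text
        - xml_export:preserved_format
```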
Naming Options
Several export options relate to the naming of annotations and attributes. You can prefix all annotations produced by Sparv with a custom prefix using the export.sparv_namespace option. Similarly, you can add a prefix to all annotations and attributes from your source files using the export.source_namespace option.
Note
Despite the name of the options, these are not real XML namespaces but merely prefixes added to annotation names.
The export.remove_module_namespaces option is true by default, meaning module name prefixes are removed during export. Turning this option off will result in output like:
<segment.token stanza.pos="IN" saldo.baseform="|hej|">Hej</segment.token>
instead of the more compact:
<token pos="IN" baseform="|hej|">Hej</token>
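The three naming options can be set together in the export section, as in the sketch below. The prefix values sparv and src are hypothetical and chosen only for illustration.

```yaml
export:
    # Prefix for annotations produced by Sparv (hypothetical value)
    sparv_namespace: sparv
    # Prefix for annotations and attributes from the source files (hypothetical value)
    source_namespace: src
    # Default; set to false to keep module prefixes such as segment. or saldo.
    remove_module_namespaces: true
```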
Scrambling and Anonymisation
The export.scramble_on setting is used by all export formats that support scrambling. It controls which annotation your corpus will be scrambled on (by default, no scrambling is performed). Typical settings are export.scramble_on: <sentence> or export.scramble_on: <paragraph>. For example, setting this to <paragraph> would lead to all paragraphs being randomly shuffled in the export, while the sentences and tokens within the paragraphs keep their original order.
The export.word option defines the strings to be output as tokens in the export. By default, this is set to <token:word>. A useful application for this setting is anonymisation of texts. If you want to produce XML containing only annotations but not the actual text, you could set export.word: <token>:anonymized to get output like this:
<sentence id="b1ac">
    <token pos="IN">***</token>
    <token pos="MAD">*</token>
</sentence>
Note
For technical reasons, the xml_export:preserved_format export does not respect this setting. The preserved format XML will always contain the original corpus text.
Each exporter may have additional options which can be listed with sparv modules --exporters.
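The scrambling and anonymisation settings from this section can be combined in one export section. A minimal sketch using only the values mentioned above:

```yaml
export:
    # Shuffle paragraphs; sentences and tokens inside them keep their order
    scramble_on: <paragraph>
    # Output the anonymised strings instead of the original words
    word: <token>:anonymized
```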
Headers
In some cases, corpus metadata in XML is stored within header elements rather than as attributes of text-enclosing elements. Sparv can extract this information from headers and store it as annotations, which can then be used as input for various analyses. Additionally, header information can be exported as attributes if desired.
Consider the following example of a corpus file:
<text id="1">
    <header>
        <author birth="1780" death="????">Anonym</author>
        <date>2020-09-08</date>
        <title>
            <main-title>A History of Corpora</main-title>
            <sub-title>A Masterpiece</sub-title>
        </title>
    </header>
    <another-header>
        <dummy>1</dummy>
    </another-header>
    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do
    eiusmod tempor incididunt ut labore et dolore magna aliqua.
</text>
In this example, we want to retain the data within the <header> element but exclude its contents from being analyzed as corpus text. Instead, we want to attach this metadata to the <text> element. Additionally, we want to completely remove the <another-header> element and its contents.
The following configuration achieves this:
xml_import:
    header_elements:
        - header
        - another-header
    header_data:
        - header/author as text:author
        - header/author:birth as text:author-birth
        - header/author:death as text:author-death
        - header/title/main-title as text:title
        - header/title/sub-title as text:subtitle
        - header/date as text:date
xml_export:
    header_annotations:
        - not header
        - not another-header
With this configuration, the output will look like this:
<text author="Anonym" author-birth="1780" author-death="????" date="2020-09-08"
      id="1" title="A History of Corpora" subtitle="A Masterpiece">
    <sentence id="13f">
        <token>Lorem</token>
        <token>ipsum</token>
        ...
    </sentence>
</text>
If you want to keep the headers in the output without analyzing them as corpus text, simply list them without the not prefix in xml_export.header_annotations. If you don't specify anything in xml_export.header_annotations, all headers will be retained by default.
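A variation of the configuration above that keeps <header> in the output while still dropping <another-header> could be sketched as follows. This assumes that plain entries and entries prefixed with not can be combined in the same list, which this page does not state explicitly; both headers are listed so that nothing depends on how unlisted headers are treated.

```yaml
xml_import:
    # Both elements are still treated as headers, not as corpus text
    header_elements:
        - header
        - another-header
xml_export:
    header_annotations:
        # Keep <header> in the XML output
        - header
        # Drop <another-header> (assumes mixing with "not" is allowed)
        - not another-header
```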
XML Namespaces
When working with XML source data that includes namespaces, Sparv attempts to preserve these namespaces in the XML output. However, there are two key limitations to be aware of:
- Namespace declarations are always placed in the root element of the output, regardless of their location in the source data.
- URIs and prefixes are assumed to be unique. A URI will be associated with the first prefix declared for that URI in the source file.
To refer to elements or attributes with namespaces in the corpus configuration file, use a special syntax. This syntax consists of the namespace prefix followed by a +, and then the tag or attribute name. For example, the reference for the element <sparv:myelement xmlns:sparv="https://spraakbanken.gu.se/verktyg/sparv"> would be sparv+myelement.
If you prefer to remove namespaces during import, set xml_import.remove_namespaces to true in the corpus configuration. Be cautious, as this may lead to collisions between attributes with the same name but different namespaces.
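As a sketch of the + syntax in use, the fragment below excludes the namespaced element from the example above from the output. That the syntax is accepted in export.source_annotations is an assumption based on the statement that it applies when referring to elements in the corpus configuration file.

```yaml
export:
    source_annotations:
        # prefix + "+" + tag name refers to <sparv:myelement>
        - not sparv+myelement
```

If namespaces are instead removed during import with xml_import.remove_namespaces, the prefix would presumably no longer be part of the name.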
Annotation Classes
Annotation classes allow you to refer to annotations without specifying the module that produces them. For example, <token> and <sentence> are annotation classes that by default refer to segment.token and segment.sentence, respectively.
These classes simplify dependency management between modules and enhance the pipeline's flexibility. For instance, a part-of-speech tagger that requires tokenized text as input probably doesn't care which tokenizer produced the tokens, so it can simply request <token> as input.
As a user, you can also benefit from annotation classes. In most parts of the config file where annotations are referenced, you can use classes instead. To see the available classes, run the command sparv classes. Classes are denoted by enclosing the class name in angle brackets, like <token>.
If multiple modules can produce annotations of the same class, you can specify which one to use in the classes section of your config file. Sparv provides default class settings, but you can override them as needed.
To illustrate how you can use classes in your configuration file, consider the following example:
If you want to include part-of-speech tags from Stanza, you could add segment.token:stanza.pos to the export annotations list. This indicates that you want part-of-speech tags from Stanza as attributes for the tokens produced by the segment module.
export:
    annotations:
        - segment.token:stanza.pos
        - segment.token:malt.deprel
        - segment.token:malt.dephead_ref
The drawback of this notation is that if you decide to switch your tokenizer to a different one called my_new_segmenter, you would need to update all instances of segment.token in your configuration to my_new_segmenter.token. Instead of doing this, you can redefine your token class by setting classes.token to my_new_segmenter.token and use the <token> class in your annotations list:
classes:
    token: my_new_segmenter.token
export:
    annotations:
        - <token>:stanza.pos
        - <token>:malt.deprel
        - <token>:malt.dephead_ref
Another instance where you might need to explicitly redefine classes is when your source files include annotations (such as sentences or tokens) that should be used as input for annotators. For instance, if you have manually segmented sentences and enclosed each sentence in an <s> element, you can bypass Sparv's automatic sentence segmentation by setting the sentence class to this element:
classes:
    sentence: s
xml_import:
    elements:
        - s
This works because annotators requiring sentence annotations as input rely on the <sentence> class, rather than being tied to a specific module's sentence annotation.
Attention
To be able to use annotations from your source data as input for annotators, Sparv first needs to be informed about their existence. For XML data, this is achieved by listing the relevant elements in the xml_import.elements section of your configuration file, as shown in the example above.
Note that only annotations intended as input for annotators need to be listed in xml_import.elements; annotations that are merely passed through to the output do not.
Annotation Presets
Annotation presets are predefined collections of annotations that can simplify your configuration. Instead of listing each annotation individually, you can use a preset to include all related annotations at once. For example, instead of specifying each SALDO annotation separately like this:
export:
    annotations:
        - <token>:saldo.baseform2 as baseform
        - <token>:saldo.lemgram
        - <token>:wsd.sense
        - <token>:saldo.compwf
        - <token>:saldo.complemgram
You can use the SWE_DEFAULT.saldo preset to include all SALDO annotations with a single line:
export:
    annotations:
        - SWE_DEFAULT.saldo
Here, SWE_DEFAULT.saldo will expand to include all the SALDO annotations. You can mix presets with individual annotations and combine different presets as needed.
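Mixing a preset with individual annotations could look like the sketch below. The individual annotations are taken from earlier examples on this page and are chosen only for illustration.

```yaml
export:
    annotations:
        # Expands to all SALDO annotations in the preset
        - SWE_DEFAULT.saldo
        # Individual annotations listed alongside the preset
        - <sentence>:misc.id
        - <token>:hunpos.pos
```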
Sparv comes with a set of default presets. To see which presets are available for your corpus and the annotations they include, run the sparv presets command. Presets are defined in YAML files located in the Sparv data directory under config/presets. You can add your own presets by adding YAML files to this directory.
It is possible to exclude specific annotations from a preset using the not keyword. In the following example, all SALDO annotations are included except for the compound analysis attributes:
export:
    annotations:
        - SWE_DEFAULT.saldo
        - not <token>:saldo.compwf
        - not <token>:saldo.complemgram
Note
Preset files may define their own default class values. These will be set automatically when using a preset. You can override these in your config files if necessary.
Parent Configuration
When managing multiple corpora with similar configurations, you can streamline the process by using a parent configuration file. This is particularly useful when only a few variables, such as the corpus ID, differ between corpora. To do this, specify the path to the parent configuration file in the parent variable of your individual corpus config files. Your corpus configuration will then inherit all parameters from the parent file, except those explicitly defined in the individual config file.
parent: ../parent-config.yaml
metadata:
    id: animals-foxes
    name:
        swe: 'Djurkorpus: Rävar'
In this example, the configuration will include everything from ../parent-config.yaml, but the values for metadata.id and metadata.name.swe will be overridden with animals-foxes and Djurkorpus: Rävar, respectively.
You can also specify multiple parent files in a list, with each subsequent parent overriding any conflicting values from the previous ones:
parent:
    - ../parent-config.yaml
    - ../another-parent-config.yaml
Nested parents, where a parent references another parent, are also supported.
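For illustration, a parent file such as ../parent-config.yaml could hold the settings shared by all corpora, as in the sketch below. Its contents are hypothetical; only the keys are taken from this page.

```yaml
# Hypothetical contents of ../parent-config.yaml
metadata:
    language: swe
import:
    importer: xml_import:parse
export:
    annotations:
        - <sentence>:misc.id
        - <token>:saldo.baseform
        - <token>:hunpos.pos
```

Each corpus config that names this file as its parent then only needs to add what differs, such as metadata.id.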
Custom Annotations
The custom_annotations section of the configuration file serves three distinct purposes, each detailed below.
Built-in Utility Annotations
Sparv includes a special type of annotator called "utility annotators". Unlike regular annotators, which usually have default settings but can be customized with configuration variables, utility annotators have few or no default settings and need to be configured using the custom_annotations section. They can be used multiple times, and their purpose is often to create new annotations by modifying the output of other annotators. Examples of utility annotators include misc:affix, which adds a prefix and/or suffix to another annotation, and misc:find_replace_regex, which replaces text in an annotation using a regular expression.
To use a utility annotator in your corpus, you need to configure it in the custom_annotations section of your config file. Utility annotators have no configuration variables, and instead use parameters. Here is an example configuration for the misc:affix annotator:
custom_annotations:
    - annotator: misc:affix
      params:
          out: <token>:misc.word.affixed
          chunk: <token:word>
          prefix: "|"
          suffix: "|"
In this example, the word annotation is used as input, and the string "|" is added as both a prefix and suffix. The configuration specifies the annotator (annotator) and sets the values for its parameters in the params section. The sparv modules command lists the parameters for each utility annotator, helping you identify them.
For the misc:affix annotator, the parameters are:
- out: The name of the output annotation, which must include the module name as a prefix (e.g., misc).
- chunk: The input annotation to be modified.
- prefix and suffix: The strings to be added as a prefix and/or suffix.
To include this annotation in your corpus, add <token>:misc.word.affixed to an annotations list in your corpus config (e.g., export.annotations). This example is applied in the standard-swe example corpus.
You can use the same annotator multiple times with different output names:
custom_annotations:
    - annotator: misc:affix
      params:
          out: <token>:misc.word.affixed
          chunk: <token:word>
          prefix: "|"
          suffix: "|"
    - annotator: misc:affix
      params:
          out: <token>:misc.word.affixed2
          chunk: <token:word>
          prefix: "+"
          suffix: "+"
Note
Custom annotations always create new annotations; they do not modify existing ones.
Note
When a parameter for a custom annotator requires a regular expression (e.g., in misc:find_replace_regex), the expression must be enclosed in single quotation marks. Regular expressions inside double quotation marks in YAML are not parsed correctly.
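The quoting rule follows from how YAML handles backslashes: a double-quoted string interprets backslash escape sequences, while a single-quoted string keeps backslashes literally. The sketch below illustrates this for misc:find_replace_regex. The parameter names find and sub and the output name are assumptions made for illustration and are not confirmed by this page; run sparv modules to see the actual parameters.

```yaml
custom_annotations:
    - annotator: misc:find_replace_regex
      params:
          # Hypothetical output name; it carries the module prefix (misc)
          out: <token>:misc.word.no_digits
          chunk: <token:word>
          # "find" and "sub" are assumed parameter names, check `sparv modules`
          # Single quotes keep the backslash in \d exactly as written
          find: '\d+'
          sub: ''
```

Written as "\d+" in double quotes, the same expression would be read as containing an escape sequence, which YAML does not define for \d.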
Reusing Regular Annotations
You can also use the custom_annotations section to reuse a regular (non-utility) annotation multiple times within your corpus. For instance, if you want to apply the same part-of-speech tagger with different models, you can achieve this by configuring the annotator in the custom_annotations section.
To do this, specify the annotator and add a config section with the necessary configuration, just as you would in the root of the configuration file. Additionally, you must provide a suffix to give the new annotation a unique name.
In the example below, we reuse the hunpos:msdtag annotator with a custom model:
custom_annotations:
    - annotator: hunpos:msdtag
      suffix: -mymodel
      config:
          hunpos:
              model: path/to/my_hunpos_model
The original Hunpos annotation is named <token>:hunpos.msd. With the specified suffix, the new annotation will be named <token>:hunpos.msd-mymodel, which can then be included in the list of annotations:
export:
    annotations:
        - <token>:hunpos.msd
        - <token>:hunpos.msd-mymodel
User-defined Custom Annotators
Typically, you extend Sparv with new annotators by creating a plugin, which makes the annotator available to all your corpora. However, in special cases, you might want to create a user-defined custom annotator, which is available only to the corpus in the same directory. Code-wise, plugins and custom annotators are essentially the same. The main difference lies in their packaging and usage.
For detailed instructions on writing a Sparv annotator, refer to the developer's guide. Below is a quick example.
Tip
The example uses the @annotator decorator to create an annotator. You can also create your own importer, exporter, installer, or model builder using the appropriate Sparv decorator. More information on decorators can be found in the developer's guide.
Creating a user-defined custom annotator involves three steps:
- Create a Python script with an annotator and place it in your corpus directory.
- Register the annotator in your corpus config.
- Use your custom annotation by referring to it in an annotations list.
Step 1: Create a Python script in your corpus directory, for example, convert.py:
my_corpus/
├── config.yaml
├── convert.py
└── source
    ├── document1.xml
    └── document2.xml
Sparv will automatically detect scripts placed here, provided that your functions are registered in your configuration file (see Step 2). Your annotator function must use one of the Sparv decorators (typically @annotator). Below is an example of a simple annotator that converts all tokens to uppercase:
from sparv.api import Annotation, Output, annotator


@annotator("Convert every word to uppercase.")
def uppercase(word: Annotation = Annotation("<token:word>"),
              out: Output = Output("<token>:custom.convert.upper")):
    """Convert to uppercase."""
    # Read the word of every token, uppercase it, and write the values to the output annotation
    out.write([val.upper() for val in word.read()])
Step 2: Register your custom annotator in your corpus configuration file under the custom_annotations section. This allows Sparv to locate and use it. The name of your annotator is constructed by concatenating the following four parts:
- the prefix custom.
- the filename of the Python file without extension (convert in our example)
- a colon
- the annotator name (uppercase)
custom_annotations:
    - annotator: custom.convert:uppercase
Step 3: Use the annotation created by your custom annotator by adding the output annotation name (in this example specified by the out parameter value) to an annotations list in your corpus config:
export:
    annotations:
        - <token>:custom.convert.upper
In this example, all parameters in the annotator function have default values, so you do not need to supply any parameter values in your config. However, you can override the default values if needed:
custom_annotations:
    - annotator: custom.convert:uppercase
      params:
          out: <token>:custom.convert.myUppercaseAnnotation
Note
When using custom annotations from your own code, all output annotations must be prefixed with custom.
An example of a user-defined custom annotator can be found in the standard-swe example corpus.
For more information on writing an annotator function, refer to the developer's guide. If you have written a general annotator module, consider making it into a Sparv plugin so others can use it. Read more about writing plugins in the developer's guide.