Sparv Decorators

Sparv is built around a modular pipeline of processors, each responsible for a specific task in the corpus processing workflow. Processors are implemented as Python functions and registered using Sparv's decorators. These decorators attach metadata to each function—such as a description and configuration options—and register the function within the pipeline.
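The registration idea can be illustrated with a minimal, self-contained sketch. It is not Sparv's implementation: the `REGISTRY` dictionary, the `processor` decorator, and the `lowercase` function are hypothetical names chosen only to show how a decorator can attach metadata to a function and record it in a registry.

```python
from typing import Callable, Optional

# Hypothetical stand-in registry; Sparv's real registry is more elaborate.
REGISTRY: dict = {}


def processor(description: str, name: Optional[str] = None, config: Optional[list] = None) -> Callable:
    """Return a decorator that records metadata and registers the function."""

    def decorator(func: Callable) -> Callable:
        REGISTRY[name or func.__name__] = {
            "function": func,
            "description": description,
            "config": config or [],
        }
        return func  # the decorated function itself is returned unchanged

    return decorator


@processor("Convert tokens to lowercase")
def lowercase(tokens: list) -> list:
    return [token.lower() for token in tokens]
```

After the module is imported, `REGISTRY["lowercase"]` holds the function together with its description, while `lowercase` can still be called like any ordinary function.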

Sparv provides several types of processors: annotators, importers, exporters, installers, uninstallers, model builders, and wizards. Each processor type has a corresponding decorator, described in detail below.

For all decorators except @wizard, description is a required argument: a string explaining what the function does. (@importer additionally requires file_extension.) This description is displayed in CLI help texts. The first line of the description should be a short summary, usually one sentence long. Optionally, a longer description can be added below the first line, separated by a blank line.
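To make the summary/long-description convention concrete, here is a small sketch of how such a string could be split at the first blank line. The helper `split_description` and the sample text are hypothetical; how Sparv itself processes the description is not shown here.

```python
def split_description(description: str) -> tuple:
    """Split a description into (summary, details) at the first blank line."""
    summary, _, details = description.strip().partition("\n\n")
    # Collapse internal whitespace so a wrapped summary becomes a single line.
    return " ".join(summary.split()), details.strip()


DESCRIPTION = """Part-of-speech tags from a hypothetical tagger.

The tagger reads one sentence at a time and writes one tag per token.
"""

summary, details = split_description(DESCRIPTION)
```

A description without a blank line yields an empty `details` string, which matches the case where only the short summary is given.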

Builds a registry of all available annotator functions in Sparv modules.

annotator

annotator(
    description,
    name=None,
    language=None,
    config=None,
    priority=None,
    order=None,
    wildcards=None,
    preloader=None,
    preloader_params=None,
    preloader_target=None,
    preloader_cleanup=None,
    preloader_shared=True,
)

Decorate a function to register it as an annotator.

An annotator is a function that processes input data and generates new annotations.

PARAMETER DESCRIPTION
description

A description of the annotator, used for displaying help texts in the CLI. The first line should be a short summary of what the annotator does. Optionally, a longer description can be added below the first line, separated by a blank line.

TYPE: str

name

An optional name to use instead of the function name.

TYPE: str | None DEFAULT: None

language

A list of supported languages. If no list is provided, all languages are supported.

TYPE: list[str] | None DEFAULT: None

config

A list of Config instances defining configuration parameters for the annotator.

TYPE: list[Config] | None DEFAULT: None

priority

Functions with higher priority (higher number) will be preferred when scheduling which functions to run. The default priority is 0.

TYPE: int | None DEFAULT: None

order

If multiple annotators produce the same output, this integer value helps determine which to try to use first. A lower number indicates higher priority.

TYPE: int | None DEFAULT: None

wildcards

A list of wildcards used in the annotator function's arguments.

TYPE: list[Wildcard] | None DEFAULT: None

preloader

A reference to a preloader function, used to preload models or processes.

TYPE: Callable | None DEFAULT: None

preloader_params

A list of names of parameters for the annotator, which will be used as arguments for the preloader.

TYPE: list[str] | None DEFAULT: None

preloader_target

The name of the annotator parameter that should receive the return value of the preloader.

TYPE: str | None DEFAULT: None

preloader_cleanup

A reference to an optional cleanup function, which will be executed after each annotator use.

TYPE: Callable | None DEFAULT: None

preloader_shared

Set to False if the preloader result should not be shared among preloader processes.

TYPE: bool DEFAULT: True
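The way the preloader parameters relate to each other can be sketched as follows. This is a simplified stand-in, not Sparv's preloader machinery: the `Preloaded` wrapper, `load_model`, `tag`, and the signature used for the cleanup function are all assumptions made for illustration.

```python
from typing import Any, Callable, Optional

load_calls: list = []  # records every (expensive) model load


def load_model(model_path: str) -> dict:
    """Hypothetical preloader: pretend to load an expensive model."""
    load_calls.append(model_path)
    return {"tags": {"dog": "NOUN", "runs": "VERB"}}


def tag(words: list, model_path: str, loaded_model: Optional[dict] = None) -> list:
    """Hypothetical annotator: loads the model itself if none was preloaded."""
    model = loaded_model if loaded_model is not None else load_model(model_path)
    return [model["tags"].get(word, "X") for word in words]


class Preloaded:
    """Stand-in for the wiring described by the preloader_* parameters."""

    def __init__(self, annotator: Callable, preloader: Callable, preloader_params: list,
                 preloader_target: str, preloader_cleanup: Optional[Callable] = None) -> None:
        self.annotator = annotator
        self.preloader = preloader
        self.preloader_params = preloader_params
        self.preloader_target = preloader_target
        self.preloader_cleanup = preloader_cleanup
        self._loaded: Any = None

    def __call__(self, **kwargs: Any) -> Any:
        if self._loaded is None:
            # preloader_params: annotator parameters forwarded to the preloader
            self._loaded = self.preloader(**{p: kwargs[p] for p in self.preloader_params})
        # preloader_target: annotator parameter that receives the preloaded value
        kwargs[self.preloader_target] = self._loaded
        try:
            return self.annotator(**kwargs)
        finally:
            if self.preloader_cleanup is not None:
                # Assumed signature: the cleanup hook receives the preloaded value.
                self.preloader_cleanup(self._loaded)


tagger = Preloaded(tag, load_model, ["model_path"], "loaded_model")
first = tagger(words=["dog", "runs"], model_path="model.bin")
second = tagger(words=["cat"], model_path="model.bin")
```

The model is loaded once and reused for the second call, which is the benefit a preloader is meant to provide.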

Example:

@annotator(
    "Part-of-speech tags and baseforms from TreeTagger",
    language=["bul", "est", "fin", "lat", "nld", "pol", "ron", "slk", "deu", "eng", "fra", "spa", "ita", "rus"],
    config=[
        Config("treetagger.binary", "tree-tagger", description="TreeTagger executable"),
        Config("treetagger.model", "treetagger/[metadata.language].par", description="Path to TreeTagger model"),
    ],
)
def annotate(
    lang: Language = Language(),
    model: Model = Model("[treetagger.model]"),
    tt_binary: Binary = Binary("[treetagger.binary]"),
    out_upos: Output = Output("<token>:treetagger.upos", cls="token:upos", description="Part-of-speech tags in UD"),
    out_pos: Output = Output("<token>:treetagger.pos", cls="token:pos", description="Part-of-speech tags from TreeTagger"),
    out_baseform: Output = Output("<token>:treetagger.baseform", description="Baseforms from TreeTagger"),
    word: Annotation = Annotation("<token:word>"),
    sentence: Annotation = Annotation("<sentence>"),
):
    ...
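When several annotators produce the same output, the order parameter decides which one is tried first. The sketch below shows one way such a choice could be made. The `Candidate` type, the candidate names, and the rule that an unset order sorts last are assumptions for illustration, not documented Sparv behaviour.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    name: str              # hypothetical "module:function" identifier
    order: Optional[int]   # value of the decorator's order argument


def pick_first(candidates: list) -> Candidate:
    """Return the candidate with the lowest order; None sorts last (an assumption)."""
    return min(candidates, key=lambda c: (c.order is None, c.order or 0))


candidates = [
    Candidate("tagger_a:annotate", 2),
    Candidate("tagger_b:annotate", 1),
    Candidate("tagger_c:annotate", None),
]
chosen = pick_first(candidates)
```

Here `tagger_b:annotate` is chosen because a lower order number indicates higher priority.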

importer

importer(
    description,
    file_extension,
    name=None,
    outputs=None,
    text_annotation=None,
    structure=None,
    config=None,
)

Decorate a function to register it as an importer.

An importer is a function that is responsible for importing corpus files of a specific file format. Its task is to read a corpus file, extract the corpus text and any existing markup (if applicable), and write annotation files for the corpus text and markup.

Importers do not use the Output class to specify their outputs. Instead, outputs are listed using the outputs argument of the decorator. Any output that needs to be used as explicit input by another part of the pipeline must be listed here, although additional unlisted outputs may also be created.

Two outputs are implicit (and thus not listed in outputs) but required for every importer: the corpus text, saved using the Text class, and a list of the annotations created from existing markup, saved using the SourceStructure class.

PARAMETER DESCRIPTION
description

A description of the importer, used for displaying help texts in the CLI. The first line should be a short summary of what the importer does. Optionally, a longer description can be added below the first line, separated by a blank line.

TYPE: str

file_extension

The file extension of the type of source this importer handles, e.g. "xml" or "txt".

TYPE: str

name

An optional name to use instead of the function name.

TYPE: str | None DEFAULT: None

outputs

A list specifying the annotations and attributes that the importer is guaranteed to generate. This list can include annotation names directly, or one or more Config instances that refer to such lists or single annotations. Alternatively, outputs can point to a single Config instance that refers to such a list.

TYPE: list[str | Config] | Config | None DEFAULT: None

text_annotation

An annotation from 'outputs' that should be used as the value for the import.text_annotation config variable, unless it or classes.text has been set manually.

TYPE: str | None DEFAULT: None

structure

A class used to parse and return the structure of source files.

TYPE: type[SourceStructureParser] | None DEFAULT: None

config

A list of Config instances defining config parameters for the importer.

TYPE: list[Config] | None DEFAULT: None

Example:

@importer("TXT import", file_extension="txt", outputs=["text"])
def parse(
    source_file: SourceFilename = SourceFilename(),
    source_dir: Source = Source(),
    prefix: str = "",
    encoding: str = util.constants.UTF8,
    normalize: str = "NFC",
) -> None:
    ...
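As a rough illustration of the importer's task described above, the following sketch reads a plain-text file, normalizes it, and returns the two things every importer has to provide: the corpus text and the annotations derived from the source. The function `read_source` and its return values are stand-ins; a real importer would save them using the Text and SourceStructure classes instead of returning them.

```python
import tempfile
import unicodedata
from pathlib import Path


def read_source(path: Path, prefix: str = "", encoding: str = "utf-8", normalize: str = "NFC") -> tuple:
    """Return (corpus text, annotations) for a plain-text source file."""
    text = path.read_text(encoding=encoding)
    if normalize:
        text = unicodedata.normalize(normalize, text)
    # A plain-text file has no markup, so the only annotation is one span
    # covering the whole text, optionally prefixed.
    name = f"{prefix}.text" if prefix else "text"
    return text, {name: [(0, len(text))]}


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "doc1.txt"
    # "e" followed by a combining acute accent; NFC merges them into one character.
    source.write_text("Cafe\u0301 au lait", encoding="utf-8")
    text, annotations = read_source(source)
```

The single `text` annotation corresponds to the `outputs=["text"]` argument in the example above.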

exporter

exporter(
    description,
    name=None,
    config=None,
    language=None,
    priority=None,
    order=None,
    abstract=False,
)

Decorate a function to register it as an exporter.

An exporter is a function that is responsible for generating final outputs, in Sparv referred to as exports. These outputs typically combine information from multiple annotations into a single file. The output produced by an exporter is generally not used as input for any other module. An export can consist of any kind of data, such as a frequency list, XML files, or a database dump. It can create one file per source file, combine information from all source files into a single output file, or follow any other structure as needed.

PARAMETER DESCRIPTION
description

A description of the exporter, used for displaying help texts in the CLI. The first line should be a short summary of what the exporter does. Optionally, a longer description can be added below the first line, separated by a blank line.

TYPE: str

name

An optional name to use instead of the function name.

TYPE: str | None DEFAULT: None

config

A list of Config instances defining config parameters for the exporter.

TYPE: list[Config] | None DEFAULT: None

language

A list of supported languages. If no list is provided, all languages are supported.

TYPE: list[str] | None DEFAULT: None

priority

Functions with higher priority (higher number) will be preferred when scheduling which functions to run. The default priority is 0.

TYPE: int | None DEFAULT: None

order

If several exporters produce the same output, this integer value will help decide which to try to use first. A lower number indicates higher priority.

TYPE: int | None DEFAULT: None

abstract

Set to True if this exporter does not produce any output files itself, but instead triggers other processors to produce their output files by using their output as input.

TYPE: bool DEFAULT: False

Example:

@exporter(
    "Corpus word frequency list (without Swedish annotations)",
    order=2,
    config=[
        Config("stats_export.delimiter", default="\t", description="Delimiter separating columns"),
        Config(
            "stats_export.cutoff",
            default=1,
            description="The minimum frequency a word must have in order to be included in the result",
        ),
    ],
)
def freq_list_simple(
    corpus: Corpus = Corpus(),
    source_files: AllSourceFilenames = AllSourceFilenames(),
    word: AnnotationAllSourceFiles = AnnotationAllSourceFiles("<token:word>"),
    pos: AnnotationAllSourceFiles = AnnotationAllSourceFiles("<token:pos>"),
    baseform: AnnotationAllSourceFiles = AnnotationAllSourceFiles("<token:baseform>"),
    out: Export = Export("stats_export.frequency_list/stats_[metadata.id].csv"),
    delimiter: str = Config("stats_export.delimiter"),
    cutoff: int = Config("stats_export.cutoff"),
):
    ...
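The body of such an exporter is not shown, but the roles of the delimiter and cutoff settings can be sketched with the standard library alone. The function `frequency_list` and the sample tokens are hypothetical; the real exporter reads annotations from all source files and writes the result to the export file.

```python
from collections import Counter


def frequency_list(rows: list, delimiter: str = "\t", cutoff: int = 1) -> str:
    """Count identical rows and return delimiter-separated lines, most frequent first."""
    counts = Counter(rows)
    # Sort by descending frequency, then alphabetically for a stable result.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    lines = [
        delimiter.join((*key, str(freq)))
        for key, freq in ordered
        if freq >= cutoff  # drop rows below the minimum frequency
    ]
    return "\n".join(lines)


tokens = [
    ("the", "DT"), ("cat", "NN"), ("saw", "VB"), ("the", "DT"),
    ("dog", "NN"), ("the", "DT"), ("cat", "NN"),
]
result = frequency_list(tokens, delimiter="\t", cutoff=2)
```

With a cutoff of 2, the rows for "saw" and "dog" are left out because each occurs only once.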

installer

installer(
    description,
    name=None,
    config=None,
    language=None,
    priority=None,
    uninstaller=None,
)

Decorate a function to register it as an installer.

An installer is a function that is responsible for deploying the corpus or related files to a remote location. For example, it can copy XML output to a web server or insert SQL data into a database.

Every installer must create a marker of the type OutputMarker at the end of a successful installation. Simply call the write() method on the marker to create the required marker.

It is recommended that an installer removes any related uninstaller's marker to enable uninstallation. Use the MarkerOptional class to refer to the uninstaller's marker without triggering an unnecessary installation.

PARAMETER DESCRIPTION
description

A description of the installer, used for displaying help texts in the CLI. The first line should be a short summary of what the installer does. Optionally, a longer description can be added below the first line, separated by a blank line.

TYPE: str

name

An optional name to use instead of the function name.

TYPE: str | None DEFAULT: None

config

A list of Config instances defining config parameters for the installer.

TYPE: list[Config] | None DEFAULT: None

language

A list of supported languages. If no list is provided, all languages are supported.

TYPE: list[str] | None DEFAULT: None

priority

Functions with higher priority (higher number) will be preferred when scheduling which functions to run. The default priority is 0.

TYPE: int | None DEFAULT: None

uninstaller

The name of the related uninstaller.

TYPE: str | None DEFAULT: None

Example:

@installer(
    "Copy compressed XML to remote host",
    config=[
        Config("xml_export.export_host", description="Remote host to copy XML export to."),
        Config("xml_export.export_path", description="Path on remote host to copy XML export to."),
    ],
    uninstaller="xml_export:uninstall"
)
def install(
    corpus: Corpus = Corpus(),
    xmlfile: ExportInput = ExportInput("xml_export.combined/[metadata.id].xml.bz2"),
    out: OutputMarker = OutputMarker("xml_export.install_export_pretty_marker"),
    uninstall_marker: MarkerOptional = MarkerOptional("xml_export.uninstall_export_pretty_marker"),
    export_path: str = Config("xml_export.export_path"),
    host: str | None = Config("xml_export.export_host"),
):
    ...

uninstaller

uninstaller(
    description,
    name=None,
    config=None,
    language=None,
    priority=None,
)

Decorate a function to register it as an uninstaller.

An uninstaller is a function that undoes the actions performed by an installer, such as removing corpus files from a remote location or deleting corpus data from a database.

Every uninstaller must create a marker of the type OutputMarker at the end of a successful uninstallation. Simply call the write() method on the marker to create the required marker.

It is recommended that an uninstaller removes any related installer's marker to enable re-installation. Use the MarkerOptional class to refer to the installer's marker without triggering an unnecessary installation.

PARAMETER DESCRIPTION
description

A description of the uninstaller, used for displaying help texts in the CLI. The first line should be a short summary of what the uninstaller does. Optionally, a longer description can be added below the first line, separated by a blank line.

TYPE: str

name

An optional name to use instead of the function name.

TYPE: str | None DEFAULT: None

config

A list of Config instances defining config parameters for the uninstaller.

TYPE: list[Config] | None DEFAULT: None

language

A list of supported languages. If no list is provided, all languages are supported.

TYPE: list[str] | None DEFAULT: None

priority

Functions with higher priority (higher number) will be preferred when scheduling which functions to run. The default priority is 0.

TYPE: int | None DEFAULT: None

Example:

@uninstaller(
    "Remove compressed XML from remote host",
    config=[
        Config("xml_export.export_host", description="Remote host to remove XML export from."),
        Config("xml_export.export_path", description="Path on remote host to remove XML export from."),
    ],
)
def uninstall(
    corpus: Corpus = Corpus(),
    xmlfile: ExportInput = ExportInput("xml_export.combined/[metadata.id].xml.bz2"),
    out: OutputMarker = OutputMarker("xml_export.uninstall_export_pretty_marker"),
    install_marker: MarkerOptional = MarkerOptional("xml_export.install_export_pretty_marker"),
    export_path: str = Config("xml_export.export_path"),
    host: str | None = Config("xml_export.export_host"),
):
    ...
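The marker handshake between an installer and its uninstaller can be sketched with a file-based stand-in. The `Marker` class below is hypothetical and only imitates the behaviour described above (writing a marker on success and removing the counterpart's marker); it is not Sparv's OutputMarker or MarkerOptional implementation.

```python
import tempfile
from pathlib import Path


class Marker:
    """File-based stand-in for a marker object."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def write(self) -> None:
        self.path.touch()

    def exists(self) -> bool:
        return self.path.exists()

    def remove(self) -> None:
        self.path.unlink(missing_ok=True)


def install(out: Marker, uninstall_marker: Marker) -> None:
    # ... deploy files to the remote location here ...
    uninstall_marker.remove()  # allow a later uninstallation to run
    out.write()                # signal that the installation succeeded


def uninstall(out: Marker, install_marker: Marker) -> None:
    # ... remove files from the remote location here ...
    install_marker.remove()    # allow a later re-installation to run
    out.write()                # signal that the uninstallation succeeded


with tempfile.TemporaryDirectory() as tmp:
    installed = Marker(Path(tmp) / "install_marker")
    uninstalled = Marker(Path(tmp) / "uninstall_marker")

    install(installed, uninstalled)
    state_after_install = (installed.exists(), uninstalled.exists())

    uninstall(uninstalled, installed)
    state_after_uninstall = (installed.exists(), uninstalled.exists())
```

At any point only one of the two markers exists, so the pipeline can tell whether an installation or an uninstallation is the step that still needs to run.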

modelbuilder

modelbuilder(
    description,
    name=None,
    config=None,
    language=None,
    priority=None,
    order=None,
)

Decorate a function to register it as a model builder.

A model builder is a function that sets up one or more models that other Sparv processors (typically annotators) rely on. Setting up a model might involve tasks such as downloading a file, unzipping it, converting it to a different format, and saving it in Sparv's data directory. Models are generally not specific to a single corpus; once a model is set up on your system, it will be available for any corpus.

PARAMETER DESCRIPTION
description

A description of the model builder, used for displaying help texts in the CLI. The first line should be a short summary of what the model builder does. Optionally, a longer description can be added below the first line, separated by a blank line.

TYPE: str

name

An optional name to use instead of the function name.

TYPE: str | None DEFAULT: None

config

A list of Config instances defining config parameters for the model builder.

TYPE: list[Config] | None DEFAULT: None

language

A list of supported languages. If no list is provided, all languages are supported.

TYPE: list[str] | None DEFAULT: None

priority

Functions with higher priority (higher number) will be preferred when scheduling which functions to run. The default priority is 0.

TYPE: int | None DEFAULT: None

order

If several model builders have the same output, this integer value will help decide which to try to use first. A lower number indicates higher priority.

TYPE: int | None DEFAULT: None

Example:

@modelbuilder("Sentiment model (SenSALDO)", language=["swe"])
def build_model(out: ModelOutput = ModelOutput("sensaldo/sensaldo.pickle")):
    ...
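What a model builder body might do can be sketched as a conversion step: raw data goes in, and a ready-to-load model file comes out. The function below, its tab-separated input format, and the file names are assumptions for illustration; a real model builder would write through the ModelOutput object rather than a plain path.

```python
import pickle
import tempfile
from pathlib import Path


def build_lexicon_model(raw_lexicon: str, out: Path) -> None:
    """Convert a tab-separated word/score lexicon into a pickled dictionary."""
    model = {}
    for line in raw_lexicon.splitlines():
        if not line.strip():
            continue  # skip empty lines
        word, score = line.split("\t")
        model[word] = float(score)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("wb") as f:
        pickle.dump(model, f)


with tempfile.TemporaryDirectory() as data_dir:
    model_path = Path(data_dir) / "example" / "lexicon.pickle"
    build_lexicon_model("good\t1.0\nbad\t-1.0\n", model_path)
    # An annotator would later load the finished model like this.
    with model_path.open("rb") as f:
        loaded = pickle.load(f)
```

Because the result is stored in a data directory rather than in a corpus directory, the same file can serve any corpus that needs the model.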

wizard

wizard(config_keys, source_structure=False)

Decorate a function to register it as a wizard.

A wizard is a function that is used to generate questions for the corpus config wizard.

Note

The wizard functionality is deprecated and will be removed in a future version of Sparv.

PARAMETER DESCRIPTION
config_keys

A list of config keys to be set or changed by the decorated function.

TYPE: list[str]

source_structure

Set to True if the decorated function needs access to a SourceStructureParser instance (holding information on the structure of the source files).

TYPE: bool DEFAULT: False

Example:

@wizard(["export.source_annotations"], source_structure=True)
def import_wizard(answers, structure: SourceStructureParser):
    ...