Sparv Decorators

Sparv is built around a modular pipeline of processors, each responsible for a specific task in the corpus processing workflow. Processors are implemented as Python functions and registered using Sparv's decorators. These decorators attach metadata to each function—such as a description and configuration options—and register the function within the pipeline.
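The registration idea described above can be illustrated with a minimal, self-contained sketch. It does not use Sparv and is not Sparv's implementation: `REGISTRY`, `ProcessorInfo`, and `register` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProcessorInfo:
    """Metadata stored for one registered processor (hypothetical structure)."""

    function: Callable
    description: str
    config: list


# Hypothetical registry mapping processor names to their metadata.
REGISTRY: dict = {}


def register(description: str, name: Optional[str] = None, config: Optional[list] = None):
    """Return a decorator that records metadata and returns the function unchanged."""

    def decorator(func: Callable) -> Callable:
        REGISTRY[name or func.__name__] = ProcessorInfo(func, description, config or [])
        return func

    return decorator


@register("Convert tokens to lowercase")
def lowercase(tokens: list) -> list:
    return [token.lower() for token in tokens]
```

After the module is imported, `REGISTRY["lowercase"]` holds the description and a reference to the function, while `lowercase` itself can still be called as usual.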

Sparv provides several types of processors: annotators, importers, exporters, installers, uninstallers, model builders, and wizards. Each processor type has a corresponding decorator, described in detail below.

For all decorators except @wizard, the first argument is description, a required string explaining what the function does. It is the only required argument for every decorator except @importer, which also requires file_extension. The description is displayed in CLI help texts. Its first line should be a short summary, usually one sentence long. Optionally, a longer description can be added below the first line, separated by a blank line.
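As a sketch of the description layout, the following splits a description into its summary line and its optional longer part. The helper `split_description` and the sample text are hypothetical, and the sketch assumes the separating blank line is completely empty.

```python
def split_description(description: str) -> tuple:
    """Split a description into (summary, long description)."""
    # Everything before the first blank line is the summary.
    summary, _, rest = description.strip().partition("\n\n")
    return summary.strip(), rest.strip()


DESCRIPTION = """Lowercase all tokens.

This longer part explains details that do not fit in the summary.
"""

summary, details = split_description(DESCRIPTION)
```

A description without a blank line yields an empty string as its long part.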

Builds a registry of all available annotator functions in Sparv modules.

annotator

annotator(
    description,
    name=None,
    language=None,
    config=None,
    priority=None,
    order=None,
    wildcards=None,
    preloader=None,
    preloader_params=None,
    preloader_target=None,
    preloader_cleanup=None,
    preloader_shared=True,
)

Decorate a function to register it as an annotator.

An annotator is a function that processes input data and generates new annotations.

PARAMETER	DESCRIPTION
`description`	A description of the annotator, used for displaying help texts in the CLI. The first line should be a short summary of what the annotator does. Optionally, a longer description can be added below the first line, separated by a blank line. TYPE: `str`
`name`	Optional name to use instead of the function name. TYPE: `str \| None` DEFAULT: `None`
`language`	A list of supported languages. If no list is provided, all languages are supported. TYPE: `list[str] \| None` DEFAULT: `None`
`config`	A list of `Config` instances defining configuration parameters for the annotator. TYPE: `list[Config] \| None` DEFAULT: `None`
`priority`	Functions with higher priority (higher number) will be preferred when scheduling which functions to run. The default priority is 0. TYPE: `int \| None` DEFAULT: `None`
`order`	If multiple annotators produce the same output, this integer value helps determine which to try to use first. A lower number indicates higher priority. TYPE: `int \| None` DEFAULT: `None`
`wildcards`	A list of wildcards used in the annotator function's arguments. TYPE: `list[Wildcard] \| None` DEFAULT: `None`
`preloader`	A reference to a preloader function, used to preload models or processes. TYPE: `Callable \| None` DEFAULT: `None`
`preloader_params`	A list of names of parameters for the annotator, which will be used as arguments for the preloader. TYPE: `list[str] \| None` DEFAULT: `None`
`preloader_target`	The name of the annotator parameter that should receive the return value of the preloader. TYPE: `str \| None` DEFAULT: `None`
`preloader_cleanup`	A reference to an optional cleanup function, which will be executed after each annotator use. TYPE: `Callable \| None` DEFAULT: `None`
`preloader_shared`	Set to `False` if the preloader result should not be shared among preloader processes. TYPE: `bool` DEFAULT: `True`
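The relationship between `preloader`, `preloader_params`, and `preloader_target` can be sketched as plain data flow. This is only an illustration under stated assumptions: it runs in a single process, ignores `preloader_cleanup` and `preloader_shared`, and the helper `call_with_preloaded`, the `cache` dictionary, and the toy functions are hypothetical names that are not part of Sparv.

```python
from typing import Any, Callable


def call_with_preloaded(
    annotator_func: Callable[..., Any],
    kwargs: dict,
    preloader: Callable[..., Any],
    preloader_params: list,
    preloader_target: str,
    cache: dict,
) -> Any:
    """Call annotator_func, supplying one argument from a cached preloader result."""
    if preloader_target not in cache:
        # The preloader receives the annotator arguments named in preloader_params.
        cache[preloader_target] = preloader(**{p: kwargs[p] for p in preloader_params})
    # The preloader's return value is passed as the preloader_target argument.
    call_kwargs = {**kwargs, preloader_target: cache[preloader_target]}
    return annotator_func(**call_kwargs)


load_count = 0


def load_lexicon(model_path: str) -> dict:
    """Hypothetical expensive loader that counts how often it runs."""
    global load_count
    load_count += 1
    return {"dogs": "dog", "ran": "run"}


def lemmatize(words: list, model_path: str, lexicon: Any = None) -> list:
    """Toy annotator that loads the lexicon itself when none is preloaded."""
    if lexicon is None:
        lexicon = load_lexicon(model_path)
    return [lexicon.get(word, word) for word in words]


cache: dict = {}
results = [
    call_with_preloaded(
        lemmatize,
        {"words": batch, "model_path": "lexicon.bin"},
        load_lexicon,
        ["model_path"],
        "lexicon",
        cache,
    )
    for batch in (["dogs", "ran"], ["cats"])
]
```

The loader runs once even though the annotator is called twice, which is the point of preloading a model or process.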

Example:

@annotator(
    "Part-of-speech tags and baseforms from TreeTagger",
    language=["bul", "est", "fin", "lat", "nld", "pol", "ron", "slk", "deu", "eng", "fra", "spa", "ita", "rus"],
    config=[
        Config("treetagger.binary", "tree-tagger", description="TreeTagger executable"),
        Config("treetagger.model", "treetagger/[metadata.language].par", description="Path to TreeTagger model"),
    ],
)
def annotate(
    lang: Language = Language(),
    model: Model = Model("[treetagger.model]"),
    tt_binary: Binary = Binary("[treetagger.binary]"),
    out_upos: Output = Output("<token>:treetagger.upos", cls="token:upos", description="Part-of-speeches in UD"),
    out_pos: Output = Output("<token>:treetagger.pos", cls="token:pos", description="Part-of-speeches from TreeTagger"),
    out_baseform: Output = Output("<token>:treetagger.baseform", description="Baseforms from TreeTagger"),
    word: Annotation = Annotation("<token:word>"),
    sentence: Annotation = Annotation("<sentence>"),
):
    ...

importer

importer(
    description,
    file_extension,
    name=None,
    outputs=None,
    text_annotation=None,
    structure=None,
    config=None,
)

Decorate a function to register it as an importer.

An importer is a function that is responsible for importing corpus files of a specific file format. Its task is to read a corpus file, extract the corpus text and any existing markup (if applicable), and write annotation files for the corpus text and markup.

Importers do not use the Output class to specify their outputs. Instead, outputs are listed using the outputs argument of the decorator. Any output that needs to be used as explicit input by another part of the pipeline must be listed here, although additional unlisted outputs may also be created.

Two outputs are implicit (and thus not listed in outputs) but required for every importer: the corpus text, saved using the Text class, and a list of the annotations created from existing markup, saved using the SourceStructure class.
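The core work of a plain-text importer, producing the corpus text plus a list of annotations found in the source, might look roughly like the sketch below. It deliberately avoids Sparv's classes: `parse_txt` is a hypothetical helper that returns the two values, whereas a real importer would save them using the Text and SourceStructure classes.

```python
import unicodedata


def parse_txt(raw: bytes, encoding: str = "utf-8", normalize: str = "NFC") -> tuple:
    """Decode a plain-text source and return (corpus text, annotation names)."""
    text = raw.decode(encoding)
    if normalize:
        # Unicode normalization makes equivalent character sequences identical.
        text = unicodedata.normalize(normalize, text)
    # Plain text has no markup, so only a whole-document annotation is reported.
    structure = ["text"]
    return text, structure


# "e" followed by a combining acute accent is composed into one character by NFC.
text, structure = parse_txt("cafe\u0301".encode("utf-8"))
```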

PARAMETER	DESCRIPTION
`description`	A description of the importer, used for displaying help texts in the CLI. The first line should be a short summary of what the importer does. Optionally, a longer description can be added below the first line, separated by a blank line. TYPE: `str`
`file_extension`	The file extension of the type of source this importer handles, e.g. "xml" or "txt". TYPE: `str`
`name`	An optional name to use instead of the function name. TYPE: `str \| None` DEFAULT: `None`
`outputs`	A list specifying the annotations and attributes that the importer is guaranteed to generate. This list can include annotation names directly, or one or more `Config` instances that refer to such lists or single annotations. Alternatively, `outputs` can point to a single `Config` instance that refers to such a list. TYPE: `list[str \| Config] \| Config \| None` DEFAULT: `None`
`text_annotation`	An annotation from `outputs` that should be used as the value for the `import.text_annotation` config variable, unless it or `classes.text` has been set manually. TYPE: `str \| None` DEFAULT: `None`
`structure`	A class used to parse and return the structure of source files. TYPE: `type[SourceStructureParser] \| None` DEFAULT: `None`
`config`	A list of `Config` instances defining config parameters for the importer. TYPE: `list[Config] \| None` DEFAULT: `None`

Example:

@importer("TXT import", file_extension="txt", outputs=["text"])
def parse(
    source_file: SourceFilename = SourceFilename(),
    source_dir: Source = Source(),
    prefix: str = "",
    encoding: str = util.constants.UTF8,
    normalize: str = "NFC",
) -> None:
    ...

exporter

exporter(
    description,
    name=None,
    config=None,
    language=None,
    priority=None,
    order=None,
    abstract=False,
)

Decorate a function to register it as an exporter.

An exporter is a function that is responsible for generating final outputs, in Sparv referred to as exports. These outputs typically combine information from multiple annotations into a single file. The output produced by an exporter is generally not used as input for any other module. An export can consist of any kind of data, such as a frequency list, XML files, or a database dump. It can create one file per source file, combine information from all source files into a single output file, or follow any other structure as needed.

PARAMETER	DESCRIPTION
`description`	A description of the exporter, used for displaying help texts in the CLI. The first line should be a short summary of what the exporter does. Optionally, a longer description can be added below the first line, separated by a blank line. TYPE: `str`
`name`	An optional name to use instead of the function name. TYPE: `str \| None` DEFAULT: `None`
`config`	A list of `Config` instances defining config parameters for the exporter. TYPE: `list[Config] \| None` DEFAULT: `None`
`language`	A list of supported languages. If no list is provided, all languages are supported. TYPE: `list[str] \| None` DEFAULT: `None`
`priority`	Functions with higher priority (higher number) will be preferred when scheduling which functions to run. The default priority is 0. TYPE: `int \| None` DEFAULT: `None`
`order`	If several exporters produce the same output, this integer value will help decide which to try to use first. A lower number indicates higher priority. TYPE: `int \| None` DEFAULT: `None`
`abstract`	Set to `True` if this exporter does not produce any output files itself, but instead triggers other processors to produce their output files by using their output as input. TYPE: `bool` DEFAULT: `False`
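The effect of `order` can be sketched as a simple sort in which the lowest number is tried first. The helper `sort_by_order` and the processor names are hypothetical, and placing processors without an `order` value last is an assumption made for this illustration, not documented Sparv behavior.

```python
from typing import Optional


def sort_by_order(candidates: dict) -> list:
    """Return processor names so that the lowest order value comes first."""

    def key(name: str) -> tuple:
        order: Optional[int] = candidates[name]
        # Assumption: processors without an order value are tried last.
        return (order is None, order or 0)

    return sorted(candidates, key=key)


# Three hypothetical exporters that all produce the same output.
ranking = sort_by_order({"freq_list": 1, "freq_list_simple": 2, "freq_list_custom": None})
```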

Example:

@exporter(
    "Corpus word frequency list (without Swedish annotations)",
    order=2,
    config=[
        Config("stats_export.delimiter", default="\t", description="Delimiter separating columns"),
        Config(
            "stats_export.cutoff",
            default=1,
            description="The minimum frequency a word must have in order to be included in the result",
        ),
    ],
)
def freq_list_simple(
    corpus: Corpus = Corpus(),
    source_files: AllSourceFilenames = AllSourceFilenames(),
    word: AnnotationAllSourceFiles = AnnotationAllSourceFiles("<token:word>"),
    pos: AnnotationAllSourceFiles = AnnotationAllSourceFiles("<token:pos>"),
    baseform: AnnotationAllSourceFiles = AnnotationAllSourceFiles("<token:baseform>"),
    out: Export = Export("stats_export.frequency_list/stats_[metadata.id].csv"),
    delimiter: str = Config("stats_export.delimiter"),
    cutoff: int = Config("stats_export.cutoff"),
):
    ...
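What the body of such an exporter computes can be sketched without Sparv. The function `frequency_list` below is a hypothetical stand-in that takes a plain list of tokens and returns the file content as a string; a real exporter would read the annotations and write to the export path instead. Only the `delimiter` and `cutoff` options mirror the config parameters in the example above.

```python
import csv
import io
from collections import Counter


def frequency_list(words: list, delimiter: str = "\t", cutoff: int = 1) -> str:
    """Return a delimited frequency list of the tokens occurring at least cutoff times."""
    counts = Counter(words)
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["token", "count"])
    # most_common() yields tokens from most to least frequent.
    for token, count in counts.most_common():
        if count >= cutoff:
            writer.writerow([token, count])
    return buffer.getvalue()


table = frequency_list(["the", "cat", "the", "mat"], cutoff=2)
```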

installer

installer(
    description,
    name=None,
    config=None,
    language=None,
    priority=None,
    uninstaller=None,
)

Decorate a function to register it as an installer.

An installer is a function that is responsible for deploying the corpus or related files to a remote location. For example, it can copy XML output to a web server or insert SQL data into a database.

Every installer must create a marker of the type OutputMarker at the end of a successful installation. To create the marker, call its write() method.

It is recommended that an installer removes any related uninstaller's marker to enable uninstallation. Use the MarkerOptional class to refer to the uninstaller's marker without triggering an unnecessary installation.
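The interplay between installer and uninstaller markers can be sketched with empty files. The `Marker` class below is a hypothetical stand-in for Sparv's marker classes: only `write()` is named in the text above, while `remove()` and `exists()` are assumed here for the sake of the illustration.

```python
import tempfile
from pathlib import Path


class Marker:
    """Hypothetical marker: an empty file whose existence records a successful run."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def write(self) -> None:
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.touch()

    def remove(self) -> None:
        self.path.unlink(missing_ok=True)

    def exists(self) -> bool:
        return self.path.exists()


def install(out: Marker, uninstall_marker: Marker) -> None:
    # The actual deployment to the remote location would happen here.
    uninstall_marker.remove()  # Allow a later uninstallation to run.
    out.write()  # Record that the installation succeeded.


def uninstall(out: Marker, install_marker: Marker) -> None:
    # The actual removal from the remote location would happen here.
    install_marker.remove()  # Allow a later re-installation to run.
    out.write()  # Record that the uninstallation succeeded.


workdir = Path(tempfile.mkdtemp())
installed = Marker(workdir / "install_marker")
uninstalled = Marker(workdir / "uninstall_marker")

install(installed, uninstalled)
state_after_install = (installed.exists(), uninstalled.exists())

uninstall(uninstalled, installed)
state_after_uninstall = (installed.exists(), uninstalled.exists())
```

At any time at most one of the two markers exists, so the pipeline can tell which of the two actions still has work to do.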

PARAMETER	DESCRIPTION
`description`	A description of the installer, used for displaying help texts in the CLI. The first line should be a short summary of what the installer does. Optionally, a longer description can be added below the first line, separated by a blank line. TYPE: `str`
`name`	An optional name to use instead of the function name. TYPE: `str \| None` DEFAULT: `None`
`config`	A list of `Config` instances defining config parameters for the installer. TYPE: `list[Config] \| None` DEFAULT: `None`
`language`	A list of supported languages. If no list is provided, all languages are supported. TYPE: `list[str] \| None` DEFAULT: `None`
`priority`	Functions with higher priority (higher number) will be preferred when scheduling which functions to run. The default priority is 0. TYPE: `int \| None` DEFAULT: `None`
`uninstaller`	The name of the related uninstaller. TYPE: `str \| None` DEFAULT: `None`

Example:

@installer(
    "Copy compressed XML to remote host",
    config=[
        Config("xml_export.export_host", description="Remote host to copy XML export to."),
        Config("xml_export.export_path", description="Path on remote host to copy XML export to."),
    ],
    uninstaller="xml_export:uninstall"
)
def install(
    corpus: Corpus = Corpus(),
    xmlfile: ExportInput = ExportInput("xml_export.combined/[metadata.id].xml.bz2"),
    out: OutputMarker = OutputMarker("xml_export.install_export_pretty_marker"),
    uninstall_marker: MarkerOptional = MarkerOptional("xml_export.uninstall_export_pretty_marker"),
    export_path: str = Config("xml_export.export_path"),
    host: str | None = Config("xml_export.export_host"),
):
    ...

uninstaller

uninstaller(
    description,
    name=None,
    config=None,
    language=None,
    priority=None,
)

Decorate a function to register it as an uninstaller.

An uninstaller is a function that undoes the actions performed by an installer, such as removing corpus files from a remote location or deleting corpus data from a database.

Every uninstaller must create a marker of the type OutputMarker at the end of a successful uninstallation. To create the marker, call its write() method.

It is recommended that an uninstaller removes any related installer's marker to enable re-installation. Use the MarkerOptional class to refer to the installer's marker without triggering an unnecessary installation.

PARAMETER	DESCRIPTION
`description`	A description of the uninstaller, used for displaying help texts in the CLI. The first line should be a short summary of what the uninstaller does. Optionally, a longer description can be added below the first line, separated by a blank line. TYPE: `str`
`name`	An optional name to use instead of the function name. TYPE: `str \| None` DEFAULT: `None`
`config`	A list of `Config` instances defining config parameters for the uninstaller. TYPE: `list[Config] \| None` DEFAULT: `None`
`language`	A list of supported languages. If no list is provided, all languages are supported. TYPE: `list[str] \| None` DEFAULT: `None`
`priority`	Functions with higher priority (higher number) will be preferred when scheduling which functions to run. The default priority is 0. TYPE: `int \| None` DEFAULT: `None`

Example:

@uninstaller(
    "Remove compressed XML from remote host",
    config=[
        Config("xml_export.export_host", description="Remote host to remove XML export from."),
        Config("xml_export.export_path", description="Path on remote host to remove XML export from."),
    ],
)
def uninstall(
    corpus: Corpus = Corpus(),
    xmlfile: ExportInput = ExportInput("xml_export.combined/[metadata.id].xml.bz2"),
    out: OutputMarker = OutputMarker("xml_export.uninstall_export_pretty_marker"),
    install_marker: MarkerOptional = MarkerOptional("xml_export.install_export_pretty_marker"),
    export_path: str = Config("xml_export.export_path"),
    host: str | None = Config("xml_export.export_host"),
):
    ...

modelbuilder

modelbuilder(
    description,
    name=None,
    config=None,
    language=None,
    priority=None,
    order=None,
)

Decorate a function to register it as a model builder.

A model builder is a function that sets up one or more models that other Sparv processors (typically annotators) rely on. Setting up a model might involve tasks such as downloading a file, unzipping it, converting it to a different format, and saving it in Sparv's data directory. Models are generally not specific to a single corpus; once a model is set up on your system, it will be available for any corpus.
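The convert-and-save step of a model builder might be sketched as follows. Everything here is hypothetical: the function `build_lexicon_model`, the tab-separated input format, and the output path. A temporary directory stands in for Sparv's data directory, and skipping an existing file is an assumption used to illustrate that a model only needs to be set up once.

```python
import pickle
import tempfile
from pathlib import Path


def build_lexicon_model(source_lines: list, out_path: Path) -> Path:
    """Convert 'word<TAB>score' lines into a pickled dict unless it already exists."""
    if out_path.exists():
        # The model was set up earlier and can be reused as-is.
        return out_path
    lexicon = {}
    for line in source_lines:
        word, score = line.rstrip("\n").split("\t")
        lexicon[word] = float(score)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("wb") as model_file:
        pickle.dump(lexicon, model_file)
    return out_path


model_dir = Path(tempfile.mkdtemp())
model_path = build_lexicon_model(["good\t1", "bad\t-1"], model_dir / "lexicon" / "lexicon.pickle")

with model_path.open("rb") as model_file:
    model = pickle.load(model_file)
```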

PARAMETER	DESCRIPTION
`description`	A description of the model builder, used for displaying help texts in the CLI. The first line should be a short summary of what the model builder does. Optionally, a longer description can be added below the first line, separated by a blank line. TYPE: `str`
`name`	An optional name to use instead of the function name. TYPE: `str \| None` DEFAULT: `None`
`config`	A list of `Config` instances defining config parameters for the model builder. TYPE: `list[Config] \| None` DEFAULT: `None`
`language`	A list of supported languages. If no list is provided, all languages are supported. TYPE: `list[str] \| None` DEFAULT: `None`
`priority`	Functions with higher priority (higher number) will be preferred when scheduling which functions to run. The default priority is 0. TYPE: `int \| None` DEFAULT: `None`
`order`	If several model builders have the same output, this integer value will help decide which to try to use first. A lower number indicates higher priority. TYPE: `int \| None` DEFAULT: `None`

Example:

@modelbuilder("Sentiment model (SenSALDO)", language=["swe"])
def build_model(out: ModelOutput = ModelOutput("sensaldo/sensaldo.pickle")):
    ...

wizard

wizard(config_keys, source_structure=False)

Decorate a function to register it as a wizard.

A wizard is a function that is used to generate questions for the corpus config wizard.

Note

The wizard functionality is deprecated and will be removed in a future version of Sparv.

PARAMETER	DESCRIPTION
`config_keys`	A list of config keys to be set or changed by the decorated function. TYPE: `list[str]`
`source_structure`	Set to `True` if the decorated function needs access to a `SourceStructureParser` instance (holding information on the structure of the source files). TYPE: `bool` DEFAULT: `False`

Example:

@wizard(["export.source_annotations"], source_structure=True)
def import_wizard(answers, structure: SourceStructureParser):
    ...