Utilities

Sparv provides a variety of utility functions, classes, and constants that are useful across different modules. These utilities are primarily imported from sparv.api.util and its submodules. For example:

from sparv.api.util.system import call_binary

Constants

The sparv.api.util.constants module includes several predefined constants that are used throughout the Sparv pipeline:

  • DELIM = "|": Delimiter character used to separate ambiguous results.
  • AFFIX = "|": Character used to enclose results, marking them as a set.
  • SCORESEP = ":": Character that separates an annotation from its score.
  • COMPSEP = "+": Character used to separate parts of a compound.
  • UNDEF = "__UNDEF__": Value representing undefined annotations.
  • OVERLAP_ATTR = "overlap": Name for automatically created overlap attributes.
  • SPARV_DEFAULT_NAMESPACE = "sparv": Default namespace used when annotation names collide and sparv_namespace is not set in the configuration.
  • UTF8 = "UTF-8": UTF-8 encoding.
  • LATIN1 = "ISO-8859-1": Latin-1 encoding.
  • HEADER_CONTENTS = "contents": Name of the annotation containing header contents.
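
As a rough illustration of how the separators combine, the sketch below redefines three of the constants locally (so that it runs without Sparv installed) and formats scored alternatives as a set string. The helper name format_scored_set is hypothetical and not part of the Sparv API.

```python
# Values copied from the list above; in a Sparv module they would be
# imported from sparv.api.util.constants instead of being redefined.
DELIM = "|"
AFFIX = "|"
SCORESEP = ":"


def format_scored_set(alternatives):
    """Join (annotation, score) pairs into a set string (hypothetical helper)."""
    if not alternatives:
        # Assumption: an empty set is written as the bare affix.
        return AFFIX
    items = [f"{value}{SCORESEP}{score}" for value, score in alternatives]
    return AFFIX + DELIM.join(items) + AFFIX


print(format_scored_set([("NN", 0.9), ("VB", 0.1)]))  # |NN:0.9|VB:0.1|
```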

Export Utils

sparv.api.util.export provides utility functions for preparing data for export.

gather_annotations

gather_annotations(
    annotations,
    export_names,
    header_annotations=None,
    source_file=None,
    flatten=True,
    split_overlaps=False,
)

Calculate the span hierarchy and the annotation_dict containing all annotation elements and attributes.

PARAMETER DESCRIPTION
annotations

List of annotations to include.

TYPE: list[Annotation]

export_names

Dictionary that maps from annotation names to export names.

TYPE: dict[str, str]

header_annotations

List of header annotations.

TYPE: list[Annotation] | None DEFAULT: None

source_file

The source filename.

TYPE: str | None DEFAULT: None

flatten

Whether to return the spans as a flat list.

TYPE: bool DEFAULT: True

split_overlaps

Whether to split up overlapping spans.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
tuple[list[tuple], dict[str, dict]]

A spans_dict and an annotation_dict if flatten is True, otherwise span_positions and annotation_dict.

RAISES DESCRIPTION
SparvErrorMessage

If the source file is not found for header annotations.

calculate_element_hierarchy

calculate_element_hierarchy(source_file, spans_list)

Calculate the hierarchy for spans with identical start and end positions.

If two spans A and B have identical start and end positions, go through all occurrences of A and B and check which element is most often parent to the other.

PARAMETER DESCRIPTION
source_file

The source filename.

TYPE: str

spans_list

List of spans to check for hierarchy.

TYPE: list

RETURNS DESCRIPTION
dict[str, dict[str, int]]

A dictionary with the hierarchy of spans.

get_annotation_names

get_annotation_names(
    annotations,
    source_annotations=None,
    source_file=None,
    token_name=None,
    remove_namespaces=False,
    keep_struct_names=False,
    sparv_namespace=None,
    source_namespace=None,
    xml_mode=False,
)

Get a list of annotations, token attributes, and a dictionary translating annotation names to export names.

PARAMETER DESCRIPTION
annotations

List of elements:attributes (annotations) to include, with possible export names.

TYPE: ExportAnnotations | ExportAnnotationsAllSourceFiles | list[tuple[Annotation | AnnotationAllSourceFiles, str | None]]

source_annotations

List of elements:attributes from the source file to include, with possible export names. If not specified, includes everything.

TYPE: SourceAnnotations | SourceAnnotationsAllSourceFiles DEFAULT: None

source_file

Name of the source file.

TYPE: str | None DEFAULT: None

token_name

Name of the token annotation.

TYPE: str | None DEFAULT: None

remove_namespaces

Set to True to remove all namespaces in export_names unless names are ambiguous.

TYPE: bool DEFAULT: False

keep_struct_names

Set to True to include the annotation base name (everything before ":") in export_names for annotations that are not token attributes.

TYPE: bool DEFAULT: False

sparv_namespace

Namespace to add to all Sparv annotations.

TYPE: str | None DEFAULT: None

source_namespace

Namespace to add to all annotations from the source file.

TYPE: str | None DEFAULT: None

xml_mode

Set to True to use XML namespaces in export_names.

TYPE: bool | None DEFAULT: False

RETURNS DESCRIPTION
tuple[list[Annotation | AnnotationAllSourceFiles], list[str], dict[str, str]]

A list of annotations, a list of token attribute names, and a dictionary with translation from annotation names to export names.

get_header_names

get_header_names(header_annotations, xml_namespaces)

Get a list of header annotations and a dictionary for renamed annotations.

PARAMETER DESCRIPTION
header_annotations

List of header annotations from the source file to include. If not specified, includes everything.

TYPE: HeaderAnnotations | None

xml_namespaces

XML namespaces to use for the header annotations.

TYPE: dict[str, str]

RETURNS DESCRIPTION
tuple[list[Annotation], dict[str, str]]

A list of header annotations and a dictionary with translation from annotation names to export names.

scramble_spans

scramble_spans(span_positions, chunk_name, chunk_order)

Reorder spans based on chunk_order and ensure tags are opened and closed correctly.

PARAMETER DESCRIPTION
span_positions

Original span positions, typically obtained from gather_annotations().

TYPE: list[tuple]

chunk_name

Name of the annotation to reorder.

TYPE: str

chunk_order

Annotation specifying the new order of the chunks.

TYPE: Annotation

RETURNS DESCRIPTION
list[tuple]

List of tuples with the new span positions and instructions.

Install/Uninstall Utils

sparv.api.util.install provides functions for installing and uninstalling corpora, either locally or remotely.

install_path

install_path(source_path, host, target_path)

Transfer a file or the contents of a directory to a target destination, optionally on a different host.

PARAMETER DESCRIPTION
source_path

Path to the local file or directory to sync. If a directory is specified, its contents are synced, not the directory itself, and any extraneous files in destination directories are deleted.

TYPE: str | Path

host

Remote host to install to. Set to None to install locally.

TYPE: str | None

target_path

Path to the target file or directory.

TYPE: str | Path

uninstall_path

uninstall_path(path, host=None)

Remove a file or directory, optionally on a different host.

PARAMETER DESCRIPTION
path

Path to the file or directory to remove.

TYPE: str | Path

host

Remote host where the file or directory is located. Set to None to remove locally.

TYPE: str | None DEFAULT: None

install_mysql

install_mysql(host, db_name, sqlfile)

Insert tables and data from one or more SQL files into a local or remote MySQL database.

PARAMETER DESCRIPTION
host

The remote host to install to. Set to None to install locally.

TYPE: str | None

db_name

The name of the database.

TYPE: str

sqlfile

The path to a SQL file, or a list of paths to multiple SQL files.

TYPE: Path | str | list[Path | str]

install_mysql_dump

install_mysql_dump(host, db_name, tables)

Copy selected tables, including their data, from a local MySQL database to a remote one.

PARAMETER DESCRIPTION
host

The remote host to install to.

TYPE: str

db_name

The name of the remote database.

TYPE: str

tables

A table name or a list of table names. If a list is provided, the tables are separated by spaces.

TYPE: str | Iterable[str]

install_svn

install_svn(source_file, svn_url, remove_existing=False)

Check in a file to an SVN repository.

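If the file is already in the repository and remove_existing is True, it is deleted and added again. With the default remove_existing=False, the function can only add new files.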

PARAMETER DESCRIPTION
source_file

The file to check in.

TYPE: str | Path

svn_url

The URL to the SVN repository, including the path to the file.

TYPE: str

remove_existing

If False, this function can only be used to add new files to the repository. If True, existing files will be deleted before the import.

TYPE: bool DEFAULT: False

RAISES DESCRIPTION
SparvErrorMessage

If the source_file does not exist, if svn_url is not set, if it is not possible to list or delete the file in the SVN repository, if remove_existing is set to False and source_file already exists in the repository, or if it is not possible to import the file to the SVN repository.

uninstall_svn

uninstall_svn(svn_url)

Delete a file from an SVN repository.

PARAMETER DESCRIPTION
svn_url

The URL to the SVN repository including the name of the file to remove.

TYPE: str

RAISES DESCRIPTION
SparvErrorMessage

If svn_url is not set, or if deletion fails.

install_git

install_git(source_file, repo_path, commit_message=None)

Copy a file to a local Git repository and make a commit.

PARAMETER DESCRIPTION
source_file

The file to copy.

TYPE: str | Path

repo_path

The path to the local Git repository.

TYPE: str | Path

commit_message

The commit message. If not set, a default message will be used.

TYPE: str | None DEFAULT: None

RAISES DESCRIPTION
SparvErrorMessage

If the source file does not exist, if repo_path is not set, or if it is not possible to add the file to the Git repository.

uninstall_git

uninstall_git(file_path, commit_message=None)

Remove a file from a local Git repository and make a commit.

PARAMETER DESCRIPTION
file_path

The path to the file to remove.

TYPE: str | Path

commit_message

The commit message. If not set, a default message will be used.

TYPE: str | None DEFAULT: None

RAISES DESCRIPTION
SparvErrorMessage

If repo_path is not set, if the file does not exist in the Git repository, or if it is not possible to remove the file from the Git repository.

System Utils

sparv.api.util.system provides functions for managing processes, creating directories, and more.

kill_process

kill_process(process)

Terminate a process, ignoring any errors if the process is already terminated.

PARAMETER DESCRIPTION
process

The process to be terminated.

TYPE: Popen

RAISES DESCRIPTION
OSError

If an error occurs while killing the process.

clear_directory

clear_directory(path)

Create a new empty directory at the given path, and remove its contents if it already exists.

PARAMETER DESCRIPTION
path

The path where the directory should be created.

TYPE: str | Path
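
The described behavior can be approximated with the standard library alone. The following is a minimal sketch, not Sparv's implementation; the name clear_directory_sketch and the temporary paths are made up for the example.

```python
import shutil
import tempfile
from pathlib import Path


def clear_directory_sketch(path):
    """Leave an empty directory at path, whether or not it existed before."""
    path = Path(path)
    if path.exists():
        shutil.rmtree(path)  # drop the old directory and everything in it
    path.mkdir(parents=True)


# Demonstration in a throwaway location.
base = Path(tempfile.mkdtemp())
work = base / "work"
work.mkdir()
(work / "old.txt").write_text("stale", encoding="utf-8")

clear_directory_sketch(work)
print(list(work.iterdir()))  # []
```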

call_java

call_java(
    jar,
    arguments,
    options=(),
    stdin="",
    search_paths=(),
    encoding=None,
    verbose=False,
    return_command=False,
)

Execute a Java program using a specified jar file, command line arguments, and stdin input.

PARAMETER DESCRIPTION
jar

The name of the jar file to execute.

TYPE: str

arguments

A list of arguments to pass to the Java program.

TYPE: list | tuple

options

A list of Java options to include in the call.

TYPE: list | tuple DEFAULT: ()

stdin

Input to pass to the program's stdin.

TYPE: str DEFAULT: ''

search_paths

Additional paths to search for the Java binary, in addition to the environment variable PATH.

TYPE: list | tuple DEFAULT: ()

encoding

The encoding to use for stdin and stdout.

TYPE: str | None DEFAULT: None

verbose

If True, pipe stderr to stderr in the terminal, instead of returning it.

TYPE: bool DEFAULT: False

return_command

If True, return the process instead of stdout and stderr.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
tuple[str, str] | Popen

A tuple with stdout and stderr, or the process if return_command is True. If verbose is True, stderr is an empty string.

call_binary

call_binary(
    name,
    arguments=(),
    stdin="",
    raw_command=None,
    search_paths=(),
    encoding=None,
    verbose=False,
    use_shell=False,
    allow_error=False,
    return_command=False,
)

Call a binary with specified arguments and stdin.

PARAMETER DESCRIPTION
name

The binary to execute (can include absolute or relative path). Accepts a string, a Path, or an iterable of strings or Paths, using the first found binary.

TYPE: str | Path | Iterable[str | Path]

arguments

List of arguments to pass to the binary.

TYPE: list | tuple DEFAULT: ()

stdin

Input to pass to the process's stdin.

TYPE: str | list | tuple DEFAULT: ''

raw_command

A raw command to execute through the shell (implies use_shell=True).

TYPE: str | None DEFAULT: None

search_paths

Additional paths to search for the binary, besides the environment variable PATH.

TYPE: list | tuple DEFAULT: ()

encoding

Encoding to use for stdin and stdout.

TYPE: str | None DEFAULT: None

verbose

If True, pipe stderr to stderr in the terminal, instead of returning it.

TYPE: bool DEFAULT: False

use_shell

If True, executes the command through the shell. Automatically set to True if raw_command is provided.

TYPE: bool DEFAULT: False

allow_error

If False (default), raises an error if the binary returns a non-zero exit code, and logs both stdout and stderr.

TYPE: bool DEFAULT: False

return_command

If True, returns the process instead of stdout and stderr.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
tuple[str, str] | Popen

A tuple with stdout and stderr, or the process if return_command is True. If verbose is True, stderr is an empty string.

RAISES DESCRIPTION
OSError

If an error occurs while calling the binary.
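
To make the stdin, encoding, and allow_error semantics concrete, here is a standard-library sketch of the same calling pattern. It is not Sparv's implementation: run_binary_sketch is a hypothetical name, the exception message is invented, and the child process is simply the current Python interpreter.

```python
import subprocess
import sys


def run_binary_sketch(command, stdin="", encoding="UTF-8", allow_error=False):
    """Run command, feed it stdin, and return decoded (stdout, stderr)."""
    process = subprocess.run(
        command,
        input=stdin.encode(encoding),
        capture_output=True,
        check=False,
    )
    stdout = process.stdout.decode(encoding)
    stderr = process.stderr.decode(encoding)
    if process.returncode != 0 and not allow_error:
        # Assumption for this sketch: a non-zero exit code becomes an OSError.
        raise OSError(f"{command[0]} exited with code {process.returncode}: {stderr}")
    return stdout, stderr


# The child process upper-cases whatever it reads from stdin.
upper_cmd = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper())"]
out, err = run_binary_sketch(upper_cmd, stdin="hej")
print(out.strip())  # HEJ
```

With allow_error=True the sketch returns the captured output even when the exit code is non-zero, which mirrors the parameter description above.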

find_binary

find_binary(
    name,
    search_paths=(),
    executable=True,
    allow_dir=False,
    raise_error=False,
)

Locate the binary for a given program.

PARAMETER DESCRIPTION
name

The name of the binary, either as a string or Path, or an iterable of strings or Paths with alternative names.

TYPE: str | Path | Iterable[str | Path]

search_paths

A list of additional paths to search, besides those in the environment variable PATH.

TYPE: list | tuple DEFAULT: ()

executable

If False, does not fail when the binary is not executable.

TYPE: bool DEFAULT: True

allow_dir

If True, allows the target to be a directory instead of a file.

TYPE: bool DEFAULT: False

raise_error

If True, raises an error if the binary could not be found.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
str | None

The path to the binary, or None if not found.

RAISES DESCRIPTION
SparvErrorMessage

If raise_error is True and the binary could not be found.
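
The lookup of alternative names can be pictured with shutil.which from the standard library. This is a sketch under stated assumptions rather than Sparv's code: find_binary_sketch is a hypothetical name, and the order in which search_paths and PATH are consulted is a choice made for the example.

```python
import os
import shutil
import sys


def find_binary_sketch(names, search_paths=()):
    """Return the path of the first name that resolves to an executable."""
    if isinstance(names, (str, os.PathLike)):
        names = [names]
    # Assumption: extra search paths are tried before the entries in PATH.
    path = os.pathsep.join([*map(str, search_paths), os.environ.get("PATH", "")])
    for name in names:
        found = shutil.which(str(name), path=path)
        if found:
            return found
    return None


# The first name does not exist, so the second alternative is used.
print(find_binary_sketch(["no-such-binary-xyz-12345", sys.executable]))
```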

rsync

rsync(local, host, remote)

Transfer files and directories using rsync.

When syncing a directory, extraneous files in the destination directory are deleted, and it is always the contents of the source directory that are synced, not the directory itself (i.e. the rsync source directory is always suffixed with a slash).

PARAMETER DESCRIPTION
local

The file or directory to transfer.

TYPE: str | Path

host

The remote host to transfer to. Set to None to transfer locally.

TYPE: str | None

remote

The path to the target file or directory.

TYPE: str | Path

remove_path

remove_path(path, host=None)

Remove a file or directory, either locally or remotely.

PARAMETER DESCRIPTION
path

The file or directory to remove.

TYPE: str | Path

host

The remote host to remove from. Leave as None to remove locally.

TYPE: str | None DEFAULT: None

gpus

gpus(reorder=True)

Get a list of available GPUs, sorted by free memory in descending order.

Only works for NVIDIA GPUs, and requires the nvidia-smi utility to be installed.

If reorder is True (default), the GPUs are renumbered according to the order specified in the environment variable CUDA_VISIBLE_DEVICES. For example, if CUDA_VISIBLE_DEVICES=1,0, and the GPUs with most free memory are 0, 1, the function will return [1, 0].

This is needed for PyTorch, which uses the GPU indices as specified in CUDA_VISIBLE_DEVICES, not the actual GPU indices. In the example above, PyTorch would consider GPU 1 as GPU 0 and GPU 0 as GPU 1.

PARAMETER DESCRIPTION
reorder

Whether to renumber the GPUs according to the order in the environment variable CUDA_VISIBLE_DEVICES.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
list[int] | None

A list of GPU indices, or None if no GPUs are available or if the nvidia-smi command failed.
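
The renumbering step can be sketched in isolation. The function below is hypothetical (it does not call nvidia-smi) and takes the already sorted physical indices and the value of CUDA_VISIBLE_DEVICES as plain arguments. Dropping GPUs that are not listed in the variable is an assumption of this sketch.

```python
def renumber_gpus(sorted_physical, cuda_visible_devices):
    """Translate physical GPU indices to their positions in CUDA_VISIBLE_DEVICES."""
    visible = [int(i) for i in cuda_visible_devices.split(",") if i.strip()]
    # A GPU's new number is its position in the visible list.
    return [visible.index(gpu) for gpu in sorted_physical if gpu in visible]


# Physical GPUs 0 and 1 (most free memory first), CUDA_VISIBLE_DEVICES=1,0:
print(renumber_gpus([0, 1], "1,0"))  # [1, 0]
```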

call_svn

call_svn(command, *args)

Call an SVN command.

Will try to authenticate with SVN_USERNAME and SVN_PASSWORD environment variables if set.

PARAMETER DESCRIPTION
command

The SVN command to call.

TYPE: str

*args

Additional arguments for the SVN command.

TYPE: str DEFAULT: ()

RETURNS DESCRIPTION
int

The return code from the command.

RAISES DESCRIPTION
SparvErrorMessage

If the SVN command fails.

FileNotFoundError

If the file is not found in SVN.

Tag Sets

The sparv.api.util.tagsets subpackage includes modules with functions and objects for tag set conversions.

Join a complex tag into a string.

The tag can be a dict {"pos": pos, "msd": msd} or a tuple (pos, msd).

PARAMETER DESCRIPTION
tag

A tag to join.

TYPE: dict | tuple

sep

A separator between parts of the tag in the result.

TYPE: str DEFAULT: TAGSEP

RETURNS DESCRIPTION
str

The joined tag.

tagmappings.join_tag()

Convert a complex SUC or SALDO tag record into a string.

Parameters:

  • tag: The tag to convert, which can be a dictionary ({'pos': pos, 'msd': msd}) or a tuple ((pos, msd)).
  • sep: The separator to use. Default: "."

tagmappings.mappings

Mappings of part-of-speech tags between different tag sets.

pos_to_upos()

Map part-of-speech tags to Universal Dependency part-of-speech tags. This function only works if there is a conversion function in util.tagsets.pos_to_upos for the specified language and tag set.

Parameters:

  • pos: The part-of-speech tag to convert.
  • lang: The language code.
  • tagset: The name of the tag set to which pos belongs.

tagmappings.split_tag()

Split a SUC or SALDO tag string ('X.Y.Z') into a tuple ('X', 'Y.Z'), where 'X' is the part of speech and 'Y', 'Z', etc., are morphological features (i.e., MSD tags).

Parameters:

  • tag: The tag string to split into a tuple.
  • sep: The separator to split on. Default: "."
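
The relationship between join_tag and split_tag described above can be sketched as a pair of inverse helpers. These are illustrative stand-ins with hypothetical names, not the functions from tagmappings. Returning only the part of speech when the MSD part is empty is an assumption.

```python
def split_tag_sketch(tag, sep="."):
    """Split 'X.Y.Z' into ('X', 'Y.Z'); the MSD part is '' if there is no sep."""
    pos, _, msd = tag.partition(sep)
    return pos, msd


def join_tag_sketch(tag, sep="."):
    """Join a {'pos': ..., 'msd': ...} dict or a (pos, msd) tuple into a string."""
    pos, msd = (tag["pos"], tag["msd"]) if isinstance(tag, dict) else tag
    return f"{pos}{sep}{msd}" if msd else pos


print(split_tag_sketch("NN.UTR.SIN"))                      # ('NN', 'UTR.SIN')
print(join_tag_sketch({"pos": "NN", "msd": "UTR.SIN"}))    # NN.UTR.SIN
```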

suc_to_feats()

Convert SUC MSD tags into a CoNLL-U feature list (universal morphological features). Returns a list of universal features.

Parameters:

  • pos: The SUC part-of-speech tag.
  • msd: The SUC MSD tag.
  • delim: The delimiter separating the features in msd. Default: "."

tagmappings.tags

Different sets of part-of-speech tags.

Miscellaneous Utils

sparv.api.util.misc provides miscellaneous utility functions.

PickledLexicon

PickledLexicon(picklefile, verbose=True)

A class for reading a basic pickled lexicon and looking up keys.

PARAMETER DESCRIPTION
picklefile

A pathlib.Path or Model object pointing to the pickled lexicon.

TYPE: Path | Model

verbose

Whether to log status updates while reading the lexicon.

TYPE: bool DEFAULT: True

lookup

lookup(key, default=None)

Look up a key in the lexicon.

PARAMETER DESCRIPTION
key

The key to look up.

TYPE: Any

default

The default value to return if the key is not found.

TYPE: Any DEFAULT: None

RETURNS DESCRIPTION
Any

The value for the key, or the default value if the key is not found.
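
The idea behind the class can be shown with a small stand-in that uses only the standard library. PickledLexiconSketch is a hypothetical name, and the sketch assumes the pickle file contains a plain dictionary. The real class also accepts a Model object and can log progress.

```python
import pickle
import tempfile
from pathlib import Path


class PickledLexiconSketch:
    """Load a pickled dict and look up keys (stand-in for PickledLexicon)."""

    def __init__(self, picklefile):
        with Path(picklefile).open("rb") as f:
            self.lexicon = pickle.load(f)

    def lookup(self, key, default=None):
        """Return the value for key, or default if the key is missing."""
        return self.lexicon.get(key, default)


# Write a tiny lexicon to a temporary file, then read it back.
lexicon_path = Path(tempfile.mkdtemp()) / "lexicon.pickle"
with lexicon_path.open("wb") as f:
    pickle.dump({"katt": "cat", "hund": "dog"}, f)

lexicon = PickledLexiconSketch(lexicon_path)
print(lexicon.lookup("katt"))         # cat
print(lexicon.lookup("häst", "n/a"))  # n/a
```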

dump_yaml

dump_yaml(
    data, resolve_alias=False, sort_keys=False, indent=2
)

Convert a dictionary to a YAML formatted string.

PARAMETER DESCRIPTION
data

The dictionary to be converted.

TYPE: dict

resolve_alias

Whether to replace aliases with their anchor's content.

TYPE: bool DEFAULT: False

sort_keys

Whether to sort the keys alphabetically.

TYPE: bool DEFAULT: False

indent

The number of spaces to use for indentation.

TYPE: int DEFAULT: 2

RETURNS DESCRIPTION
str

The YAML document as a string.

cwbset

cwbset(
    values,
    delimiter="|",
    affix="|",
    sort=False,
    maxlength=4095,
    encoding="UTF-8",
)

Take an iterable with strings and return a set in the format used by Corpus Workbench.

PARAMETER DESCRIPTION
values

An iterable containing string values.

TYPE: Iterable[str]

delimiter

The delimiter to be used between the values.

TYPE: str DEFAULT: '|'

affix

The affix enclosing the resulting string.

TYPE: str DEFAULT: '|'

sort

Whether to sort the values before joining them.

TYPE: bool DEFAULT: False

maxlength

Maximum length of the resulting string.

TYPE: int DEFAULT: 4095

encoding

Encoding to use when calculating the length of the string.

TYPE: str DEFAULT: 'UTF-8'

RETURNS DESCRIPTION
str

The joined string.
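
The parameters above can be read as the following simplified sketch. It is not the library's code: cwbset_sketch is a hypothetical name, and two behaviors are assumptions, namely that values which would push the encoded result past maxlength are dropped from the end, and that an empty set is returned as the bare affix.

```python
def cwbset_sketch(values, delimiter="|", affix="|", sort=False,
                  maxlength=4095, encoding="UTF-8"):
    """Join values into an affix-enclosed, delimiter-separated set string."""
    values = sorted(values) if sort else list(values)
    kept = []
    for value in values:
        candidate = affix + delimiter.join([*kept, value]) + affix
        # The limit is measured in encoded bytes, not in characters.
        if maxlength and len(candidate.encode(encoding)) > maxlength:
            break
        kept.append(value)
    return affix + delimiter.join(kept) + affix if kept else affix


print(cwbset_sketch(["b", "a"], sort=True))               # |a|b|
print(cwbset_sketch(["aa", "bb", "cc"], maxlength=7))     # |aa|bb|
```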

set_to_list

set_to_list(setstring, delimiter='|', affix='|')

Convert a set-formatted string into a list.

PARAMETER DESCRIPTION
setstring

The string to convert into a list. The string should be enclosed with affix characters and have elements separated by delimiter.

TYPE: str

delimiter

The character used to separate elements in setstring.

TYPE: str DEFAULT: '|'

affix

The character that encloses setstring.

TYPE: str DEFAULT: '|'

RETURNS DESCRIPTION
list[str]

A list of strings.
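
Conceptually this is the inverse of cwbset. A minimal sketch, assuming an empty set is written as the bare affix (set_to_list_sketch is a hypothetical name, not the Sparv function):

```python
def set_to_list_sketch(setstring, delimiter="|", affix="|"):
    """Turn '|a|b|' into ['a', 'b'] and the bare affix into an empty list."""
    if setstring == affix:
        return []
    return setstring.strip(affix).split(delimiter)


print(set_to_list_sketch("|NN|VB|"))  # ['NN', 'VB']
print(set_to_list_sketch("|"))        # []
```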

remove_control_characters

remove_control_characters(text, keep=('\n', '\t', '\r'))

Remove control characters from the given text, except for those specified in keep.

The characters removed are those with the Unicode category "Cc" (control characters). https://www.unicode.org/reports/tr44/#GC_Values_Table

PARAMETER DESCRIPTION
text

The string from which to remove control characters.

TYPE: str

keep

An iterable of characters to keep. Default is newline, tab, and carriage return.

TYPE: Iterable[str] DEFAULT: ('\n', '\t', '\r')

RETURNS DESCRIPTION
str

The text with control characters removed.
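
The filtering rule can be expressed with unicodedata from the standard library. The function below is a sketch of that rule with a hypothetical name, not the Sparv function itself.

```python
import unicodedata


def remove_control_characters_sketch(text, keep=("\n", "\t", "\r")):
    """Drop characters of Unicode category Cc unless they are listed in keep."""
    return "".join(c for c in text if c in keep or unicodedata.category(c) != "Cc")


# NUL (\x00) and BEL (\x07) are removed; tab and newline are kept.
print(repr(remove_control_characters_sketch("a\x00b\tc\x07\n")))  # 'ab\tc\n'
```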

remove_formatting_characters

remove_formatting_characters(text, keep=())

Remove formatting characters from the given text, except for those specified in 'keep'.

The characters removed are those with the Unicode category "Cf" (formatting characters). https://www.unicode.org/reports/tr44/#GC_Values_Table

PARAMETER DESCRIPTION
text

The text from which to remove formatting characters.

TYPE: str

keep

An iterable of characters to keep.

TYPE: Iterable[str] DEFAULT: ()

RETURNS DESCRIPTION
str

The text with formatting characters removed.

remove_unassigned_characters

remove_unassigned_characters(text, keep=())

Remove unassigned characters from the given text, except for those specified in 'keep'.

The characters removed are those with the Unicode category "Cn" (unassigned characters). https://www.unicode.org/reports/tr44/#GC_Values_Table

PARAMETER DESCRIPTION
text

The text from which to remove unassigned characters.

TYPE: str

keep

An iterable of characters to keep.

TYPE: Iterable[str] DEFAULT: ()

RETURNS DESCRIPTION
str

The text with unassigned characters removed.
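
The formatting and unassigned variants follow the same pattern as the control character filter, differing only in the Unicode category that is tested. A generic sketch (remove_category is a hypothetical helper, not part of the Sparv API):

```python
import unicodedata


def remove_category(text, category, keep=()):
    """Drop characters of the given Unicode general category unless kept."""
    return "".join(
        c for c in text if c in keep or unicodedata.category(c) != category
    )


# U+00AD (soft hyphen) and U+200B (zero width space) have category Cf.
print(remove_category("co\u00adoperate\u200b", "Cf"))  # cooperate

# U+FFFF is a noncharacter, which has category Cn.
print(remove_category("ab\uffffc", "Cn"))  # abc
```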

test_lexicon

test_lexicon(lexicon, testwords)

Test the validity of a lexicon by checking if specific test words are present as keys.

This function takes a dictionary (lexicon) and a list of test words, printing the value associated with each test word.

PARAMETER DESCRIPTION
lexicon

A dictionary representing the lexicon.

TYPE: dict

testwords

An iterable of strings, each expected to be a key in the lexicon.

TYPE: Iterable[str]

get_language_name_by_part3

get_language_name_by_part3(part3)

Return language name in English given an ISO 639-3 code.

PARAMETER DESCRIPTION
part3

ISO 639-3 code.

TYPE: str

RETURNS DESCRIPTION
str | None

Language name in English.

get_language_part1_by_part3

get_language_part1_by_part3(part3)

Return ISO 639-1 code given an ISO 639-3 code.

PARAMETER DESCRIPTION
part3

ISO 639-3 code.

TYPE: str

RETURNS DESCRIPTION
str | None

ISO 639-1 code.

parse_annotation_list

parse_annotation_list(
    annotation_names,
    all_annotations=None,
    add_plain_annotations=True,
)

Take a list of annotation names and possible export names, and return a list of tuples.

Each item in the list is split into a tuple by the string ' as '. Each tuple will contain two elements. If ' as ' is not present in the string, the second element will be None.

If the list of annotation names includes the element '...', all annotations from all_annotations will be included in the result, except those explicitly excluded in the list of annotations by being prefixed with 'not '.

If an annotation occurs more than once in the list, only the last occurrence will be kept. Similarly, if an annotation is first included and then excluded (using 'not') it will be excluded from the result.

If a plain annotation (without attributes) is excluded, all its attributes will be excluded as well.

Plain annotations (without attributes) will be added if needed, unless add_plain_annotations is set to False. Make sure to disable add_plain_annotations if the annotation names may include classes or config variables.

PARAMETER DESCRIPTION
annotation_names

A list of annotation names.

TYPE: Iterable[str] | None

all_annotations

A list of all possible annotations.

TYPE: Iterable[str] | None DEFAULT: None

add_plain_annotations

If True, plain annotations (without attributes) will be added if needed. Set to False if annotation names may include classes or config variables.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
list[tuple[str, str | None]]

A list of tuples with annotation names and export names.
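
The ' as ' splitting and the rule that the last occurrence wins can be sketched as follows. This simplified helper has a hypothetical name and deliberately leaves out the handling of '...', 'not ' and plain annotations. The order of the returned tuples is a choice made for the sketch.

```python
def split_export_names(annotation_names):
    """Split 'name as export_name' items into (name, export_name) tuples."""
    parsed = {}
    for item in annotation_names:
        name, sep, export_name = item.partition(" as ")
        name = name.strip()
        # A later occurrence replaces an earlier one.
        parsed.pop(name, None)
        parsed[name] = export_name.strip() if sep else None
    return list(parsed.items())


print(split_export_names(["token:pos", "text:lang as language", "token:pos as msd"]))
# [('text:lang', 'language'), ('token:pos', 'msd')]
```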

Error Messages and Logging

The SparvErrorMessage exception and get_logger function are essential components of the Sparv pipeline. Unlike other utilities mentioned on this page, they are located directly under sparv.api.

SparvErrorMessage

This exception class is used to halt the pipeline, while notifying users of errors in a user-friendly manner without displaying a traceback. Its usage is detailed in the Writing Sparv Plugins section.

Note

When raising this exception in a Sparv module, only the message argument should be used.

PARAMETER DESCRIPTION
message

User-friendly error message to display.

TYPE: str

module

The name of the module where the error occurred (optional, not used in Sparv modules).

TYPE: str DEFAULT: ''

function

The name of the function where the error occurred (optional, not used in Sparv modules).

TYPE: str DEFAULT: ''

get_logger

This function retrieves a logger that is a child of sparv.modules. Its usage is explained in the Writing Sparv Plugins section.

PARAMETER DESCRIPTION
name

The name of the current module (usually __name__).

TYPE: str

RETURNS DESCRIPTION
Logger

Logger object.
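
The phrase "a child of sparv.modules" refers to the dotted logger hierarchy of Python's logging module. The sketch below shows that relationship with the standard library only. get_logger_sketch is a hypothetical stand-in, and the real get_logger may process the name differently.

```python
import logging


def get_logger_sketch(name):
    """Return a logger placed below 'sparv.modules' in the logger hierarchy."""
    return logging.getLogger("sparv.modules").getChild(name)


logger = get_logger_sketch("my_plugin")
print(logger.name)  # sparv.modules.my_plugin

# Records logged here propagate to handlers attached to 'sparv.modules'.
logger.info("Processing started")
```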