Utilities¶
Sparv provides a variety of utility functions, classes, and constants that are useful across different modules. These utilities are primarily imported from sparv.api.util and its submodules. For example:
from sparv.api.util.system import call_binary
Constants¶
The sparv.api.util.constants module includes several predefined constants that are used throughout the Sparv pipeline:
DELIM = "|"
: Delimiter character used to separate ambiguous results.
AFFIX = "|"
: Character used to enclose results, marking them as a set.
SCORESEP = ":"
: Character that separates an annotation from its score.
COMPSEP = "+"
: Character used to separate parts of a compound.
UNDEF = "__UNDEF__"
: Value representing undefined annotations.
OVERLAP_ATTR = "overlap"
: Name for automatically created overlap attributes.
SPARV_DEFAULT_NAMESPACE = "sparv"
: Default namespace used when annotation names collide and sparv_namespace is not set in the configuration.
UTF8 = "UTF-8"
: UTF-8 encoding.
LATIN1 = "ISO-8859-1"
: Latin-1 encoding.
HEADER_CONTENTS = "contents"
: Name of the annotation containing header contents.
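As a rough illustration of how the delimiter, affix, and score separator fit together, the sketch below builds a set-formatted string with scores. The constant values are copied from the list above so that the snippet runs without Sparv installed; the helper format_scored_set is hypothetical and not part of the Sparv API.

```python
# Values copied from the constants listed above.
DELIM = "|"
AFFIX = "|"
SCORESEP = ":"


def format_scored_set(scored):
    """Hypothetical helper: join (annotation, score) pairs into one set string."""
    if not scored:
        # An empty set is assumed to be represented by the affix alone.
        return AFFIX
    parts = [f"{value}{SCORESEP}{score}" for value, score in scored]
    return AFFIX + DELIM.join(parts) + AFFIX


print(format_scored_set([("katt", 0.9), ("hatt", 0.1)]))  # |katt:0.9|hatt:0.1|
```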
Export Utils¶
sparv.api.util.export provides utility functions for preparing data for export.
gather_annotations ¶
gather_annotations(
annotations,
export_names,
header_annotations=None,
source_file=None,
flatten=True,
split_overlaps=False,
)
Calculate the span hierarchy and the annotation_dict containing all annotation elements and attributes.
PARAMETER | DESCRIPTION |
---|---|
annotations
|
List of annotations to include.
TYPE:
|
export_names
|
Dictionary that maps from annotation names to export names.
TYPE:
|
header_annotations
|
List of header annotations.
TYPE:
|
source_file
|
The source filename.
TYPE:
|
flatten
|
Whether to return the spans as a flat list.
TYPE:
|
split_overlaps
|
Whether to split up overlapping spans.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
list[tuple]
|
A |
dict[str, dict]
|
|
RAISES | DESCRIPTION |
---|---|
SparvErrorMessage
|
If the source file is not found for header annotations. |
calculate_element_hierarchy ¶
calculate_element_hierarchy(source_file, spans_list)
Calculate the hierarchy for spans with identical start and end positions.
If two spans A and B have identical start and end positions, go through all occurrences of A and B and check which element is most often parent to the other.
PARAMETER | DESCRIPTION |
---|---|
source_file
|
The source filename.
TYPE:
|
spans_list
|
List of spans to check for hierarchy.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
dict[str, dict[str, int]]
|
A dictionary with the hierarchy of spans. |
get_annotation_names ¶
get_annotation_names(
annotations,
source_annotations=None,
source_file=None,
token_name=None,
remove_namespaces=False,
keep_struct_names=False,
sparv_namespace=None,
source_namespace=None,
xml_mode=False,
)
Get a list of annotations, token attributes, and a dictionary translating annotation names to export names.
PARAMETER | DESCRIPTION |
---|---|
annotations
|
List of elements:attributes (annotations) to include, with possible export names.
TYPE:
|
source_annotations
|
List of elements:attributes from the source file to include, with possible export names. If not specified, includes everything.
TYPE:
|
source_file
|
Name of the source file.
TYPE:
|
token_name
|
Name of the token annotation.
TYPE:
|
remove_namespaces
|
Set to
TYPE:
|
keep_struct_names
|
Set to
TYPE:
|
sparv_namespace
|
Namespace to add to all Sparv annotations.
TYPE:
|
source_namespace
|
Namespace to add to all annotations from the source file.
TYPE:
|
xml_mode
|
Set to
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
list[Annotation | AnnotationAllSourceFiles]
|
A list of annotations, a list of token attribute names, and a dictionary with translation from annotation names to export names. |
list[str]
|
|
get_header_names ¶
get_header_names(header_annotations, xml_namespaces)
Get a list of header annotations and a dictionary for renamed annotations.
PARAMETER | DESCRIPTION |
---|---|
header_annotations
|
List of header annotations from the source file to include. If not specified, includes everything.
TYPE:
|
xml_namespaces
|
XML namespaces to use for the header annotations.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
tuple[list[Annotation], dict[str, str]]
|
A list of header annotations and a dictionary with translation from annotation names to export names. |
scramble_spans ¶
scramble_spans(span_positions, chunk_name, chunk_order)
Reorder spans based on chunk_order and ensure tags are opened and closed correctly.
PARAMETER | DESCRIPTION |
---|---|
span_positions
|
Original span positions, typically obtained from
TYPE:
|
chunk_name
|
Name of the annotation to reorder.
TYPE:
|
chunk_order
|
Annotation specifying the new order of the chunks.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
list[tuple]
|
List of tuples with the new span positions and instructions. |
Install/Uninstall Utils¶
sparv.api.util.install provides functions for installing and uninstalling corpora, either locally or remotely.
install_path ¶
install_path(source_path, host, target_path)
Transfer a file or the contents of a directory to a target destination, optionally on a different host.
PARAMETER | DESCRIPTION |
---|---|
source_path
|
Path to the local file or directory to sync. If a directory is specified, its contents are synced, not the directory itself, and any extraneous files in destination directories are deleted.
TYPE:
|
host
|
Remote host to install to. Set to
TYPE:
|
target_path
|
Path to the target file or directory.
TYPE:
|
uninstall_path ¶
uninstall_path(path, host=None)
Remove a file or directory, optionally on a different host.
PARAMETER | DESCRIPTION |
---|---|
path
|
Path to the file or directory to remove.
TYPE:
|
host
|
Remote host where the file or directory is located. Set to
TYPE:
|
install_mysql ¶
install_mysql(host, db_name, sqlfile)
Insert tables and data from one or more SQL files into a local or remote MySQL database.
PARAMETER | DESCRIPTION |
---|---|
host
|
The remote host to install to. Set to
TYPE:
|
db_name
|
The name of the database.
TYPE:
|
sqlfile
|
The path to a SQL file, or a list of paths to multiple SQL files.
TYPE:
|
install_mysql_dump ¶
install_mysql_dump(host, db_name, tables)
Copy selected tables, including their data, from a local MySQL database to a remote one.
PARAMETER | DESCRIPTION |
---|---|
host
|
The remote host to install to.
TYPE:
|
db_name
|
The name of the remote database.
TYPE:
|
tables
|
A table name or a list of table names. If a list is provided, the tables are separated by spaces.
TYPE:
|
install_svn ¶
install_svn(source_file, svn_url, remove_existing=False)
Check in a file to an SVN repository.
If the file is already in the repository and remove_existing is True, it will be deleted and added again.
PARAMETER | DESCRIPTION |
---|---|
source_file
|
The file to check in.
TYPE:
|
svn_url
|
The URL to the SVN repository, including the path to the file.
TYPE:
|
remove_existing
|
If False, this function can only be used to add new files to the repository. If True, existing files will be deleted before the import.
TYPE:
|
RAISES | DESCRIPTION |
---|---|
SparvErrorMessage
|
If the source_file does not exist, if svn_url is not set, if it is not possible to list or delete the file in the SVN repository, if remove_existing is set to False and source_file already exists in the repository, or if it is not possible to import the file to the SVN repository. |
uninstall_svn ¶
uninstall_svn(svn_url)
Delete a file from an SVN repository.
PARAMETER | DESCRIPTION |
---|---|
svn_url
|
The URL to the SVN repository including the name of the file to remove.
TYPE:
|
RAISES | DESCRIPTION |
---|---|
SparvErrorMessage
|
If svn_url is not set, or if deletion fails. |
install_git ¶
install_git(source_file, repo_path, commit_message=None)
Copy a file to a local Git repository and make a commit.
PARAMETER | DESCRIPTION |
---|---|
source_file
|
The file to copy.
TYPE:
|
repo_path
|
The path to the local Git repository.
TYPE:
|
commit_message
|
The commit message. If not set, a default message will be used.
TYPE:
|
RAISES | DESCRIPTION |
---|---|
SparvErrorMessage
|
If the source file does not exist, if repo_path is not set, or if it is not possible to add the file to the Git repository. |
uninstall_git ¶
uninstall_git(file_path, commit_message=None)
Remove a file from a local Git repository and make a commit.
PARAMETER | DESCRIPTION |
---|---|
file_path
|
The path to the file to remove.
TYPE:
|
commit_message
|
The commit message. If not set, a default message will be used.
TYPE:
|
RAISES | DESCRIPTION |
---|---|
SparvErrorMessage
|
If repo_path is not set, if the file does not exist in the Git repository, or if it is not possible to remove the file from the Git repository. |
System Utils¶
sparv.api.util.system provides functions for managing processes, creating directories, and more.
kill_process ¶
kill_process(process)
Terminate a process, ignoring any errors if the process is already terminated.
PARAMETER | DESCRIPTION |
---|---|
process
|
The process to be terminated.
TYPE:
|
RAISES | DESCRIPTION |
---|---|
OSError
|
If an error occurs while killing the process. |
clear_directory ¶
clear_directory(path)
Create a new empty directory at the given path, and remove its contents if it already exists.
PARAMETER | DESCRIPTION |
---|---|
path
|
The path where the directory should be created.
TYPE:
|
call_java ¶
call_java(
jar,
arguments,
options=(),
stdin="",
search_paths=(),
encoding=None,
verbose=False,
return_command=False,
)
Execute a Java program using a specified jar file, command line arguments, and stdin input.
PARAMETER | DESCRIPTION |
---|---|
jar
|
The name of the jar file to execute.
TYPE:
|
arguments
|
A list of arguments to pass to the Java program.
TYPE:
|
options
|
A list of Java options to include in the call.
TYPE:
|
stdin
|
Input to pass to the program's
TYPE:
|
search_paths
|
Additional paths to search for the Java binary, in addition to the environment variable PATH.
TYPE:
|
encoding
|
The encoding to use for
TYPE:
|
verbose
|
If
TYPE:
|
return_command
|
If
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
tuple[str, str] | Popen
|
A tuple with |
tuple[str, str] | Popen
|
If |
call_binary ¶
call_binary(
name,
arguments=(),
stdin="",
raw_command=None,
search_paths=(),
encoding=None,
verbose=False,
use_shell=False,
allow_error=False,
return_command=False,
)
Call a binary with specified arguments and stdin.
PARAMETER | DESCRIPTION |
---|---|
name
|
The binary to execute (can include absolute or relative path). Accepts a string, a Path, or an iterable of strings or Paths, using the first found binary.
TYPE:
|
arguments
|
List of arguments to pass to the binary.
TYPE:
|
stdin
|
Input to pass to the process's
TYPE:
|
raw_command
|
A raw command to execute through the shell (implies
TYPE:
|
search_paths
|
Additional paths to search for the binary, besides the environment variable
TYPE:
|
encoding
|
Encoding to use for
TYPE:
|
verbose
|
If
TYPE:
|
use_shell
|
If
TYPE:
|
allow_error
|
If
TYPE:
|
return_command
|
If
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
tuple[str, str] | Popen
|
A tuple with |
tuple[str, str] | Popen
|
If |
RAISES | DESCRIPTION |
---|---|
OSError
|
If an error occurs while calling the binary. |
find_binary ¶
find_binary(
name,
search_paths=(),
executable=True,
allow_dir=False,
raise_error=False,
)
Locate the binary for a given program.
PARAMETER | DESCRIPTION |
---|---|
name
|
The name of the binary, either as a string or Path, or an iterable of strings or Paths with alternative names.
TYPE:
|
search_paths
|
A list of additional paths to search, besides those in the environment variable
TYPE:
|
executable
|
If
TYPE:
|
allow_dir
|
If
TYPE:
|
raise_error
|
If
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
str | None
|
The path to binary, or |
RAISES | DESCRIPTION |
---|---|
SparvErrorMessage
|
If |
rsync ¶
rsync(local, host, remote)
Transfer files and directories using rsync.
When syncing a directory, extraneous files in the destination directory are deleted, and it is always the contents of the source directory that are synced, not the directory itself (i.e. the rsync source directory is always suffixed with a slash).
PARAMETER | DESCRIPTION |
---|---|
local
|
The file or directory to transfer.
TYPE:
|
host
|
The remote host to transfer to. Set to
TYPE:
|
remote
|
The path to the target file or directory.
TYPE:
|
remove_path ¶
remove_path(path, host=None)
Remove a file or directory, either locally or remotely.
PARAMETER | DESCRIPTION |
---|---|
path
|
The file or directory to remove.
TYPE:
|
host
|
The remote host to remove from. Leave as
TYPE:
|
gpus ¶
gpus(reorder=True)
Get a list of available GPUs, sorted by free memory in descending order.
Only works for NVIDIA GPUs, and requires the nvidia-smi utility to be installed.
If reorder is True (default), the GPUs are renumbered according to the order specified in the environment variable CUDA_VISIBLE_DEVICES. For example, if CUDA_VISIBLE_DEVICES=1,0, and the GPUs with most free memory are 0, 1, the function will return [1, 0].
This is needed for PyTorch, which uses the GPU indices as specified in CUDA_VISIBLE_DEVICES, not the actual GPU indices. In the example above, PyTorch would consider GPU 1 as GPU 0 and GPU 0 as GPU 1.
PARAMETER | DESCRIPTION |
---|---|
reorder
|
Whether to renumber the GPUs according to the order in the environment variable
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
list[int] | None
|
A list of GPU indices, or None if no GPUs are available or if the nvidia-smi command failed. |
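The renumbering described above can be sketched without any GPU present. The function below is a hypothetical stand-in (renumber_gpus is not part of Sparv): it takes the physical GPU indices already sorted by free memory and maps each one to its position in CUDA_VISIBLE_DEVICES. How Sparv handles GPUs that are missing from the variable is not stated here, so this sketch simply assumes they are skipped.

```python
def renumber_gpus(by_free_memory, cuda_visible_devices):
    """Map physical GPU indices to their positions in CUDA_VISIBLE_DEVICES.

    by_free_memory: physical GPU indices, most free memory first.
    cuda_visible_devices: the variable's value, e.g. "1,0".
    """
    visible = [int(i) for i in cuda_visible_devices.split(",") if i.strip()]
    # Assumption: GPUs not listed in the variable are skipped.
    return [visible.index(gpu) for gpu in by_free_memory if gpu in visible]


# Same situation as the example in the text above.
print(renumber_gpus([0, 1], "1,0"))  # [1, 0]
```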
call_svn ¶
call_svn(command, *args)
Call an SVN command.
Will try to authenticate with SVN_USERNAME and SVN_PASSWORD environment variables if set.
PARAMETER | DESCRIPTION |
---|---|
command
|
The SVN command to call.
TYPE:
|
*args
|
Additional arguments for the SVN command.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
int
|
The return code from the command.
TYPE:
|
RAISES | DESCRIPTION |
---|---|
SparvErrorMessage
|
If the SVN command fails. |
FileNotFoundError
|
If the file is not found in SVN. |
Tag Sets¶
The sparv.api.util.tagsets subpackage includes modules with functions and objects for tag set conversions.
Join a complex tag into a string.
The tag can be a dict {"pos": pos, "msd": msd} or a tuple (pos, msd).
PARAMETER | DESCRIPTION |
---|---|
tag
|
A tag to join.
TYPE:
|
sep
|
A separator between parts of the tag in the result.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
str
|
The joined tag. |
tagmappings.join_tag()¶
Convert a complex SUC or SALDO tag record into a string.
Parameters:
tag
: The tag to convert, which can be a dictionary ({'pos': pos, 'msd': msd}) or a tuple ((pos, msd)).
sep
: The separator to use. Default: "."
tagmappings.mappings¶
Mappings of part-of-speech tags between different tag sets.
pos_to_upos()¶
Map part-of-speech tags to Universal Dependencies part-of-speech tags. This function only works if there is a conversion function in util.tagsets.pos_to_upos for the specified language and tag set.
Parameters:
pos
: The part-of-speech tag to convert.
lang
: The language code.
tagset
: The name of the tag set to which pos belongs.
tagmappings.split_tag()¶
Split a SUC or SALDO tag string ('X.Y.Z') into a tuple ('X', 'Y.Z'), where 'X' is the part of speech and 'Y', 'Z', etc., are morphological features (i.e., MSD tags).
Parameters:
tag
: The tag string to split into a tuple.
sep
: The separator to split on. Default: "."
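The splitting and joining behaviour described for split_tag and join_tag can be approximated in a few lines. The functions below are illustrative stand-ins, not Sparv's own code, and the names carry a _sketch suffix to make that clear. Returning an empty MSD string for a tag without a separator, and omitting the separator when joining an empty MSD, are assumptions.

```python
def split_tag_sketch(tag, sep="."):
    """Split 'X.Y.Z' into ('X', 'Y.Z') at the first separator."""
    pos, _, msd = tag.partition(sep)
    # Assumption: a tag without a separator yields an empty MSD string.
    return pos, msd


def join_tag_sketch(tag, sep="."):
    """Join a {'pos': ..., 'msd': ...} dict or a (pos, msd) tuple into a string."""
    if isinstance(tag, dict):
        pos, msd = tag["pos"], tag["msd"]
    else:
        pos, msd = tag
    # Assumption: no trailing separator when there is no MSD part.
    return f"{pos}{sep}{msd}" if msd else pos


print(split_tag_sketch("NN.UTR.SIN"))        # ('NN', 'UTR.SIN')
print(join_tag_sketch(("NN", "UTR.SIN")))    # NN.UTR.SIN
```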
suc_to_feats()¶
Convert SUC MSD tags into a CoNLL-U feature list (universal morphological features). Returns a list of universal features.
Parameters:
pos
: The SUC part-of-speech tag.
msd
: The SUC MSD tag.
delim
: The delimiter separating the features in msd. Default: "."
tagmappings.tags¶
Different sets of part-of-speech tags.
Miscellaneous Utils¶
sparv.api.util.misc provides miscellaneous utility functions.
PickledLexicon ¶
PickledLexicon(picklefile, verbose=True)
A class for reading a basic pickled lexicon and looking up keys.
PARAMETER | DESCRIPTION |
---|---|
picklefile
|
A
TYPE:
|
verbose
|
Whether to log status updates while reading the lexicon.
TYPE:
|
lookup ¶
lookup(key, default=None)
Lookup a key in the lexicon.
PARAMETER | DESCRIPTION |
---|---|
key
|
The key to look up.
TYPE:
|
default
|
The default value to return if the key is not found.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Any
|
The value for the key, or the default value if the key is not found. |
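To make the lookup semantics concrete, here is a minimal stand-in for the class, assuming the pickle file contains a plain dictionary. PickledLexiconSketch and demo are hypothetical names used only for this illustration; the real class may log progress and accept other arguments.

```python
import pickle
import tempfile
from pathlib import Path


class PickledLexiconSketch:
    """Illustrative stand-in: load a pickled dict and look up keys in it."""

    def __init__(self, picklefile):
        with Path(picklefile).open("rb") as f:
            # Assumption: the pickle holds a dictionary.
            self.lexicon = pickle.load(f)

    def lookup(self, key, default=None):
        """Return the value for key, or default if the key is missing."""
        return self.lexicon.get(key, default)


def demo():
    """Write a tiny lexicon to a temporary file and query it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "lexicon.pickle"
        path.write_bytes(pickle.dumps({"katt": ["katt..nn.1"]}))
        lexicon = PickledLexiconSketch(path)
        return lexicon.lookup("katt"), lexicon.lookup("hund", default=[])


print(demo())  # (['katt..nn.1'], [])
```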
dump_yaml ¶
dump_yaml(
data, resolve_alias=False, sort_keys=False, indent=2
)
Convert a dictionary to a YAML formatted string.
PARAMETER | DESCRIPTION |
---|---|
data
|
The dictionary to be converted.
TYPE:
|
resolve_alias
|
Whether to replace aliases with their anchor's content.
TYPE:
|
sort_keys
|
Whether to sort the keys alphabetically.
TYPE:
|
indent
|
The number of spaces to use for indentation.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
str
|
The YAML document as a string. |
cwbset ¶
cwbset(
values,
delimiter="|",
affix="|",
sort=False,
maxlength=4095,
encoding="UTF-8",
)
Take an iterable with strings and return a set in the format used by Corpus Workbench.
PARAMETER | DESCRIPTION |
---|---|
values
|
An iterable containing string values.
TYPE:
|
delimiter
|
The delimiter to be used between the values.
TYPE:
|
affix
|
The affix enclosing the resulting string.
TYPE:
|
sort
|
Whether to sort the values before joining them.
TYPE:
|
maxlength
|
Maximum length of the resulting string.
TYPE:
|
encoding
|
Encoding to use when calculating the length of the string.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
str
|
The joined string. |
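The following sketch shows one way the parameters above could interact. It is not Sparv's implementation: cwbset_sketch is a hypothetical name, and dropping whole trailing values once the encoded result would exceed maxlength is an assumption about how the limit is enforced.

```python
def cwbset_sketch(values, delimiter="|", affix="|", sort=False,
                  maxlength=4095, encoding="UTF-8"):
    """Join strings into a set-formatted string such as '|a|b|'."""
    values = list(values)
    if sort:
        values.sort()
    if maxlength:
        # Length is measured in encoded bytes, starting with both affixes.
        length = 2 * len(affix.encode(encoding))
        for i, value in enumerate(values):
            length += len(value.encode(encoding))
            if i > 0:
                length += len(delimiter.encode(encoding))
            if length > maxlength:
                # Assumption: drop this value and everything after it.
                values = values[:i]
                break
    if not values:
        return affix
    return affix + delimiter.join(values) + affix


print(cwbset_sketch(["b", "a"], sort=True))             # |a|b|
print(cwbset_sketch(["aa", "bb", "cc"], maxlength=7))   # |aa|bb|
```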
set_to_list ¶
set_to_list(setstring, delimiter='|', affix='|')
Convert a set-formatted string into a list.
PARAMETER | DESCRIPTION |
---|---|
setstring
|
The string to convert into a list. The string should be enclosed with
TYPE:
|
delimiter
|
The character used to separate elements in
TYPE:
|
affix
|
The character that encloses
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
list[str]
|
A list of strings. |
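The reverse conversion can be sketched in the same spirit. set_to_list_sketch is a hypothetical stand-in that assumes the input is enclosed in the affix and that a string consisting only of the affix denotes an empty set.

```python
def set_to_list_sketch(setstring, delimiter="|", affix="|"):
    """Turn a set-formatted string such as '|a|b|' into ['a', 'b']."""
    if setstring == affix:
        # Assumption: the bare affix represents an empty set.
        return []
    inner = setstring.removeprefix(affix).removesuffix(affix)
    return inner.split(delimiter)


print(set_to_list_sketch("|a|b|"))  # ['a', 'b']
```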
remove_control_characters ¶
remove_control_characters(text, keep=('\n', '\t', '\r'))
Remove control characters from the given text, except for those specified in keep.
The characters removed are those with the Unicode category "Cc" (control characters). https://www.unicode.org/reports/tr44/#GC_Values_Table
PARAMETER | DESCRIPTION |
---|---|
text
|
The string from which to remove control characters.
TYPE:
|
keep
|
An iterable of characters to keep. Default is newline, tab, and carriage return.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
str
|
The text with control characters removed. |
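Filtering by Unicode general category can be done with the standard library's unicodedata module. The sketch below shows the idea for category "Cc"; the function name is hypothetical and the code is an approximation of the documented behaviour rather than Sparv's own implementation.

```python
import unicodedata


def remove_control_characters_sketch(text, keep=("\n", "\t", "\r")):
    """Drop characters in Unicode category 'Cc' unless they are listed in keep."""
    return "".join(
        c for c in text if c in keep or unicodedata.category(c) != "Cc"
    )


# NUL and BEL are removed; tab and newline are kept by default.
print(repr(remove_control_characters_sketch("a\x00b\tc\x07\n")))  # 'ab\tc\n'
```

The same pattern would apply to the "Cf" and "Cn" categories by changing the category string.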
remove_formatting_characters ¶
remove_formatting_characters(text, keep=())
Remove formatting characters from the given text, except for those specified in 'keep'.
The characters removed are those with the Unicode category "Cf" (formatting characters). https://www.unicode.org/reports/tr44/#GC_Values_Table
PARAMETER | DESCRIPTION |
---|---|
text
|
The text from which to remove formatting characters.
TYPE:
|
keep
|
An iterable of characters to keep.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
str
|
The text with formatting characters removed. |
remove_unassigned_characters ¶
remove_unassigned_characters(text, keep=())
Remove unassigned characters from the given text, except for those specified in 'keep'.
The characters removed are those with the Unicode category "Cn" (unassigned characters). https://www.unicode.org/reports/tr44/#GC_Values_Table
PARAMETER | DESCRIPTION |
---|---|
text
|
The text from which to remove unassigned characters.
TYPE:
|
keep
|
An iterable of characters to keep.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
str
|
The text with unassigned characters removed. |
test_lexicon ¶
test_lexicon(lexicon, testwords)
Test the validity of a lexicon by checking if specific test words are present as keys.
This function takes a dictionary (lexicon) and a list of test words, printing the value associated with each test word.
PARAMETER | DESCRIPTION |
---|---|
lexicon
|
A dictionary representing the lexicon.
TYPE:
|
testwords
|
An iterable of strings, each expected to be a key in the lexicon.
TYPE:
|
get_language_name_by_part3 ¶
get_language_name_by_part3(part3)
Return language name in English given an ISO 639-3 code.
PARAMETER | DESCRIPTION |
---|---|
part3
|
ISO 639-3 code.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
str | None
|
Language name in English. |
get_language_part1_by_part3 ¶
get_language_part1_by_part3(part3)
Return ISO 639-1 code given an ISO 639-3 code.
PARAMETER | DESCRIPTION |
---|---|
part3
|
ISO 639-3 code.
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
str | None
|
ISO 639-1 code. |
parse_annotation_list ¶
parse_annotation_list(
annotation_names,
all_annotations=None,
add_plain_annotations=True,
)
Take a list of annotation names and possible export names, and return a list of tuples.
Each item in the list is split into a tuple by the string ' as '. Each tuple will contain two elements. If ' as ' is not present in the string, the second element will be None.
If the list of annotation names includes the element '...', all annotations from all_annotations will be included in the result, except those explicitly excluded in the list of annotations by being prefixed with 'not '.
If an annotation occurs more than once in the list, only the last occurrence will be kept. Similarly, if an annotation is first included and then excluded (using 'not') it will be excluded from the result.
If a plain annotation (without attributes) is excluded, all its attributes will be excluded as well.
Plain annotations (without attributes) will be added if needed, unless add_plain_annotations is set to False. Make sure to disable add_plain_annotations if the annotation names may include classes or config variables.
PARAMETER | DESCRIPTION |
---|---|
annotation_names
|
A list of annotation names.
TYPE:
|
all_annotations
|
A list of all possible annotations.
TYPE:
|
add_plain_annotations
|
If
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
list[tuple[str, str | None]]
|
A list of tuples with annotation names and export names. |
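The first rule above, splitting each item on ' as ', can be illustrated with a small helper. This sketch covers only that step; the handling of '...', 'not ' prefixes, duplicates, and plain annotations is left out. The name split_export_name and the example annotation names are hypothetical.

```python
def split_export_name(item):
    """Split 'name as export_name' into (name, export_name).

    The second element is None when ' as ' does not occur in the item.
    """
    name, sep, export_name = item.partition(" as ")
    return name.strip(), (export_name.strip() if sep else None)


pairs = [split_export_name(i) for i in ["token:pos as msd", "sentence"]]
print(pairs)  # [('token:pos', 'msd'), ('sentence', None)]
```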
Error Messages and Logging¶
The SparvErrorMessage exception and get_logger function are essential components of the Sparv pipeline. Unlike other utilities mentioned on this page, they are located directly under sparv.api.
SparvErrorMessage¶
This exception class is used to halt the pipeline, while notifying users of errors in a user-friendly manner without displaying a traceback. Its usage is detailed in the Writing Sparv Plugins section.
Note
When raising this exception in a Sparv module, only the message argument should be used.
PARAMETER | DESCRIPTION |
---|---|
message
|
User-friendly error message to display.
TYPE:
|
module
|
The name of the module where the error occurred (optional, not used in Sparv modules).
TYPE:
|
function
|
The name of the function where the error occurred (optional, not used in Sparv modules).
TYPE:
|
get_logger¶
This function retrieves a logger that is a child of sparv.modules. Its usage is explained in the Writing Sparv Plugins section.
PARAMETER | DESCRIPTION |
---|---|
name
|
The name of the current module (usually
TYPE:
|
RETURNS | DESCRIPTION |
---|---|
Logger
|
Logger object. |