Skip to content

Command Line Interface

This section provides an overview of the Sparv command line interface (CLI) and its key commands.

Running sparv without any arguments will display all the available Sparv commands:

Annotating a corpus:
   run              Annotate a corpus and generate export files
   install          Install a corpus
   uninstall        Uninstall a corpus
   clean            Remove output directories

Inspecting corpus details:
   config           Display the corpus configuration
   files            List available corpus source files that can be annotated by Sparv

Show annotation info:
   modules          List available modules and annotations
   presets          List available annotation presets
   classes          List available annotation classes
   languages        List supported languages

Setting up Sparv:
   setup            Set up the Sparv data directory
   plugins          Manage Sparv plugins
   wizard           Run config wizard to create a corpus config
   build-models     Download and build the Sparv models

Advanced commands:
   run-rule         Run specified rule(s) for creating annotations
   create-file      Create specified file(s)
   preload          Preload annotators and models
   autocomplete     Enable tab completion in bash/zsh
   schema           Print a JSON schema for the Sparv config format

Each command in the Sparv command line interface comes with a help text accessible via the -h flag. Below is an overview of the key commands in Sparv, but for more detailed information about the parameters and options available for each command, use the -h flag in the terminal.

Annotating a Corpus

All of the following commands should be run from inside a corpus directory.

sparv run

sparv run is the primary command for annotating a corpus. It initiates the annotation process and generates all the output formats (or exports) specified under export.default in your config file. Alternatively, you can specify a particular export format, for example, sparv run csv_export:csv. To see all available output formats for your corpus, use sparv run -l. The generated output files will be stored in an export directory within your corpus directory.

By using the -j flag, you can specify the number of parallel processes to use during the annotation process. For example, sparv run -j 4 will run the annotation process with four parallel processes. The default value is 1.

sparv install

Installing a corpus involves deploying it either locally or on a remote server. Sparv natively supports the deployment of XML exports, CWB data files, SQL data, and more. When you run sparv install, Sparv checks if all necessary annotations are present. If any annotations are missing, Sparv will create them for you, so you don't necessarily need to annotate the corpus beforehand. To see the available installation options, use sparv install -l.

sparv clean

During the annotation process, Sparv creates a sparv-workdir directory within your corpus directory. This directory contains intermediate files that usually speed up subsequent processing. However, if you need to free up disk space or want to rerun all annotations from scratch, you can delete this directory by running sparv clean. Additionally, you can remove the export directory and log files by adding the appropriate flags. For more options, check sparv clean -h.

Inspecting Corpus Details

sparv config

Displays the complete configuration for your corpus, including all default values. More information can be found in the corpus configuration section.

sparv files

This command displays a list of all source files available in your corpus. The files are shown without their extensions, which is also the format used when referencing specific files for annotation with the --file argument in the run command.

Show Annotation Info

All of the following commands except sparv language should be run from inside a corpus directory. The output of these commands will differ depending on the corpus language configured in the corpus configuration file.

sparv modules

The sparv modules command lists all available modules and annotations for the language specified in the corpus configuration file. This command is useful for finding the names of annotations you want to include in your corpus configuration, and available options for each module.

sparv presets

This command lists all annotation presets available for the current language. For more details, see the Annotation Presets section.

sparv classes

This command lists all available annotation classes for the current language. For more information, refer to the Annotation Classes section.

sparv languages

This command lists all languages supported by Sparv. It can be run from any directory.

Setting Up Sparv

sparv setup and sparv build-models

These commands are used to set up the Sparv data directory and download and build the Sparv models, respectively. They are detailed in the Setting Up Sparv section.

Advanced Commands

sparv run-rule and sparv create-file

These commands allow you to run specific annotation processors or create specific files. They are mostly useful for debugging and testing. You can provide multiple arguments to these commands.

Example of running the Stanza annotations (part-of-speech tagging and dependency parsing) for all source files:

sparv run-rule stanza:annotate

Example of creating the part-of-speech annotation for the source file document1:

sparv create-file sparv-workdir/document1/segment.token/stanza.pos

sparv preload

This command preloads annotators, their models, and related binaries to speed up the annotation process. This is particularly useful when working with multiple smaller source files, as it prevents the need to load models repeatedly for each file. Note that not all annotators support preloading; use the --list argument to see which annotators are supported.

The Sparv preloader can be run from any directory containing a config.yaml file. While this file follows the same format as corpus configuration files, it doesn't need to be tied to a specific corpus. The only requirement is a preload: section listing the annotators to preload (as provided by the --list command). These annotators will be loaded using the settings in the configuration file, combined with default settings as needed.

The preloader can be shared across multiple corpora, provided the annotator configurations are consistent. For example, a preloaded annotator configured to use model A, can not be used to annotate a corpus where the annotator is configured to use model B. If Sparv detects a configuration mismatch, it will automatically revert to not using the preloader for that annotator.

The preloader uses socket files for communication. Use the --socket argument to specify the path to the socket file that will be created. If omitted, a sparv.socket will be created in the current directory.

The --processes argument specifies the number of parallel processes to start. Ideally, this should match the number of processes you plan to use when running Sparv (e.g., sparv run -j 4) to avoid bottlenecks.

Example of starting the preloader with four parallel processes:

sparv preload --socket my_socket.sock --processes 4

Once the preloader is running, use another terminal to annotate your corpus. To make Sparv use the preloader, use the --socket argument and point it to the same socket file created by the preloader. For example:

sparv run --socket my_socket.sock

If the preloader is busy, Sparv will default to running annotators the regular way. To force Sparv to wait for the preloader, use the --force-preloader flag with the run command.

To shut down the preloader, either press Ctrl-C in the preloader terminal or use the following command, specifying the relevant socket:

sparv preload stop --socket my_socket.sock