Quick Start¶
This guide will help you get started with Sparv in just a few minutes, and walk you through the process of annotating your first corpus. For a more detailed installation guide and user manual, please refer to the other sections of this documentation.
Note
Sparv is a command line application and all interaction in this quick start guide takes place in a terminal.
This guide should work both in a Unix-like environment and the Windows command line.
Installation¶
First, ensure that you have Python 3.10 or newer installed by running the following command in your terminal:
python3 --version
Note
On some systems, the command may be called python
instead of python3
.
Next, install pipx if it's not already installed:
python3 -m pip install --user pipx
python3 -m pipx ensurepath
Once pipx is installed, run the following command to install Sparv:
pipx install sparv
Verify that the installation was successful by running sparv
, which should display Sparv's command-line help:
sparv
Finally, complete the setup by running the sparv setup
command to
choose where Sparv will save its models and configuration:
sparv setup
Creating a Corpus¶
With Sparv installed, let's try it out on a small corpus.
Each corpus needs its own directory, so start by creating one called my_corpus
:
mkdir my_corpus
cd my_corpus
Inside this directory, create another directory called source
, where we will put the corpus source files (the files
containing the text we want to annotate):
mkdir source
Using your favourite plain text editor (i.e. not Word), create a source file in XML format and place it in the source
directory. Make sure to save it with UTF-8 encoding.
document.xml
<text title="My first corpus document" author="me">
Ord, ord, ord. Här kommer några fler ord.
</text>
Note
The source
directory may contain as many files as you want, but let's start with just this one.
Creating the Config File¶
For Sparv to know what to do with your corpus, you need to create a configuration file. You can use the corpus config wizard or write it manually. For this guide, we'll write it by hand.
Create a file called config.yaml
directly in your corpus directory and save it with UTF-8 encoding. Your directory
structure should now look like this:
my_corpus/
├── config.yaml
└── source/
└── document.xml
Add the following content to the configuration file and save it:
metadata:
language: swe
import:
importer: xml_import:parse
export:
annotations:
- <sentence>
- <token>
The configuration file consists of several sections, each containing configuration variables and their values. First,
we specify the corpus language (Swedish). Second, in the import
section, we specify which of
Sparv's importer modules to use (we want the one for XML). Finally, in the export
section, we list what
automatic annotations we want Sparv to add. For this simple corpus we only ask for sentence segmentation and
tokenization.
Running Sparv¶
If you have followed the steps above, everything should now be ready. Make sure that you are in the my_corpus
folder,
and then run Sparv:
sparv run
After a short while, Sparv will tell you where the resulting files are saved. Let's have a look at one of them:
export/xml_export.pretty/document_export.xml
<?xml version='1.0' encoding='UTF-8'?>
<text author="me" title="My first corpus document">
<sentence>
<token>Ord</token>
<token>,</token>
<token>ord</token>
<token>,</token>
<token>ord</token>
<token>.</token>
</sentence>
<sentence>
<token>Här</token>
<token>kommer</token>
<token>några</token>
<token>fler</token>
<token>ord</token>
<token>.</token>
</sentence>
</text>
What's Next?¶
Try adding some more annotations to your corpus by extending the annotations list in the corpus configuration file. To
explore available annotations, use the sparv modules
command, or see the Available Analyses
section in the documentation. You can also try out the corpus configuration wizard by running sparv wizard
.
It is also possible to annotate texts in other languages, such as English. Just change language: swe
to
language: eng
in the configuration file. Run sparv languages
to see all supported languages.
Note
Some annotations may require additional software to be installed before you can use them.