Web Sparv is a web interface for an older version of the Sparv pipeline (Sparv 3). If you are interested in the new version of the Pipeline, click here.
The source code for Sparv's graphical user interface is available on GitHub. It is divided into three parts:
- frontend: a JavaScript application
- backend: a Flask application, providing the REST API
- catapult: an auxiliary tool for running the Sparv pipeline
The code is distributed under the MIT license. Please refer to the above repositories for installation instructions.
The Sparv pipeline is also required for running the Sparv interface.
Technical Documentation of the frontend
This section describes the internal structure of the Sparv frontend and can be used as a developer's guide.
The frontend is available at https://spraakbanken.gu.se/sparv/.
Requirements
- Node v0.10.x, which should be installed through your package manager. This should include NPM, the Node package manager, by default.
- Grunt, install using npm install -g grunt-cli.
- CoffeeScript, install using npm install -g coffeescript.
- Sass, install using npm install sass.
Frontend Configuration
The app/config.js file contains the backend address, the address of the default settings JSON schema, and the address of Karp.
Running the Frontend
To run the frontend locally (while developing), run grunt serve. In your browser, open http://localhost:9010 to launch Sparv.
While grunt serve is running, the CoffeeScript and Sass files are compiled automatically whenever they are edited, and the browser window is reloaded to reflect the changes.
Before releasing a new version, compile the scripts by running grunt in the frontend directory. This creates a directory called dist which contains all the files necessary to run the frontend.
Technical Documentation of the backend
This section describes the internal structure of the Sparv backend and its usage of the catapult and can be used as a developer's guide.
Backend
The backend is run through the WSGI script index.py. It is available at https://spraakbanken.gu.se/ws/sparv.
Requirements
- Version 3 of the Sparv corpus pipeline (see the technical report for installation instructions)
- Python 3.4 or newer
Python virtual environment
Though it is not required, we recommend that you use a Python virtual environment to run the Sparv backend. This is the easiest way to ensure that you have all the Python dependencies needed to run the modules.
Set up a Python virtual environment as a subdirectory of the backend/html/app directory:
python3 -m venv venv
Activate the virtual environment and install the required Python packages:
source venv/bin/activate
pip install -r requirements.txt
You can then deactivate the virtual environment:
deactivate
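While the environment is active, a check like the following can confirm that the interpreter in use really belongs to it. This is a minimal sketch; the function name in_virtualenv is our own and is not part of the Sparv backend.

```python
import sys


def in_virtualenv():
    """Return True if the running interpreter belongs to a virtual environment.

    Inside an environment created with `python3 -m venv`, sys.prefix points at
    the environment while sys.base_prefix points at the base installation.
    """
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```

Running the script with venv/bin/python and again with the system interpreter should give different answers.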
Backend Configuration
The configuration variables are stored in html/app/config.py:
- backend: the address where the backend is hosted
- sparv_python: the Python path to the Sparv pipeline Python directory
- sparv_backend: the path to the backend directory
- builds_dir: the directory that hosts running and completed builds
- log_dir: location of the log files. Can be set to None to log to stdout.
- sparv_models: location of the Sparv pipeline models
- sparv_makefiles: location of the Sparv pipeline makefiles
- secret_key: a string of your choosing. It is needed for queries that may cause deletions of builds.
- venv_path: path to the activation script of the Python virtual environment (may be set to None)
- processes: number of processes that make will run while annotating
- fileupload_ext: extension used for builds that contain file uploads
- socket_file: path to the socket file used to communicate with the catapult
- catalaunch_binary: path to the catalaunch binary file
- python_interpreter: "Python" interpreter, replaced with catalaunch
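Put together, a config.py might look roughly like the sketch below. The variable names are the ones listed above; every value is a placeholder assumption chosen for illustration, not a project default, and must be adapted to the local installation.

```python
# Hypothetical html/app/config.py. Variable names follow the list above;
# all values are placeholder assumptions, not the project's defaults.
backend = "http://localhost:8000"                  # address where the backend is hosted
sparv_python = "/path/to/data/pipeline/python"     # Sparv pipeline Python directory
sparv_backend = "/path/to/backend"                 # backend directory
builds_dir = "/path/to/data/builds"                # running and completed builds
log_dir = None                                     # None means: log to stdout
sparv_models = "/path/to/data/pipeline/models"     # Sparv pipeline models
sparv_makefiles = "/path/to/data/pipeline/makefiles"
secret_key = "change-me"                           # needed for queries that delete builds
venv_path = None                                   # or the path to an activation script
processes = 2                                      # processes make runs while annotating
fileupload_ext = "-f"                              # placeholder extension for upload builds
socket_file = "/path/to/data/catapult/catapult.sock"
catalaunch_binary = "/path/to/data/catapult/catalaunch"
# Assumption: catalaunch stands in for the Python interpreter; the exact
# form of this value in the real configuration is not shown in this document.
python_interpreter = catalaunch_binary
```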
When running the backend with gunicorn (recommended), you may also want to modify the configuration in html/app/gunicorn_config.py.
The suggested location for the Sparv pipeline is the directory data/pipeline.
Running the Backend
The backend is set up to be run with gunicorn which is installed automatically inside the Python virtual environment. From the backend directory you can run the following command:
html/app/venv/bin/gunicorn -c html/app/gunicorn_config.py index
This will start a WSGI server and bind it to the socket defined in gunicorn_config.py. Log messages are written to the file specified in the same file (or to the terminal if nothing is specified).
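A gunicorn configuration file is ordinary Python whose module-level names are read as settings. The sketch below uses the standard gunicorn settings bind, workers, timeout and errorlog; the values are assumptions for illustration and are not taken from the project's gunicorn_config.py.

```python
# Hypothetical gunicorn_config.py; all values are illustrative assumptions.

# Socket the WSGI server binds to (a TCP address or "unix:/path/to/socket").
bind = "127.0.0.1:8000"

# Number of worker processes handling requests.
workers = 2

# Seconds a worker may stay silent before it is killed and restarted.
timeout = 120

# File that receives log messages; "-" sends them to stderr instead.
errorlog = "-"
```

Annotation requests can take a long time, so a generous timeout is a plausible choice here, but the right value depends on the installation.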
The backend can also be started by running index.py with the Python interpreter, but this is mostly used for development or debugging.
Makefile and Settings JSON Schema
The makefile for each corpus is created from a JSON object that is created by the script html/app/schema_generator.py. The frontend builds its form based on the requested schema. New entries can be added and the frontend should render them automatically. The file that creates the makefile is html/app/make_makefile.py.
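The idea of turning a settings object into a makefile can be sketched as follows. This is not the code of make_makefile.py: the function name, the settings keys and the resulting makefile variable names are all invented for illustration.

```python
import json


def settings_to_makefile(settings):
    """Render a flat settings dict as makefile variable assignments.

    Hypothetical helper: keys become upper-case variable names and list
    values are joined with spaces.
    """
    lines = []
    for key in sorted(settings):
        value = settings[key]
        if isinstance(value, (list, tuple)):
            value = " ".join(str(item) for item in value)
        lines.append("{} = {}".format(key.upper(), value))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example settings as they might arrive from the frontend as JSON.
    raw = '{"corpus": "mycorpus", "lang": "sv", "annotations": ["word", "pos"]}'
    print(settings_to_makefile(json.loads(raw)), end="")
```

Sorting the keys keeps the generated file stable between runs, which makes differences between two builds easier to spot.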
Catapult
The catapult runs a Python instance that shares the loaded lexicons, keeps malt processes running and lowers the Python interpreter startup time. Scripts are run on the catapult with the tiny C program catalaunch.
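The benefit of a long-running process can be illustrated with a small sketch in which a resource is loaded once and then reused by every later request. This is only an analogy for what the catapult does, not its implementation; load_lexicon, handle_request and the lexicon contents are hypothetical.

```python
import functools


@functools.lru_cache(maxsize=None)
def load_lexicon(name):
    """Pretend to load a large lexicon; the result is cached per name."""
    # A real lexicon would be read from disk here, which is the slow part.
    return {"name": name, "entries": {"katt": "NN", "springa": "VB"}}


def handle_request(word, lexicon_name="saldo"):
    """Look up a word, reusing the lexicon kept in memory by this process."""
    lexicon = load_lexicon(lexicon_name)
    return lexicon["entries"].get(word)


if __name__ == "__main__":
    for word in ("katt", "springa", "katt"):
        print(word, handle_request(word))
    # cache_info().misses counts how often the load function really ran.
    print("lexicon loads:", load_lexicon.cache_info().misses)
```

A script started from scratch for every request would pay the loading cost each time, whereas a persistent process pays it once.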
Requirements
- GCC for compiling the catapult C extension
- Python 3.4 or newer
Catapult setup
The catapult can, for example, be placed inside the data directory.
As with the Sparv backend, we recommend that you use a Python virtual environment for running the catapult. Check the backend requirements for instructions. The standard location for installing the virtual environment for the catapult is inside the catapult directory.
After setting up the Python virtual environment, you need to adapt the variable VENV_PATH in the Sparv pipeline in /makefiles/Makefile.config so that it points to the catapult virtual environment.
Catapult Configuration
The configuration variables are stored in config.sh:
- SPARV_PYTHON: the path to the sb pipeline Python directory
- SPARV_MODELS: location of the sb pipeline models
- SPARV_BIN: location of the sb pipeline binaries
- SPARV_MAKEFILES: location of the pipeline makefiles
- CATAPULT_DIR: location of the catapult directory
- BUILDS_DIR: the directory that hosts running and completed builds
- LOGDIR: the path to the log file directory
- CATAPULT_VENV: the path to the Python virtual environment used by the catapult
Running the Catapult
- Run make to build catalaunch.
- Run ./start-server.sh to start the catapult.
- Set up the cron jobs listed in catapult/cronjobs for the automatic maintenance of Sparv.
Cron jobs
The following cron jobs are used in Sparv:
- Cleanup: Builds that have not been accessed for 7 days are removed every night at midnight by issuing https://ws.spraakbanken.gu.se/ws/sparv/cleanup?secret_key=SECRETKEY.
- Keep-alive: The script catapult/keep-alive.sh is run every five minutes and restarts the catapult with catapult/start-server.sh if it does not respond to ping. Instead of setting up this cron job you can run the catapult using a process control system like supervisord.
- Update-saldo: The Saldo lexicon is updated daily with the script catapult/update-saldo.sh. This takes some time, and is therefore run during the night. The catapult is restarted afterwards by the keep-alive.sh script.
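The cleanup rule and the request that triggers it can be sketched as below. The seven-day limit and the URL form come from the list above; the function names are hypothetical, and the real selection of builds happens inside the backend, whose logic is not shown here.

```python
import time
from urllib.parse import urlencode

MAX_AGE_SECONDS = 7 * 24 * 60 * 60  # seven days


def is_stale(last_access, now=None):
    """Return True if a build was last accessed more than seven days ago.

    Both arguments are Unix timestamps in seconds.
    """
    if now is None:
        now = time.time()
    return now - last_access > MAX_AGE_SECONDS


def cleanup_url(backend, secret_key):
    """Build the cleanup request URL for a backend address."""
    query = urlencode({"secret_key": secret_key})
    return "{}/cleanup?{}".format(backend.rstrip("/"), query)


if __name__ == "__main__":
    # SECRETKEY is the placeholder used in the cron job description.
    print(cleanup_url("https://ws.spraakbanken.gu.se/ws/sparv", "SECRETKEY"))
```

Using urlencode means that a secret key containing characters such as & or spaces is still passed correctly in the query string.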
Interaction between the Sparv components
The following image illustrates how the components involved in the Sparv web interface interact with each other and with the user, both before and during the analysis.