Språkbanken Text is a department within Språkbanken.

Mink user manual

Contents

  1. Create a corpus and upload files
  2. Source files
  3. Corpus settings
  4. Metadata
  5. Run automatic annotation
  6. Download results
  7. Install results into tools

Create a corpus and upload files

Any text file you upload to Mink must belong to a corpus – a collection of texts. You can use it to group related text files, and you can create as many corpora as you want.

You can create a corpus and upload your text file in either of two ways:

A. Upload the file directly and have a corpus automatically created for it

  1. Go to Mink and click My data
  2. Use the Add files section at the bottom of the page to upload a file from your computer
  3. A corpus is created automatically, with the uploaded file added under Source texts
  4. The corpus is given an automatically generated identifier ("mink-" followed by a random string of characters), but it is better to give it a more informative name (see the Metadata section)

B. Create a corpus, then upload a file to it

  1. Go to Mink and click My data
  2. Click New corpus
  3. Enter a name
  4. Select the source format that matches your text files
  5. Click Create
  6. On the next page, scroll to the bottom and use the Add files section to upload a file from your computer

Source files

Managing source files

Once a corpus has been created, you can upload source files to it at any time. Use the upload area at the bottom of the corpus overview page, or drag a file anywhere onto that page.

Source files should contain the text data that will form the content of the corpus. They need to be in the format that the corpus is configured for, such as plain text or PDF. (If you created the corpus by uploading files, the corpus is automatically configured accordingly.)

If you upload a file twice (or upload different files with the same name), the most recent upload replaces the earlier one.

You can delete a source file by using the trash can button in the file listing.

Source file details

The source files are listed at the bottom of the corpus overview page. Click a filename to view more details. On the details page you can view and download the original and plain-text versions of the file:

  • The original version can be viewed only if it is in plain text or XML.
  • The plain-text version is created during annotation, and can only be viewed/downloaded after annotation is done.

Size limits

There is a maximum size for each source file, and a maximum total size for the source files in a corpus. There are also recommended maximum file and corpus sizes. The recommended limits are not exact, but exceeding them significantly may slow the annotation job down considerably.

If you need to exceed the maximum limits for your project, please get in touch with Språkbanken Text at sb-info@spraakbanken.gu.se.

Privacy and data retention

See the Privacy and data policy.


Corpus settings

In the corpus overview, you can edit the corpus settings by clicking Edit in the Settings panel. Mink uses the Sparv annotation pipeline in the background, and these settings are translated to Sparv configuration.

Source format

All source files in a corpus must be of the same format, and that format must match the setting here. Set it according to your source files. If you have already uploaded files of another format, you will not be able to change the setting until you remove those files.

The setting relies on a one-to-one mapping between file formats and filename extensions. If your files have other extensions, or none at all (for example plain text files without an extension), you will need to rename them accordingly. Extensions in uppercase are automatically changed to lowercase.
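
Renaming many files by hand is tedious, so it can be scripted before upload. The following is a minimal sketch in Python, not part of Mink. It assumes that plain text files should end in .txt (an assumption; check which extension the Source format setting expects), and the function names normalized_name and rename_all are made up for illustration. Mink lowercases extensions on its own, so that part only keeps the local filenames consistent.

```python
from pathlib import Path


def normalized_name(filename: str, default_ext: str = ".txt") -> str:
    """Return the filename with a lowercase extension.

    Files without any extension get `default_ext` appended.
    ".txt" is an assumed extension for plain text, not confirmed by Mink.
    """
    path = Path(filename)
    if path.suffix == "":
        return path.name + default_ext
    return path.stem + path.suffix.lower()


def rename_all(folder: str, default_ext: str = ".txt") -> None:
    """Rename every regular file directly inside `folder` in place."""
    for path in Path(folder).iterdir():
        if path.is_file():
            target = path.with_name(normalized_name(path.name, default_ext))
            if target != path:
                path.rename(target)
```

Calling normalized_name first on a few filenames is a cheap way to preview the result before rename_all touches any files.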

Text element

If the file format is XML, you need to specify which XML element represents a text. For instance, if your files have the structure shown in the example below, you would enter article in this setting.

<article>
  <heading>Lorem ipsum dolor sit amet</heading>
  <body>Donec dapibus mauris lectus, ut finibus odio vestibulum ut.</body>
</article>

<article>
  <heading>Nunc convallis et purus in bibendum</heading>
  <body>Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae.</body>
</article>

It follows that all XML files in the same corpus must follow the same structure, or at least all use the article element. More technical information is available in the Sparv docs, under Import Options and the item import.text_annotation.
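
Before uploading, it can be useful to check that each file actually contains the chosen text element. The following is a minimal sketch using Python's standard library, not something Mink provides. The sample wraps the articles in a hypothetical root element, articles, so that it parses as one well-formed document, and the function name count_text_elements is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two <article> texts inside an assumed root element.
SAMPLE = """<articles>
  <article>
    <heading>Lorem ipsum dolor sit amet</heading>
    <body>Donec dapibus mauris lectus.</body>
  </article>
  <article>
    <heading>Nunc convallis et purus in bibendum</heading>
    <body>Vestibulum ante ipsum primis.</body>
  </article>
</articles>"""


def count_text_elements(xml_string: str, text_element: str) -> int:
    """Count occurrences of `text_element` anywhere in the document."""
    root = ET.fromstring(xml_string)
    # iter() visits the root and all of its descendants, so it also
    # covers files where the root itself is the text element.
    return sum(1 for _ in root.iter(text_element))


print(count_text_elements(SAMPLE, "article"))
```

A count of 0 for some file suggests that it does not follow the structure that the Text element setting assumes.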

Annotation options

Sparv offers a wide range of settings to control the annotation procedure. Until a user interface for the entire configuration structure has been put in place, we have added fields for a couple of the options.

The Existing sentence segmentation option can be used if you have already prepared your source texts by putting each sentence on its own line. This option instructs Sparv to skip the built-in sentence segmentation algorithm and use your segmentation instead.
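
If your sentences come from another tool or from manual work, the file can be produced with a few lines of code. The following is a minimal sketch, assuming you already have the sentences as a list of strings; it does not do any segmentation itself, and the function name one_sentence_per_line is made up for illustration.

```python
def one_sentence_per_line(sentences: list[str]) -> str:
    """Return text in which each sentence occupies exactly one line."""
    lines = []
    for sentence in sentences:
        # Collapse internal whitespace, including stray line breaks,
        # so that a sentence never spans more than one line.
        cleaned = " ".join(sentence.split())
        if cleaned:  # skip entries that are empty or whitespace only
            lines.append(cleaned)
    return "".join(line + "\n" for line in lines)


text = one_sentence_per_line(["First sentence.", "Second\nsentence here."])
print(text, end="")
```

The returned string can be written to a plain text file, encoded as UTF-8, and uploaded as a source file.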

The Time span fields can be used to coarsely timestamp your material for the Korp and Strix tools, described in the Install results into tools section. You can, for instance, create a corpus for each year in a time period you are studying, in order to get trend diagrams (see the Korp manual) or search by year. You would then set Time span: From to 2022-01-01 and Time span: To to 2022-12-31 for the corpus representing the year 2022.
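
For a series of yearly corpora, the From and To values follow a fixed pattern. The following is a minimal sketch that computes them in the same YYYY-MM-DD form as the example above; the function name year_span and the range of years are made up for illustration, and the values still have to be entered in the settings form by hand.

```python
from datetime import date


def year_span(year: int) -> tuple[str, str]:
    """Return (from, to) dates covering one calendar year."""
    return date(year, 1, 1).isoformat(), date(year, 12, 31).isoformat()


for year in range(2020, 2023):
    start, end = year_span(year)
    print(f"corpus for {year}: From {start}, To {end}")
```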


Metadata

Back in the corpus overview, you can edit the corpus metadata by clicking Edit in the Metadata panel. Like the corpus settings, these are encoded as "configuration" for Sparv, but they do not affect the annotation itself; they are only written directly to the output data.

To distinguish your corpora from each other, a Name is required in both Swedish and English. Feel free to enter the same value in both fields if the language distinction is not important to you. The same goes for Description, except that it is optional.


Run automatic annotation

Before the automatic annotation can be started for a corpus, the corpus needs at least one source file as well as valid settings and metadata, as described in the previous sections. Then, use the Run annotation button in the Status panel. This will create an annotation job and place it in a queue. A few jobs can run in parallel, so your job may well start immediately.

Once the job is queued, the status panel will update continuously at short intervals to fetch and show the current job status. If you change your mind about the annotation, you can interrupt it with the Abort button.

In case of an error with the annotation, please consult the warning/error output in the Status panel. Try changing the source files or settings, and try again. If you need help, please contact Språkbanken Text at sb-info@spraakbanken.gu.se.


Download results

The product of a successful annotation is a set of export files, in Sparv terminology. These contain the source text with annotations and metadata added. On their own, they are not very easy for human eyes to read, but they can be used in custom scripts or other software to help answer various kinds of questions.

You can download all export files at once using the link labeled Archive, or click the Single files link to view the list of files and download them one by one.

Export files are organized in folders by type:

  • The xml_export.pretty folder contains result data as XML. Each file typically has the text tokenized in the "text", "sentence" and "token" levels, with tag names accordingly. If the source files were in XML, however, this structure is merged with the existing structure in the source files. Annotations and metadata are encoded as XML attributes.
  • The csv_export folder contains result data in a tabular format resembling the CoNLL-U format. The CSV format is not as flexible as XML, so metadata and some annotations are added on special lines starting with the # character.
  • There is an additional CSV file whose filename starts with stats_. This contains token frequency statistics.
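
As an illustration of using the XML export in a custom script, the following is a minimal sketch that counts the values of one token attribute. It relies only on the "text", "sentence" and "token" tag names mentioned above. The sample data, the attribute names pos and title, and the function name attribute_counts are all hypothetical; inspect your own export files to see which attributes they contain.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample; real export files will have other attributes.
SAMPLE_EXPORT = """<text title="Example">
  <sentence>
    <token pos="DT">A</token>
    <token pos="NN">word</token>
  </sentence>
  <sentence>
    <token pos="DT">Another</token>
    <token pos="NN">word</token>
  </sentence>
</text>"""


def attribute_counts(xml_string: str, attribute: str) -> Counter:
    """Count the values of `attribute` over all <token> elements."""
    root = ET.fromstring(xml_string)
    return Counter(
        token.attrib[attribute]
        for token in root.iter("token")
        if attribute in token.attrib  # ignore tokens lacking the attribute
    )


print(attribute_counts(SAMPLE_EXPORT, "pos"))
```

For a downloaded file, ET.parse(path).getroot() can be used in place of ET.fromstring.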

The CSV files can be imported into spreadsheet editors such as Excel or LibreOffice Calc. Make sure you select UTF-8 as the encoding and tabs as the separator.
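
The same two choices, UTF-8 and tab separation, apply when reading the files in a script. The following is a minimal sketch that separates the special lines starting with # from the tabular rows. The sample content, its columns, and the function name read_export are hypothetical; the exact column layout of your export is not assumed here.

```python
import csv
import io

# Hypothetical sample; the real columns depend on your annotations.
SAMPLE_CSV = (
    "# hypothetical metadata line\n"
    "A\tDT\n"
    "word\tNN\n"
    "\n"
    "# another hypothetical special line\n"
    "Another\tDT\n"
)


def read_export(text: str) -> tuple[list[str], list[list[str]]]:
    """Split export text into '#' lines and tab-separated rows."""
    special: list[str] = []
    rows: list[list[str]] = []
    # QUOTE_NONE keeps quotation marks in the text as ordinary characters.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # blank line
        if row[0].startswith("#"):
            # Caveat: a row whose first cell is a literal "#" token
            # would also end up here.
            special.append("\t".join(row))
        else:
            rows.append(row)
    return special, rows


special, rows = read_export(SAMPLE_CSV)
print(len(special), "special lines,", len(rows), "rows")
```

To read a downloaded file, open it with open(path, encoding="utf-8", newline="") and pass the result of its read() method to read_export.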


Install results into tools

Språkbanken's tools are well known in the language research community for providing advanced search functionality to a large repository of public corpus data. In Mink, you can use some of these tools for your own data, in a private section that only you can access.

Korp

Korp (try it, read more) is Språkbanken's word research platform. It allows searches with different degrees of complexity and shows results in a keyword-in-context (KWIC) layout. You can also get token frequencies by various annotation values and more.

Strix

Strix (try it) is Språkbanken's text research platform. It offers advanced filters and a reading mode in which you can choose any metadata attribute to have the corresponding words highlighted in the text. This may help you see a text from a different perspective.

Installation jobs

To sync your annotated data to one of the tools, click the Install button near the tool name. This will create and queue an installation job, similar to the annotation job.

When the job is done, you can click the View button near the tool name to open the tool in Mink mode. If you have more corpora installed from Mink, you can enable one or more of them in the corpus selector, and searches and statistics will include them all. See the Korp user manual for more help (but note that it is written for the public use case, without Mink in mind). A user manual for Strix is in the works.

If there is an error with the installation job, try changing source files or settings according to the warning/error output in the Status panel. You will then probably have to re-run annotation before retrying the installation.