Mink user manual

In Mink, you can create Corpora and Lexicons, and possibly more resource types in the future. These contain very different kinds of data and are processed differently, but the workflow is largely the same for both types.

This manual will first give an overview of the main workflow, and then go into the details of each resource type.

Contents

  1. Creating a resource and uploading source files
  2. Sharing
  3. Source files
  4. Resource configuration
  5. Analysing
  6. Downloading results
  7. Installing results in tools
  8. Corpus resources
  9. Lexicon resources

Creating a resource and uploading source files

Any sources you upload to Mink must belong to a resource. You can group related source files in one resource, and you can create as many resources as you want.

You can create a resource and upload your files in either of two ways:

A. Upload files directly and have a resource automatically created

  1. Go to Mink and click My data
  2. Use the Upload files area at the bottom of the New resource section to upload one or more files from your device
  3. A resource is created automatically, with the uploaded file added under Source files
  4. The resource is given an automatically generated identifier ("mink-" followed by a random string of characters), but it is better to give it a more informative name (see the Metadata tab section)

B. Create a resource, then upload a file to it

  1. Go to Mink and click My data
  2. Click one of the New... buttons in the New resource section
  3. Fill out the form fields as necessary
  4. Click Create
  5. On the next page, scroll to the bottom and use the Upload files section to upload files from your device

Sharing

If you need to collaborate with someone on your resource, click the Manage access button in the Sharing panel to view the resource in our authentication system, SB-Auth. There you can see who has access and add new users. A user's access to a resource is set to one of three levels: READ, WRITE or ADMIN. The Sharing panel in Mink explains what each level means.

There is currently no built-in functionality for publishing a resource to the general public. If you want to do this, get in touch with us at sb-info@svenska.gu.se.


Source files

The source files of a resource are listed in a panel on the resource overview page.

Managing source files

Once a resource has been created, you can upload source files to it at any time. Use the upload area at the bottom of the Source files panel, or drag a file anywhere onto the resource overview page.

If you upload a file twice (or upload different files with the same name), the most recently uploaded file replaces the earlier one.
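
Because files are matched by name, it can be useful to check a local collection for name clashes before uploading, for example when the files are spread over several folders. The following is a minimal Python sketch; the function name find_name_collisions and the example paths are invented for illustration and are not part of Mink.

```python
from collections import defaultdict
from pathlib import PurePath


def find_name_collisions(paths):
    """Return {filename: [paths]} for every filename that occurs more than once."""
    by_name = defaultdict(list)
    for path in paths:
        # Only the final path component counts; the folder is ignored.
        by_name[PurePath(path).name].append(str(path))
    return {name: found for name, found in by_name.items() if len(found) > 1}


# Hypothetical local files: two of them share the name "notes.txt".
collisions = find_name_collisions(
    ["2021/notes.txt", "2022/notes.txt", "2022/other.txt"]
)
```

Any filename reported here would end up as a single source file in the resource, so one of the files needs a new name first.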

You can delete a source file by using the trash can button in the file listing.

Source file details

Click a filename to view more details and view or download the file itself.

Size limits

There is a maximum file size for a source file, and a maximum total size for the source files in a resource. There are also recommended maximum file and total sizes. The recommended sizes are not exact limits, but exceeding them significantly may slow the analysis job down considerably.

If you need to exceed the maximum limits for your project, please get in touch with Språkbanken Text at sb-info@svenska.gu.se.

Privacy and data retention

See the Privacy and data policy


Resource configuration

In the resource overview, the Configuration panel shows a summary of the resource configuration. The configuration informs the analysis process about your resource. Click Edit to open the configuration form. The form is split into tabs, with buttons at the bottom for saving and other actions. If you make changes in multiple tabs, clicking Save will apply all of them.

The Metadata tab

The resource metadata do not affect the analyses per se, but are forwarded to the output data.

When a Mink resource is created, it is assigned an identifier in the form of mink-<random string>. You can also provide Name and Description in Swedish and/or English to help distinguish one resource from another.

The Settings tab

The settings control the analysis process, and the fields are different for Lexicons and Corpora. See their respective sections below.

Custom config

At the bottom of the configuration page, click the Custom config button to take full control of the configuration.

Here you can read the exact resource config that is generated from your selections in the form. The config is in the YAML format. To make changes beyond the standard configuration form, you can edit the file in the browser or upload a new YAML file.

When editing, you will get hints when your changes do not conform to the YAML format and the config schema.

After modifying the YAML config, please make sure not to save from the standard configuration form again, as that will overwrite your changes.

Deleting a resource

Use the red Delete button at the bottom of the configuration page to delete your resource. It will also be removed from the explore tools, if you have installed it there (see Installing results in tools).


Analysing

Use the Analyse button in the Analyse panel to create an analysis job and place it in a queue. A few jobs can run in parallel, so your job may start right away.

Once the job is queued, the Status panel refreshes at short intervals to show the current job status. If you change your mind about the analysis, you can stop it with the Abort button.

In case of an error with the job, please consult the warning/error output in the Status panel. Try changing the source files or configuration, and try again. If you need help, please contact Språkbanken Text at sb-info@svenska.gu.se.

Re-analysing

If you have changed the source files or the configuration, use the Analyse again button to run a new analysis.


Downloading results

The product of a successful analysis is a set of export files. These contain source data and metadata along with any enrichment or statistics added by the analysis. Most of these are meant for machine use in custom scripts or other software.

You can download all export files at once using the link labeled Archive, or click the Single files link to view the list of files, organized in folders by type, and download them one by one.



Installing results in tools

Språkbanken's tools are well known in the language research community for providing advanced search functionality to a large repository of public language data. In Mink, you can use some of these tools for your own data, in private sections that only you can access.

Installation jobs

To sync your analysed data to one of the tools, click the Install button next to the tool name in the Explore panel. This will create and queue an installation job, similar to the analysis job.

When the job is done, you can click the Open button next to the tool name. If you have several resources installed from Mink, you can enable one or more of them in the corpus selector. See the corresponding user manual for more help (but note that the manuals may be written for the public use case, without Mink in mind).

If there is an error with the installation job, try changing source files or configuration according to the warning/error output in the Status panel. You will then probably have to re-analyse before retrying the installation.

To remove resource data from an exploration tool, click the Uninstall button near the tool name. The data will also be removed if you delete the whole resource (see Deleting a resource).


Corpus resources

This section contains more details specific to resources of the Corpus type.

The analysis of a corpus is carried out by Sparv.

Creating a corpus

As an exception to the general workflow, if you use the New corpus button and select XML as Source format, you will be required to enter a value for the Text element field. See Text element for more information.

Corpus source files

The source files should contain text data which will form the very content of the corpus. They must all be in the same file format, and the corpus is automatically configured to match it. For audio formats, the annotation procedure will use automatic speech recognition (ASR) to produce the text.

Corpus configuration

The corpus configuration page has specific fields in the Settings tab, and additionally contains an Analyses tab. Read about them below.

Settings fields

Source format

All source files in a corpus must be of the same format, and it needs to match the setting here. Set it according to your source files. If you have already uploaded files of another format, you will not be able to change it until you remove the files.

The setting relies on a one-to-one mapping between file formats and filename extensions. If you have different extensions, for example plain text files without any extension, you will need to rename the files accordingly. Extensions in uppercase will automatically be changed to lowercase.
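
Renaming many files by hand is tedious, so the step can be scripted. Below is a rough Python sketch that adds a missing extension and lowercases existing ones. It assumes that files without an extension are plain text and should get .txt; the function names and that default are illustrative choices, not something Mink prescribes. Lowercasing only mirrors what Mink already does automatically.

```python
from pathlib import Path

# Assumption for this sketch: files without an extension are plain text.
DEFAULT_EXTENSION = ".txt"


def normalized_name(filename: str, default_extension: str = DEFAULT_EXTENSION) -> str:
    """Return the filename with a lowercase extension, adding one if missing."""
    path = Path(filename)
    if path.suffix == "":
        # No extension at all: append the default one.
        return path.name + default_extension
    return path.stem + path.suffix.lower()


def rename_sources(folder: Path, default_extension: str = DEFAULT_EXTENSION) -> list[tuple[str, str]]:
    """Rename the files directly inside `folder`; return (old, new) name pairs."""
    renamed = []
    for path in sorted(folder.iterdir()):
        if not path.is_file():
            continue
        new_name = normalized_name(path.name, default_extension)
        if new_name != path.name:
            path.rename(path.with_name(new_name))
            renamed.append((path.name, new_name))
    return renamed
```

Calling rename_sources(Path("my_corpus")) on a local folder (a hypothetical path) renames the files in place, so it is best tried on a copy first.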

Text element

If the file format is XML, you need to specify which XML element represents a text. For instance, if your files have the structure shown in the example below, you would enter article in this setting.

<article>
  <heading>Lorem ipsum dolor sit amet</heading>
  <body>Donec dapibus mauris lectus, ut finibus odio vestibulum ut.</body>
</article>

<article>
  <heading>Nunc convallis et purus in bibendum</heading>
  <body>Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae.</body>
</article>

It follows that all XML files in the same corpus must follow the same structure, or at least use the same text element (article in this example). Some more (technical) information is available in the Sparv docs, under Import Options and the item import.text_annotation.
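
One way to check local files before uploading is to count how often the chosen text element occurs in each of them. The Python sketch below does this with the standard library. It assumes each file is well-formed XML with a single root element; the articles wrapper in the sample and the function name count_text_elements are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample data; the <articles> root element is a hypothetical wrapper.
SAMPLE = """<articles>
  <article>
    <heading>Lorem ipsum dolor sit amet</heading>
    <body>Donec dapibus mauris lectus.</body>
  </article>
  <article>
    <heading>Nunc convallis et purus in bibendum</heading>
    <body>Vestibulum ante ipsum primis in faucibus.</body>
  </article>
</articles>"""


def count_text_elements(xml_string: str, text_element: str) -> int:
    """Count how many times `text_element` occurs anywhere in the document."""
    root = ET.fromstring(xml_string)
    # iter() walks the root element and all of its descendants.
    return sum(1 for _ in root.iter(text_element))


article_count = count_text_elements(SAMPLE, "article")
```

A count of zero for a file suggests that it does not use the element entered in the Text element setting.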

Existing sentence segmentation

This option can be used in case you have already prepared your source texts by putting each sentence on its own line. Using this option will instruct Sparv to skip the built-in sentence segmentation analysis and use your segmentation instead.
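
If your sentences are already available as separate strings, for example from your own segmentation tool, writing them out one per line is straightforward. The Python sketch below is one possible way to do it; the function name to_sentence_lines is hypothetical. It collapses whitespace inside each sentence so that a stray line break cannot split a sentence in two.

```python
def to_sentence_lines(sentences):
    """Return text with exactly one sentence per line.

    Whitespace runs inside a sentence (including newlines) become a single
    space, and empty sentences are dropped.
    """
    lines = []
    for sentence in sentences:
        cleaned = " ".join(sentence.split())
        if cleaned:
            lines.append(cleaned)
    return "".join(line + "\n" for line in lines)


text = to_sentence_lines(["Hej  där.\n", "", "Hur mår\ndu?"])
```

The resulting string can be saved as a UTF-8 plain text file and uploaded as a source file.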

Time span

These fields can be used to coarsely timestamp your material for the Korp and Strix tools, described in the Installing results in tools section. You can, for instance, create a corpus for each year in a time period you are studying, in order to get trend diagrams (see the Korp manual) or search by year. For a corpus of the year 2022, for instance, you would set Time span: From to 2022-01-01 and Time span: To to 2022-12-31.
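
When creating one corpus per year, the From and To values follow a fixed pattern. As a small illustration, this Python sketch computes them for a range of years; the function names are invented for the example, and the values still have to be entered in the form by hand.

```python
from datetime import date


def year_time_span(year: int) -> tuple[str, str]:
    """Return the (From, To) dates covering one calendar year, as YYYY-MM-DD."""
    first_day = date(year, 1, 1)
    last_day = date(year, 12, 31)
    return first_day.isoformat(), last_day.isoformat()


def yearly_time_spans(first_year: int, last_year: int) -> dict[int, tuple[str, str]]:
    """Return one (From, To) pair per year, for one corpus per year."""
    return {year: year_time_span(year) for year in range(first_year, last_year + 1)}


spans = yearly_time_spans(2021, 2022)
```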

The Analyses tab

In this section, you can select from specific analyses. Each analysis can provide one or more annotations. For more information on a single analysis, follow the link to its page on the Språkbanken website.

Custom config

The config format is described in detail in the Sparv documentation: Corpus Configuration.

Downloading corpus results

Listed below are the export types that Sparv creates for a corpus.

  • Token frequencies in the stats_export.frequency_list folder
  • CoNLL-U in the conll_export folder. This is a standardized format that can be used in various language processing software. It can also be imported to spreadsheet editors such as Excel, Numbers or LibreOffice Calc. Make sure you select UTF-8 as encoding and tabs as separator.
  • Comma-separated values (CSV) in the csv_export folder: A tabular format resembling CoNLL-U but adapted to arbitrary annotations. Despite the name, it also uses tabs as separator.
  • Plain text: If the source file format is non-plain-text, like audio or PDF, the text_export folder has the extracted plain text of each document. This is what gets used as input to the rest of the analysis.
  • XML with one token element per line: The xml_export.pretty folder contains result data as XML. Each file typically has the text segmented into the "text", "sentence" and "token" levels, with tag names accordingly. If the source files were in XML, however, this structure is merged with the existing structure in the source files. Annotations and metadata are encoded as XML attributes.
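
The tab-separated exports can also be read with a short script instead of a spreadsheet editor. The Python sketch below groups token rows into sentences, following the general CoNLL-U conventions: one token per line, comment lines starting with #, and a blank line between sentences. The sample data and the function name read_sentences are made up, and the exact columns in a Mink export may differ from the sample.

```python
import io

# Made-up sample following general CoNLL-U conventions (10 columns per token).
SAMPLE = "\n".join([
    "# text = Hej världen",
    "\t".join(["1", "Hej", "hej", "INTJ", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", "världen", "värld", "NOUN", "_", "_", "1", "vocative", "_", "_"]),
    "",
    "# text = Ok",
    "\t".join(["1", "Ok", "ok", "INTJ", "_", "_", "0", "root", "_", "_"]),
    "",
])


def read_sentences(stream):
    """Group tab-separated token rows into sentences, split on blank lines."""
    sentences, current = [], []
    for line in stream:
        line = line.rstrip("\r\n")
        if line.startswith("#"):
            continue  # comment line, carries no token
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split("\t"))
    if current:
        sentences.append(current)
    return sentences


sentences = read_sentences(io.StringIO(SAMPLE))
# Column index 1 holds the word form in CoNLL-U.
forms = [[row[1] for row in sentence] for sentence in sentences]
```

For a downloaded file, pass an open file object instead of the sample, opened with encoding="utf-8".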

Corpus tools

You can install your analysed corpus in the following explore tools:

  • Korp (try it, read more) is Språkbanken's word research platform. It allows searches with different degrees of complexity and shows results in a keyword-in-context (KWIC) layout. You can also get token frequencies by various annotation values and more.
  • Strix (try it, read more) is Språkbanken's text research platform. It offers advanced filters and a reading mode in which you can choose any metadata attribute and the corresponding words will be highlighted in the text. This may help you to see a text from a different perspective.

Lexicon resources

This section contains more details specific to resources of the Lexicon type.

Lexicon source files

The source files should be in the JSON Lines format. If you already know JSON, this is essentially a file with exactly one JSON object per line.

Here is a simple example:

{"baseform": "abonnemang", "pos": "NOUN", "cefr": 5}
{"baseform": "abort", "pos": "NOUN", "cefr": 3}
{"baseform": "absolut", "pos": "ADJ", "cefr": 2}

This models a lexicon that uses three fields (baseform, pos and cefr) and has three entries.
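
To check a file locally before uploading, each line can be parsed on its own. The following Python sketch does this with the standard json module and reports the first offending line. The function name parse_json_lines is hypothetical, and skipping blank lines is a leniency of this sketch, not necessarily of Mink.

```python
import json

SAMPLE = (
    '{"baseform": "abonnemang", "pos": "NOUN", "cefr": 5}\n'
    '{"baseform": "abort", "pos": "NOUN", "cefr": 3}\n'
    '{"baseform": "absolut", "pos": "ADJ", "cefr": 2}\n'
)


def parse_json_lines(text: str) -> list[dict]:
    """Parse JSON Lines text into a list of dicts, one per non-blank line."""
    entries = []
    # Split on "\n" only, so other line-break characters inside strings are kept.
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # blank lines are skipped in this sketch
        try:
            value = json.loads(line)
        except json.JSONDecodeError as error:
            raise ValueError(f"line {number}: not valid JSON ({error.msg})") from error
        if not isinstance(value, dict):
            raise ValueError(f"line {number}: expected a JSON object")
        entries.append(value)
    return entries


entries = parse_json_lines(SAMPLE)
```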

Lexicon configuration

In the settings of a Lexicon, you must provide the name of the field whose values are used as the entries. It is set to baseform by default, which models the case of a typical dictionary.
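
Before choosing the entry field, it can be worth confirming that every object in the source data actually has a value for it. The Python sketch below is one possible check. The function name entries_missing_field is invented, and whether Mink itself requires the field in every entry is an assumption here.

```python
def entries_missing_field(entries, field="baseform"):
    """Return the 1-based positions of entries with no usable value for `field`."""
    missing = []
    for position, entry in enumerate(entries, start=1):
        # A missing key, null or an empty string all count as "no value".
        if entry.get(field) in (None, ""):
            missing.append(position)
    return missing


# Hypothetical entries: the second one lacks the entry field.
sample_entries = [
    {"baseform": "abonnemang", "pos": "NOUN"},
    {"pos": "NOUN"},
    {"baseform": "absolut", "pos": "ADJ"},
]
problems = entries_missing_field(sample_entries)
```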

Downloading lexicon results

The result files of a Lexicon analysis are needed for the next step of installing the data in Karp. We do not really expect them to be very useful for you as a user. But if they are, feel free to use them.

They include:

  • SQL files for adding or removing the content in the database of Karp
  • The source data in a normalized format

Lexicon tool: Karp

Install your analysed lexicon in Karp (try it, read more) to search multiple lexicons at the same time or examine word statistics.