Adding your own text data to Strix

Strix allows users to upload their own text data (a collection of documents) and leverage its advanced functionalities to analyze, visualize, and explore the data. This feature is particularly useful for researchers, linguists, and organizations looking to exploit their custom datasets.


Why add your own data?

By adding your own text data to Strix through Mink, you can:

  • Perform advanced searches: Use Simple search and Document search to explore your custom datasets.
  • Visualize metadata: Analyze metadata attributes and geo-locations using Statistics and Maps.
  • Explore semantic relationships: Use the Related documents feature to uncover connections between documents.
  • Analyze linguistic patterns: Dive into word, sentence, and text-level metadata attributes for deeper insights.

What is Mink?

Mink is Språkbanken's data platform that allows users to upload their collections and apply advanced language technology methods to their texts. The resulting annotated data can be:

  • Downloaded for offline use.
  • Integrated into research tools like Korp and Strix for further analysis.

You can read more about Mink and its documentation and tutorials at https://spraakbanken.gu.se/en/tools/mink.

All data uploaded to Mink is securely stored behind a login and is not publicly available to other users.


How to add your data to Strix

Below are the steps to upload your data in Mink, annotate it, and make it available in Strix.


1. Prepare your data

Before uploading your data, ensure it meets the following requirements:

  • File format: Supported formats are .txt, .docx, .odt, .pdf, and .xml.
  • Metadata: Include metadata for each document (e.g., title, author, year, genre) as tags/attributes if the file format is .xml.
  • Encoding: Use UTF-8 encoding to ensure compatibility.
  • File size: Ensure individual files do not exceed the maximum upload size (e.g., 10 MB per file).
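
As an illustration of the metadata requirement above, here is a minimal Python sketch that wraps a plain-text document in an XML element and stores its metadata as attributes. The element name "text", the attribute names, and the helper name build_document are assumptions made for this example, not names required by Mink; adapt them to the structure your corpus configuration expects.

```python
import xml.etree.ElementTree as ET


def build_document(text, metadata):
    """Return an XML string whose root element carries the metadata as attributes.

    The root element name "text" is an assumption for this sketch.
    """
    # XML attribute values must be strings, so convert numbers such as years.
    attributes = {key: str(value) for key, value in metadata.items()}
    root = ET.Element("text", attributes)
    root.text = text  # special characters such as "&" are escaped on output
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical example document and metadata values.
    xml = build_document(
        "Global uppvärmning är en av vår tids stora frågor.",
        {"title": "Klimattal", "author": "Example Author", "year": 2021, "genre": "speech"},
    )
    # Write the result as UTF-8, matching the encoding requirement above.
    with open("klimattal.xml", "w", encoding="utf-8") as handle:
        handle.write('<?xml version="1.0" encoding="UTF-8"?>\n' + xml + "\n")
```

Building the element with a library rather than by string concatenation keeps the file well-formed even when the text or a metadata value contains characters such as "&" or "<".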

2. Upload your data

  1. Log in to the Mink platform.
  2. Create a corpus and give it a name.
  3. Select your files or drag and drop them into the upload area.
  4. Edit the configuration if needed. By default, Mink creates a configuration for each corpus, which includes the following annotations added to each document using the Sparv annotation tool:
    • Part of speech tags
    • Base form (Lemma)
    • Morphosyntactic tags (MSD)
    • Dependencies
    • Sentiment labels
  5. Run the annotation process.
  6. Once the annotation is completed, the annotated data will be ready for download and available for installation in Strix and Korp.

3. Install and index your data in Strix

After annotating your data, install the corpus into Strix:

  1. Install the annotated corpus into Strix from the Mink platform.
  2. Strix will automatically index your data to make it searchable and compatible with its advanced features.
  3. Monitor the indexing progress in the Status section in Mink.
  4. Once the installation is complete, the Status section will display a "Done" message.

4. Access your data in Strix

After indexing, your data will appear under the Mink mode (personal collections) in Strix. You can either:

  1. Go to Strix and log in to view your data in Mink mode, or
  2. Follow the link from Mink to Strix by clicking the Open button next to the Install button.

Once you are in Strix, you can:

  • Select your dataset to perform searches and visualizations.
  • Combine your dataset with other existing corpora for comparative analysis.

Example use case: Analyzing global warming

Imagine you are a researcher studying global warming and its representation in political speeches. You have a collection of speeches and reports that you want to analyze. Here’s how you can use Mink and Strix to explore your data:

  1. Upload your data:

    • Prepare your collection of speeches and upload them to Mink.
    • Annotate the data using Sparv to add linguistic metadata like part of speech tags and sentiment labels.
  2. Install in Strix:

    • Install the annotated corpus into Strix from Mink; Strix then indexes it automatically.
  3. Perform searches:

    • Use Simple search to find occurrences of terms like global warming (global uppvärmning) or climate change (klimatförändring).
    • Use Document search to explore semantically similar documents discussing renewable energy or sustainability.
  4. Visualize metadata:

    • Use the Statistics tab to analyze the frequency of terms like "carbon emissions" or "renewable energy."
    • Use the Maps tab to visualize geo-locations mentioned in the speeches, such as references to international climate agreements.
  5. Explore related documents:

    • Use the Related documents feature to find connections between speeches from different political parties or organizations.

By following these steps, you can uncover patterns, trends, and insights into how global warming is discussed in your dataset.


Troubleshooting and support

If you encounter any issues while uploading or indexing your data:

  • Ensure that your files meet the format and size requirements.
  • Check the Status section in Mink for error messages or warnings.
  • Contact the Strix support team at sb-info@svenska.gu.se for assistance.
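
The first check in the list above can be run locally before uploading. The following Python sketch tests each file against the supported formats, a size limit, and UTF-8 encoding. The 10 MB limit is taken from the example figure in the requirements and may differ from the limit Mink actually enforces; the function name check_file and the folder name my_corpus are hypothetical.

```python
from pathlib import Path

# Formats listed under "Prepare your data".
ALLOWED_EXTENSIONS = {".txt", ".docx", ".odt", ".pdf", ".xml"}
# Assumption: 10 MB is only the example figure given above, not a confirmed limit.
MAX_BYTES = 10 * 1024 * 1024
# The UTF-8 check only makes sense for plain-text formats.
TEXT_EXTENSIONS = {".txt", ".xml"}


def check_file(path, max_bytes=MAX_BYTES):
    """Return a list of problems found for one file (empty if none)."""
    path = Path(path)
    problems = []
    suffix = path.suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {suffix or '(no extension)'}")
    size = path.stat().st_size
    if size > max_bytes:
        problems.append(f"file too large: {size} bytes")
    if suffix in TEXT_EXTENSIONS:
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            problems.append("not valid UTF-8")
    return problems


if __name__ == "__main__":
    # Hypothetical folder holding the files you intend to upload.
    for file_path in sorted(Path("my_corpus").glob("*")):
        if file_path.is_file():
            for problem in check_file(file_path):
                print(f"{file_path.name}: {problem}")
```

A file that passes these checks can still fail during annotation, so the Status section in Mink remains the authoritative source for error messages.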

Start uploading your data today and unlock the full potential of Strix for your research!