Data selection

Strix contains a diverse collection of corpora (datasets or documents), ranging from historical to modern data. Some corpora in Strix are open access and can be viewed without restrictions, while others are available only after logging in. Each corpus provides a unique perspective, allowing users to explore and analyze textual data in detail.

Each corpus in Strix belongs to one or more modes. A mode groups corpora by the type of collection. For example:

  • Data from 1900 to the present is categorized under the Modern mode.

  • The Mink mode is available for users who are logged in and have personal collections in Strix. This mode allows users to access and analyze their private datasets securely.

  • The Parallel mode is designed for datasets where each document has a corresponding reference document. This mode is useful for tasks like translation alignment, OCR correction, or comparing student essays with teacher corrections.

  • Other modes are created based on specific collections.

More details about modes can be found in the Modes section.

Each mode contains a list of corpora that can be selected or deselected, as explained in the Corpora section. Additionally, the Corpus details section provides a brief description of each corpus.
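
The relationship described above (a corpus belongs to one or more modes, some corpora require login, and corpora can be selected or deselected within a mode) can be pictured with a small sketch. This is only an illustration of the idea, not how Strix is implemented: the `Corpus` class, its fields, the `corpora_in_mode` function, and the corpus names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Corpus:
    # Hypothetical model for illustration; not the Strix data format.
    name: str
    modes: set[str]               # a corpus may belong to several modes
    requires_login: bool = False  # open access unless stated otherwise
    selected: bool = True         # corpora can be selected or deselected


def corpora_in_mode(corpora, mode, logged_in=False):
    """Return names of selected corpora in `mode` that the user may view."""
    return [
        c.name
        for c in corpora
        if mode in c.modes
        and c.selected
        and (logged_in or not c.requires_login)
    ]


# Made-up example corpora.
corpora = [
    Corpus("news-2010", {"modern"}),
    Corpus("essays", {"modern", "parallel"}),
    Corpus("my-notes", {"mink"}, requires_login=True),
]

corpora[0].selected = False  # deselect one corpus in the list

print(corpora_in_mode(corpora, "modern"))
print(corpora_in_mode(corpora, "mink"))
print(corpora_in_mode(corpora, "mink", logged_in=True))
```

In this sketch the "essays" corpus appears in both the modern and parallel modes, and the private "my-notes" corpus is listed only when `logged_in` is true, which mirrors how the Mink mode is described above.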