You have created a language resource. Now, how do you help people find it?
Resource metadata workflow
The main source of metadata for the resource is a YAML file which is parsed and included on the website. It is automatically converted into a META-SHARE XML file which can be imported to the SweClarin repository.
Who is responsible?
| Metadata YAML file | Resource owner (you) |
| META-SHARE XML file | Created automatically |
| Data | See "Uploading data files" (below) |
| Import to SweClarin repository | Sysadmin |
YAML file
Please follow these steps:
- Create a YAML file, formally describing your language resource
- Start by copying the template for your resource type (corpus, lexicon or model). The templates can be downloaded here.
- If you are not sure how to fill in some of the fields, you could check other YAML resource files for comparison, keep the default values, or ask for help (e.g. in the #metadata channel on Slack).
- If the resource is public, don't forget to specify a URL in the download section and/or a link (e.g. to Korp or Karp) in the access attribute in the interface section.
- Please use some tool to perform a syntax check on your YAML file (e.g. this one) to avoid general errors in the markup.
- Save the file as {shortname}.yaml, where shortname is a name made up of lowercase letters, numbers and dashes (e.g. suc3 or svensk-fraktur-1626-1816).
- This name is used as the resource ID and needs to be unique within Språkbanken.
- Upload the file to GitHub.
- Add the file to the corpus, lexicon or model folder in this GitHub repository. This can be done via GitHub's web interface or via a terminal.
- The following day, your resource will be listed under Data on the SBX website. If it isn't, please contact the web admin at sb-webb@svenska.gu.se.
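The naming rule in the steps above can be checked before uploading. The following is a minimal sketch, not part of the official tooling; the function names are hypothetical, and it assumes that "lowercase letters" means the ASCII letters a-z, which the guide does not state explicitly.

```python
import re
from pathlib import Path

# Lowercase letters, digits and dashes only, as described in the steps above.
# Assumption: "lowercase letters" means ASCII a-z. No further rules
# (e.g. about leading dashes) are stated in the guide, so none are enforced.
SHORTNAME_RE = re.compile(r"[a-z0-9-]+")


def is_valid_shortname(name: str) -> bool:
    """Return True if the whole name consists of allowed characters."""
    return SHORTNAME_RE.fullmatch(name) is not None


def metadata_filename(shortname: str) -> str:
    """Build the file name {shortname}.yaml, rejecting invalid short names."""
    if not is_valid_shortname(shortname):
        raise ValueError(f"invalid shortname: {shortname!r}")
    return f"{shortname}.yaml"


def shortname_from_path(path: str) -> str:
    """Recover the resource ID from a metadata file path (the file stem)."""
    return Path(path).stem
```

For example, `metadata_filename("suc3")` gives the file name to use for the resource with ID suc3, while a name such as `SUC3` or `my corpus` raises a `ValueError`. The check does not verify that the ID is unique within Språkbanken; that still has to be confirmed against the existing files in the repository.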
Description fields
There are two description fields in the YAML files. The short_description field, which describes the resource in a few words or a sentence, should always be filled in.
Besides this short description, a user considering your resource will likely need a few more sentences describing its nature in order to judge its relevance. Data sources, the time period covered, text extraction methods, internal taxonomy, diagrams and links to blog posts are examples of content that can make your resource more accessible. This information can be provided in the description field.
The description field is optional, but recommended. It may contain plain text or HTML.
Text format
Text fields can always contain plain text, i.e. text that is not formatted.
The following fields can handle HTML-formatted text:
- annotation (swe, eng)
- description (swe, eng)
- intended_uses (swe, eng)
- references
- caveats (swe, eng)
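One way to catch HTML that has ended up in a plain-text field is sketched below. It works on metadata that has already been loaded into a Python dictionary (for instance with a YAML parser of your choice) and only inspects top-level fields. The function names, the tag-matching pattern and the example field values are assumptions made for illustration; the pattern is a rough heuristic, not a full HTML parser.

```python
import re

# Fields that may contain HTML, taken from the list above.
HTML_FIELDS = {"annotation", "description", "intended_uses", "references", "caveats"}

# Rough heuristic for an HTML tag such as <p>, </b> or <a href="...">.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")


def _texts(value):
    """Yield every string inside a value that may be a str, dict or list."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from _texts(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from _texts(item)


def fields_with_unexpected_html(metadata: dict) -> list:
    """Return the top-level fields that contain tags but are not HTML fields."""
    return sorted(
        key
        for key, value in metadata.items()
        if key not in HTML_FIELDS and any(TAG_RE.search(t) for t in _texts(value))
    )


# Hypothetical example values, not taken from a real resource.
example = {
    "short_description": {"swe": "En <b>korpus</b>", "eng": "A corpus"},
    "description": {"swe": "<p>Längre text</p>", "eng": "<p>Longer text</p>"},
}
print(fields_with_unexpected_html(example))  # ['short_description']
```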
Modifying existing resources
If you want to modify the metadata of an existing resource, find its YAML file in this GitHub repository and update it (using the web interface or a terminal).
Uploading data files (resources)
Resource data files (containing corpora, lexicons, models etc.) should not be uploaded to the metadata GitHub repository. Instead, store the data on:
- server: k2.spraakdata.gu.se
- path on k2: /home/ftp/sb-resurser/data
- access path in metadata YAML file: https://spraakbanken.gu.se/lb/resurser/data/
This directory is a safe place where we, as an organisation, can guarantee that the data never gets lost.
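The relation between the path on k2 and the access path used in the YAML file can be sketched as follows. This assumes that files and subdirectories under the data directory map one-to-one onto the URL path, which the guide implies but does not spell out; the function name and the example file name are hypothetical.

```python
from pathlib import PurePosixPath

# Values taken from the list above.
K2_DATA_DIR = PurePosixPath("/home/ftp/sb-resurser/data")
PUBLIC_BASE_URL = "https://spraakbanken.gu.se/lb/resurser/data/"


def public_url(path_on_k2: str) -> str:
    """Translate a file path on k2 into the URL to put in the metadata file.

    Raises ValueError if the path is not inside the data directory.
    """
    relative = PurePosixPath(path_on_k2).relative_to(K2_DATA_DIR)
    return PUBLIC_BASE_URL + relative.as_posix()


# Hypothetical file name, used only to show the mapping.
print(public_url("/home/ftp/sb-resurser/data/example-corpus.xml.bz2"))
```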
We usually compress data files using BZIP2, so that the file name ends with the suffix .bz2. For creating archives (i.e. combining several files into one), we often use TAR. A free and open source program that can handle both TAR and BZIP2 is 7-ZIP. It is available for a variety of platforms.
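As an alternative to a desktop tool, the same formats can be produced with Python's standard library. The sketch below shows one possible approach, with hypothetical function names: `compress_file` compresses a single file with BZIP2, and `make_archive` combines several files into one TAR archive compressed with BZIP2.

```python
import bz2
import shutil
import tarfile
from pathlib import Path


def compress_file(src: Path) -> Path:
    """Compress a single file with BZIP2, writing <name>.bz2 next to it."""
    dst = src.with_name(src.name + ".bz2")
    with open(src, "rb") as fin, bz2.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # streams the data, so large files are fine
    return dst


def make_archive(files: list, archive: Path) -> Path:
    """Combine several files into one BZIP2-compressed TAR archive."""
    with tarfile.open(archive, "w:bz2") as tar:
        for f in files:
            f = Path(f)
            # Store only the file name, not the full local directory path.
            tar.add(f, arcname=f.name)
    return archive
```

Calling `compress_file(Path("example.xml"))` would produce `example.xml.bz2` in the same directory, and `make_archive([...], Path("example.tar.bz2"))` would bundle the listed files; both file names are placeholders. The original files are left in place.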
If you are unsure where to put your data, and how, you can always ask on Språkbanken's Slack (#metadata) or contact Språkbanken Text (sb-info@svenska.gu.se).
Archive
For data that is no longer in use but might still have some value, remove the YAML file (metadata) from the GitHub repository, then create a .tar.bz2 archive of the YAML file and the data, and place it on the k2.spraakdata.gu.se server in the directory /home/ftp/sb-resurser/data/archive.
Import to SweClarin repository
The metadata repository is Språkbanken's node in the CLARIN network of language resources. It is meant to include all our resources, but work remains on building this part of the workflow. For now, importing the metadata of a resource into the repository is manual work done by the sysadmin, and only a subset of resources has been handled so far. To request the import of a resource, please contact the sysadmin at sb-info@svenska.gu.se.
Support
If anything is unclear, you are welcome to ask in #metadata on Slack!