You have created a language resource. Now, how do you help people find it?
Resource metadata workflow
The main source of metadata for the resource is a YAML file which is parsed and included on the website. It is automatically converted into a META-SHARE XML file which can be imported to the repository where it can be assigned a Handle or DOI identifier (PID).
Who is responsible?
| Task | Responsible |
|---|---|
| Metadata YAML file | Resource owner (you) |
| Metadata XML file | Created automatically |
| Import to repository | Sysadmin |
We plan to create a web interface that helps with creating these files, but for now please follow these steps:
- Create a YAML file, formally describing your language resource
- Start by copying the template for your resource type (corpus, lexicon or model). The templates can be downloaded here.
- If you are not sure how to fill in some of the fields, you could check other YAML resource files for comparison, keep the default values, or ask for help (e.g. in the #metadata channel on Slack).
- If the resource is public, don't forget to specify a URL in the download section and/or a link (e.g. to Korp or Karp) in the access attribute in the interface section.
- Please run a syntax check on your YAML file with a validation tool (e.g. this one) to catch general errors in the markup.
- Save the file as
shortname is a name made up of lowercase letters, numbers and dashes (e.g.
- This name is used as the resource ID and needs to be unique within Språkbanken.
- Upload the file to GitHub.
- Add the file to the corpus, lexicon or model folder in this GitHub repository. This can be done via GitHub's web interface or via a terminal.
- The following day, your resource will be listed under Data on the SBX website. If it isn't, please contact the web admin at firstname.lastname@example.org.
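The naming rule for the resource ID can be checked before uploading. The following is a minimal sketch, not an official tool: it assumes that "lowercase letters" means ASCII `a`-`z`, and both the function names and the example IDs are hypothetical. Uniqueness can only be checked against a list of existing IDs that you supply yourself.

```python
import re

# Assumption: "lowercase letters, numbers and dashes" is read as ASCII a-z,
# 0-9 and "-". The guide does not say whether other letters are allowed.
SHORTNAME_PATTERN = re.compile(r"[a-z0-9-]+")


def is_valid_shortname(name: str) -> bool:
    """Return True if the whole name consists only of allowed characters."""
    return SHORTNAME_PATTERN.fullmatch(name) is not None


def is_unused_shortname(name: str, existing_ids: set[str]) -> bool:
    """Return True if the name does not collide with a known resource ID.

    existing_ids is a hypothetical set that the caller collects, for
    example from the file names already present in the repository.
    """
    return name not in existing_ids


if __name__ == "__main__":
    for candidate in ["my-corpus-2", "My_Corpus", ""]:
        print(repr(candidate), is_valid_shortname(candidate))
```

A check like this only covers the character rule; whether the ID is unique within Språkbanken still has to be confirmed against the repository itself.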
There are two description fields in the YAML files. The short_description, which describes the resource in a few words or a sentence, should always be filled in.
Besides this short description, a user considering your resource is likely to need a few more sentences about its nature in order to judge its relevance. Data sources, the time period covered, means of text extraction, internal taxonomy, diagrams and links to blog posts are some examples of content that can make your resource more accessible. This information can be provided in the description field.
This is optional, but recommended. The description field may contain plain text or HTML.
Text fields can always contain plain text, i.e. text that is not formatted.
The following fields can handle HTML-formatted text:
- annotation (swe, eng)
- description (swe, eng)
- intended_uses (swe, eng)
- caveats (swe, eng)
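As an illustration of the list above, here is a hedged sketch that flags HTML tags in fields other than those four. It assumes the YAML file has already been loaded into a Python dictionary, that a field is either a plain string or a mapping with `swe` and `eng` keys, and that a simple tag pattern is enough to spot markup. The function name and the sample data are hypothetical.

```python
import re

# The four fields that the guide lists as able to handle HTML.
HTML_FIELDS = {"annotation", "description", "intended_uses", "caveats"}

# Rough heuristic for an opening or closing tag such as <p> or </b>.
TAG_PATTERN = re.compile(r"</?[a-zA-Z][^>]*>")


def fields_with_unexpected_html(metadata: dict) -> list[str]:
    """Return the names of top-level fields outside HTML_FIELDS that contain tags."""
    offending = []
    for field, value in metadata.items():
        if field in HTML_FIELDS:
            continue
        # Assumption: a field is a string or a {"swe": ..., "eng": ...} mapping.
        texts = value.values() if isinstance(value, dict) else [value]
        if any(isinstance(t, str) and TAG_PATTERN.search(t) for t in texts):
            offending.append(field)
    return sorted(offending)


if __name__ == "__main__":
    sample = {
        "short_description": {"swe": "En <b>korpus</b>", "eng": "A corpus"},
        "description": {"swe": "<p>Längre text</p>", "eng": "<p>Longer text</p>"},
    }
    print(fields_with_unexpected_html(sample))
```

The check only looks at top-level fields and does not validate the HTML itself, so it is a quick sanity check rather than a replacement for reviewing the file.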
Import to repository
The metadata repository is Språkbanken's node in the CLARIN network of language resources. It is meant to include all our resources, but work remains on building this part of the workflow. For now, importing the metadata of a resource into the repository is manual work done by the sysadmin, and only a subset of resources has been handled so far.
To request the import of a resource, please contact the sysadmin at email@example.com.
Modifying existing resources
If you want to modify the metadata of an existing resource, find its YAML file in this GitHub repository and update it (using the web interface or a terminal).
Uploading resource files
Resource data files (containing corpora, lexicons, models etc.) should not be uploaded to the metadata GitHub repository. Instead you can store these in SVN (e.g. lexicons: https://svn.spraakdata.gu.se/sb-arkiv/pub/) or some other safe place where we, as an organisation, can guarantee that the data never gets lost. If you are unsure where to put your data, you can always ask on Slack (#metadata) or contact the directors.
If anything is unclear, you are welcome to ask in #metadata on Slack!