Språkbanken Text is a part of Språkbanken.

sbx2hf: Towards syncing SBX resources with the Hugging Face Hub

Submitted by Felix Morger on 2025-02-26

The Hugging Face Hub is a platform for sharing models, datasets and demo apps, mainly for machine learning and deep learning. It has become one of the largest platforms of its kind and plays an integral role in developing AI today. Språkbanken has a similar goal: to make language data and language resources available to the NLP and broader research community. We publish these primarily on our resource pages, but we also plan to share them on Hugging Face in order to make it easier for machine learning practitioners to use them directly.

In order to automate this work for our vast collection of resources, we have developed a new tool called sbx2hf. This tool takes either a URL to an existing Språkbanken resource page or a list of Sparv .xml files and converts them into a valid Hugging Face Git repository, which can be shared on the Hugging Face Hub. Below is an example of how this can be done:


sbx2hf https://spraakbanken.gu.se/resurser/forarbeten1734 --push-to-hub

The --push-to-hub flag tells sbx2hf to upload the resulting repository to the Hugging Face Hub. The repository contains not only the data but also a README.md, and sbx2hf fills in the metadata on Hugging Face based on the SBX resource's available metadata. The SBX resource and the corresponding Hugging Face repository share the same name in their links.
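To make the README.md step more concrete, here is a minimal sketch of how such a file could be generated. It is not the actual sbx2hf implementation. It assumes only that Hugging Face dataset cards start with a YAML metadata block (keys such as pretty_name, language and license) at the top of README.md. The function names resource_id_from_url and render_dataset_card, the metadata dictionary and all of its values are hypothetical placeholders, not the real metadata of any SBX resource.

```python
from urllib.parse import urlparse


def resource_id_from_url(url: str) -> str:
    """Return the last path segment of a resource page URL (hypothetical helper)."""
    path = urlparse(url).path
    return path.rstrip("/").split("/")[-1]


def render_dataset_card(metadata: dict) -> str:
    """Render README.md text: a YAML metadata block followed by a short description.

    Values are written without YAML quoting, so this sketch only handles
    simple strings without special characters.
    """
    lines = ["---"]
    lines.append(f"pretty_name: {metadata['name']}")
    lines.append("language:")
    lines.extend(f"- {code}" for code in metadata["languages"])
    lines.append(f"license: {metadata['license']}")
    lines.append("---")
    lines.append("")
    lines.append(f"# {metadata['name']}")
    lines.append("")
    lines.append(metadata["description"])
    return "\n".join(lines) + "\n"


# Placeholder metadata for illustration only; real values would be read
# from the SBX resource's own metadata.
example = {
    "name": "example-resource",
    "languages": ["sv"],
    "license": "cc-by-4.0",
    "description": "Placeholder description.",
}

if __name__ == "__main__":
    print(resource_id_from_url("https://spraakbanken.gu.se/resurser/forarbeten1734"))
    print(render_dataset_card(example))
```

Deriving the name from the last URL segment reflects the idea that the SBX resource and the Hugging Face repository are named alike. How sbx2hf actually maps one to the other is documented in its GitHub repository.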


Work is now ongoing to upload these resources to the Hugging Face Hub. Some (like text corpora) can be uploaded immediately, while others, such as models, are not yet supported by sbx2hf. If a resource is available on the Hugging Face Hub, this is indicated on its resource page.


Find out more about how to use the tool on GitHub.