This post is based on joint work with Gerlof Bouma. Illustrations by Jan and Julija.
Here’s a sad story (it’s fictional, but sad nonetheless).
Matthias, Pernilla and Ingvar were working as computational linguists, and within a certain project painstakingly created an ingenious dataset. The community, however, did not show much interest in the dataset, and it was largely forgotten. Years went by. Matthias died. Pernilla invented a clever algorithm and became a multi-billionaire. Ingvar moved to the USA, happened to witness a crime and had to enter a witness protection program.
Years went on, and their dataset was rediscovered. It turned out the trio had been ahead of their time: at the time of creation, the dataset had been a solution waiting for a question, but later, once new methods had been developed, the question was also in place, and the dataset became invaluable. Euphoric language technologists rushed to it, but their high hopes were soon crushed. In order to use the dataset, they had to figure out:
- what the numbers in column 3 mean;
- why some values are missing from the “Total” column;
- what “Don’t use: GRPPFBH!” in the “Comments” section means;
- where the images in collection 2 come from;
- why the texts in collection 45 differ from the originals;
- and approximately 647 similar questions.
Unfortunately, there was no documentation, everything was in the heads of the dataset creators. The U.S. refused to provide Ingvar’s contact details. Pernilla’s secretary’s secretary’s secretary said he would pass on the request, but apparently never did. Matthias was also difficult to get in touch with. In short, the dataset turned out to be unusable.
The story is fictional, but many of us may have experienced a similar one (perhaps on a smaller scale): some excellent data cannot be used because they were not properly documented in time. Moreover, many of us are actually guilty of not documenting our data properly (and yes, we at Språkbanken Text, too). And, obviously enough, nobody has to die for this to become a problem: even if Matthias were alive and active, his knowledge would not be directly available to other people. Actually, some important background knowledge about the dataset may not even come to his own attention until he expresses it in some explicit form.
Nowadays, many actors are investing effort in documenting their own datasets and in establishing best documentation practices. In the same spirit, we created a documentation template for the SuperLim collection that we are currently working on. The template will be tested on all 13 SuperLim resources (it is already in place for some of them), and, if it works well, it will probably also be used (mutatis mutandis) for all Språkbanken Text resources.
The template was largely inspired by Gebru et al.’s Datasheets for datasets concept (and its streamlined version, data cards). The key problem when designing it was to balance the following desiderata:
- it must be detailed enough to provide all the basic information about the dataset, as well as links to more detailed information;
- it must be concise enough to be easily read;
- it must not be too difficult to fill in (otherwise people would prefer witness protection programs to documenting datasets);
- it must be rather universal, so that it can be applied to different types of resources;
- it must not annoy those who fill it in by asking too many irrelevant questions.
In other words, it has to be small and large, rigid and flexible at the same time.
Our solution consists of six blocks of questions: I. IDENTIFYING INFORMATION (basic information about the dataset and its creators), II. USAGE (why it was created and how it can be used), III. DATA (the description of the contents), IV. ETHICS AND CAVEATS (ethical considerations, potential pitfalls, discouraged usage), V. ABOUT DOCUMENTATION (metametadata: versions, dates and so on) and VI. OTHER.
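To make the structure concrete, the six blocks can be sketched in machine-readable form, for instance as a simple Python dictionary. Note that the section names below follow this post, but the example questions are purely illustrative assumptions, not the actual wording of the template:

```python
# A hedged sketch of the six-block documentation template.
# The example questions under each block are invented for illustration;
# the real template's questions may differ.
datasheet_template = {
    "I. Identifying information": [
        "Name of the dataset",
        "Creators and contact details",
    ],
    "II. Usage": [
        "Why was the dataset created?",
        "What tasks can it be used for?",
    ],
    "III. Data": [
        "What do the instances represent?",
        "What does each column/field contain?",
    ],
    "IV. Ethics and caveats": [
        "Ethical considerations",
        "Known pitfalls and discouraged usage",
    ],
    "V. About documentation": [
        "Version and date of this datasheet",
    ],
    "VI. Other": [
        "Anything not covered above",
    ],
}

def missing_answers(filled: dict) -> list:
    """Return the template questions left unanswered in a filled-in datasheet.

    `filled` maps block names to dicts of question -> answer.
    """
    return [
        question
        for block, questions in datasheet_template.items()
        for question in questions
        if not filled.get(block, {}).get(question)
    ]
```

One advantage of keeping the template in a structured form like this is that completeness can be checked automatically: `missing_answers({})` simply lists every question that still lacks an answer.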
You are also welcome to use the template!