Using Språkbanken corpora in NLTK

Posted by Peter Ljunglöf 2019-10-02

At Språkbanken we collect resources, mainly lexica and corpora, most of them in Swedish. So far we have collected Swedish corpora totalling 13 billion words, in all kinds of genres and from all time periods.

Most of our corpora are not manually annotated, and the ones that are usually have only one kind of annotation (e.g., part of speech, lemmas, dependency structures, or constituent structure). To be able to use the same tools to analyse any corpus, we have devised an automatic annotation tool chain that adds annotations to all our corpora. This tool chain is called Sparv, and there is an online version where anyone can get their texts annotated. The Sparv tool chain relies heavily on NLTK, the Natural Language Toolkit, and we have been happy users of NLTK for several years.

One strange thing that struck me the other day was that we have been parasitising on NLTK without giving anything back. (Of course we don't have to "give anything back", because NLTK is a free, open-source project; and to be strict we are not parasites, because NLTK hasn't suffered from us using it; but I wanted to use the word "parasitising", so I did.)

Anyway, digressions aside. What I'm trying to say is that it's very difficult for NLTK users to experiment with our corpora. Or rather, it was very difficult. So I implemented a tiny little Python wrapper module to make it much easier to play with Språkbanken's corpora in NLTK.

Happy experimenting!

Usage

The wrapper is in the GitHub repository sb-nltk-tools, and I hope it's fairly simple to use. Just put the main file (sb_corpus_reader.py) in your working directory and import it. Currently the module exports only one class, SBCorpusReader.

You also need to download a corpus from Språkbanken. A good one to start with is Talbanken, which is not too large and is also manually part-of-speech annotated. Then decompress the download so that you have an .xml file.

Now use it thusly:

>>> import nltk
>>> from sb_corpus_reader import SBCorpusReader
>>> talbanken = SBCorpusReader('talbanken.xml')
>>> talbanken.sents()
[['Individuell', 'beskattning', 'av', 'arbetsinkomster'], ['Genom', 'skattereformen', 'införs', 'individuell', 'beskattning', '(', 'särbeskattning', ')', 'av', 'arbetsinkomster', '.'], ...]
>>> talbanken.tagged_words()
[('Individuell', 'JJ'), ('beskattning', 'NN'), ...]
>>> text = nltk.Text(talbanken.words())
>>> text.concordance("universitetet")
Displaying 3 of 3 matches:
  enligt doc. Göran Bergman , vid universitetet i Helsingfors , saknar hunden fö
 g bl.a. . Prof. Stig Larsson vid universitetet i Odense påpekar att man ofta ge
 itetsstaden ; i några fall utgör universitetet och studenterna fortfarande stad

That is, you use it in the same way as the NLTK book uses the Brown corpus and other English corpora, e.g., in chapters two, five or six.

There is a test script (sb_postagger_test.py) which you can run from the command line. It will train and evaluate several part-of-speech taggers on Talbanken.
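The test script itself is not reproduced here. As a rough idea of what training and evaluating a tagger on such a corpus involves, here is a minimal sketch using NLTK's UnigramTagger with a DefaultTagger backoff. The helper train_and_score, the 'NN' default tag and the toy sentences are my own assumptions for illustration; with the real corpus you would pass talbanken.tagged_sents() instead of toy_sents.

```python
import nltk

def train_and_score(tagged_sents, train_fraction=0.9):
    """Train a unigram tagger on the first part of the data and
    return (tagger, accuracy on the remaining part)."""
    tagged_sents = list(tagged_sents)
    split = int(len(tagged_sents) * train_fraction)
    train, test = tagged_sents[:split], tagged_sents[split:]

    # Words never seen in training fall back to the (assumed) default tag 'NN'.
    backoff = nltk.DefaultTagger('NN')
    tagger = nltk.UnigramTagger(train, backoff=backoff)

    # Count matching tags by hand instead of relying on a scoring method.
    correct = total = 0
    for sent in test:
        words = [word for word, _ in sent]
        for (_, gold), (_, guess) in zip(sent, tagger.tag(words)):
            correct += gold == guess
            total += 1
    return tagger, (correct / total if total else 0.0)

# Toy stand-in data; the tags are illustrative, not taken from Talbanken.
toy_sents = [
    [('Individuell', 'JJ'), ('beskattning', 'NN')],
    [('Individuell', 'JJ'), ('särbeskattning', 'NN')],
]
tagger, accuracy = train_and_score(toy_sents, train_fraction=0.5)
print(accuracy)
```

On a real corpus you would of course want a larger held-out set and probably a chain of taggers rather than a single unigram model.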

Notes, disclaimers

Python version

The module is designed for Python version ≥3.4 and NLTK version ≥3. I have only tested it with Python 3.7.4 and NLTK 3.4.5.

Språkbanken corpus format

The SBCorpusReader is tailor-made for Språkbanken's export format, which is an XML format in which two XML tags are never on the same line. There might be some corpora that are exported in another format, and those cannot be read by SBCorpusReader.
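To show why a line-oriented format is convenient, here is a minimal sketch of a line-by-line reader that uses only the standard library. This is not the actual SBCorpusReader code. The element names sentence and w, the pos attribute, and the assumption that each token element sits on its own line are my guesses at the format for the sake of the example, and iter_tagged_sents is a hypothetical helper.

```python
import io
import re
from xml.sax.saxutils import unescape

# Assumed token line: <w pos="NN" ...>word</w>, one token per line.
TOKEN_RE = re.compile(r'<w\b([^>]*)>(.*?)</w>')
POS_RE = re.compile(r'\bpos="([^"]*)"')

def iter_tagged_sents(lines):
    """Yield one sentence at a time as a list of (word, pos) pairs."""
    sent = None
    for line in lines:
        line = line.strip()
        if line.startswith('<sentence'):
            sent = []                      # a new sentence starts
        elif line.startswith('</sentence'):
            if sent is not None:
                yield sent                 # hand over the finished sentence
            sent = None
        elif sent is not None:
            match = TOKEN_RE.match(line)
            if match:
                pos = POS_RE.search(match.group(1))
                word = unescape(match.group(2))
                sent.append((word, pos.group(1) if pos else None))

# A small in-memory sample in the assumed format.
sample = io.StringIO('''<corpus id="demo">
<text>
<sentence id="s1">
<w pos="JJ">Individuell</w>
<w pos="NN">beskattning</w>
</sentence>
</text>
</corpus>
''')
tagged = list(iter_tagged_sents(sample))
print(tagged)
```

Because every line can be handled on its own, the reader never has to build a full XML tree, which is what makes this approach feasible for very large files.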

Large corpora

Språkbanken has lots of large corpora, and this wrapper should work regardless of corpus size... if your computer can handle it. NLTK's corpus reader is lazy and only reads parts of the corpus into memory at a time, which makes it possible to work with really large texts. But things can take a very long time... building an nltk.FreqDist() over the words in the Åtta sidor corpus (2.8 M words) took 1.5 minutes on my computer.
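A frequency count is a good fit for lazy reading, because only the running counts have to stay in memory, not the words themselves. Here is a small sketch of the idea using collections.Counter (which nltk.FreqDist subclasses) over a generator. The function stream_words is a hypothetical stand-in for a lazy corpus view, and lowercasing the words is my own choice for the example.

```python
from collections import Counter

def stream_words(sentences):
    """Yield lowercased words one at a time, so that no full word
    list is ever built in memory."""
    for sent in sentences:
        for word in sent:
            yield word.lower()

# Two sentences from the Talbanken output shown earlier.
sentences = [
    ['Individuell', 'beskattning', 'av', 'arbetsinkomster'],
    ['Genom', 'skattereformen', 'införs', 'individuell', 'beskattning',
     '(', 'särbeskattning', ')', 'av', 'arbetsinkomster', '.'],
]

# Counter consumes the generator item by item.
freq = Counter(stream_words(sentences))
print(freq['beskattning'], freq['genom'])
```

The counting itself is still linear in the number of words, so a corpus of several million words will take a while no matter how little memory is used.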

Future work

So far, the SBCorpusReader has only implemented the following methods of the nltk.CorpusReader interface:

.words() and .tagged_words()
.sents() and .tagged_sents()

More methods will come in the future, such as .tagged_texts() or .parsed_sents().