Tokenizes text, custom-made for Swedish
The tokenizer was originally based on NLTK's PunktWordTokenizer (which is no longer exposed by NLTK). Sparv's version is custom-made for Swedish, using a wordlist and a configuration file containing regular expressions, a list of common abbreviations, a list of words with special characters, and so on. It is, however, possible to configure the tokenizer for other languages.
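To illustrate the general idea, here is a minimal sketch of a regex-based word tokenizer that consults an abbreviation list and a list of words with special characters, in the spirit of the configuration described above. This is not Sparv's actual implementation; the abbreviation list, the special-word list, and the token pattern are all hypothetical examples.

```python
import re

# Hypothetical examples of common Swedish abbreviations that should
# keep their trailing period as part of the token.
ABBREVIATIONS = {"t.ex.", "bl.a.", "dvs.", "etc."}

# Hypothetical examples of words containing special characters
# (here hyphens) that should stay as single tokens.
SPECIAL_WORDS = {"i-land", "u-båt"}

# Match a word (optionally with internal periods, hyphens, or
# apostrophes and an optional trailing period), or any single
# non-word, non-space character.
TOKEN_RE = re.compile(r"\w+(?:[.\-']\w+)*\.?|[^\w\s]", re.UNICODE)

def tokenize(text):
    """Split text into tokens, keeping listed abbreviations and
    special-character words intact."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tok = match.group()
        if tok in ABBREVIATIONS or tok in SPECIAL_WORDS:
            tokens.append(tok)
        elif len(tok) > 1 and tok.endswith("."):
            # Split off a sentence-final period from ordinary words.
            tokens.append(tok[:-1])
            tokens.append(".")
        else:
            tokens.append(tok)
    return tokens
```

For example, `tokenize("Se t.ex. u-båt.")` keeps the abbreviation `t.ex.` and the hyphenated word `u-båt` whole while separating the sentence-final period. Adapting such a tokenizer to another language would mainly mean swapping in that language's abbreviation and special-word lists.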