A brief history | Språkbanken Text

A proper account of the history of Språkbanken Text must start with the groundbreaking work of Sture Allén (1928–2022), who pioneered corpus linguistics for Swedish in Sweden. His 1965 PhD thesis appeared in two parts. In one, he described the computer-supported method that he had used – after first learning to program it himself in machine code – to investigate a text corpus of 17th-century letters. The other was a scientific edition of these letters.

After defending his thesis, Allén initiated a project to prepare the way for corpus-based lexicography for Swedish. The most immediate result of this project was a one-million-word corpus of Swedish news text, which provided the raw material for a series of Swedish dictionaries.

As professor and scientific leader of the Computational Linguistics Unit established in 1972 at the University of Gothenburg, Allén took the initiative for an undergraduate program in computational linguistics, which started at the university in 1984. However, his own main focus remained the development of corpora and corpus tools in support of Swedish lexicography, and he initiated a systematic effort to build a computational research infrastructure which could further this aim.

Språkbanken was envisioned in an op-ed piece that Allén wrote for the Swedish daily Dagens Nyheter in September 1970. In 1973, the Computational Linguistics Unit submitted a formal proposal to the Ministry of Education, requesting earmarked funding for what was to become Språkbanken. Two years later, in 1975, this research infrastructure became a reality when the Logotheque (as it was initially called) was established with national funding.

The focus of Språkbanken shifted noticeably around the turn of the millennium, when for various reasons the lexicographical and the language technology activities parted ways organizationally. The former were pursued in the Center for Lexicography and Lexicology, established in that connection, while Språkbanken widened its language technology activities well beyond lexicographical considerations.

Since then, Språkbanken Text has grown into a nationally and internationally recognized R&D unit for Swedish language technology and language resources. It coordinated the Swedish activities in the European CLARIN ERIC research infrastructure 2014–2024, and is the coordinating node of the national research infrastructure Språkbanken, of which it is one of four nationally distributed divisions. The other three are the speech technology division (Språkbanken Tal) at the Royal Institute of Technology (KTH) in Stockholm; the cultural heritage and language policy division (Språkbanken Sam) at the Institute of Language and Folklore in Uppsala, Stockholm, and Gothenburg; and the division coordinating the Swedish CLARIN activities (Språkbanken CLARIN) at Uppsala University.

As a research infrastructure, Språkbanken is unusual in that many of the results of the research it supports contribute, to a considerable extent, to the further development of the infrastructure itself. Språkbanken supports research in language technology (text, speech, and sign) with an infrastructure which is itself built on language technology, much like the mythological Ouroboros snake of antiquity.

Ouroboros, a snake that devours its own tail.

Lars Borin