
Uppsala

The Language Engineering group at the Department of Linguistics in Uppsala has a corpus project focusing on several domains, such as technical manuals, immigrant newspapers and political texts. The project leader is Anna Sågvall Hein.

The Scania Corpus is a collection of truck manuals from Scania. Swedish is always the source language, and the manuals have been translated into seven languages: English, French, German, Spanish, Dutch, Italian and Finnish. The Swedish component, at 300,000 words, is the largest part of the corpus [Sang1995]. The smallest component, Finnish, consists of approximately 200,000 words. The goal is to build a corpus of 2,000,000 words. This corpus is unlikely ever to become publicly available, since the material is "commercial in confidence."

The Swedish Immigrant Newspaper Corpus (Swedish: Invandratidningen) is available in nine different languages: Swedish, Albanian, Arabic, English, Finnish, Persian, Polish, Serbo-Croatian and Spanish. Work on this corpus has only just begun, so no word counts are available yet.

The third collection, consisting of Swedish political texts, is still in the planning stage. It will contain declarations from the Swedish government (regeringsförklaringen).

So far only the Scania Corpus is ready for use. To date none of the texts have been aligned or annotated, though work has begun on a locally developed alignment program. All texts will be stored in SGML.



Daniel Ridings
Sun Mar 31 09:05:43 METDST 1996