The Scandinavian Project of Contrastive Corpus Studies

There is an ongoing Scandinavian project involving four partners, in Norway, Finland, Denmark and Sweden. Sweden is represented by the Department of English at Lund University with their project "Text-based Contrastive Studies in English" [Aijmer and Altenberg1995]. Norway is represented by those who were already involved with the English-Norwegian Parallel Corpus (ENPC), which is described in JohHofl94. Finland's involvement is by way of the Finnish-English Contrastive Corpus Studies (FECCS) project at the Department of English, University of Jyväskylä. We have no information about the Danish project. All four corpora will be built according to the same structure. Each corpus consists of two parts: a parallel corpus of original texts together with their translations, and a comparable corpus of original texts in both languages. The corpora are to be used in contrastive studies between the Scandinavian languages and English.

The Swedish-English parallel corpus is planned to comprise 1,600,000 words, made up of samples of 10,000-15,000 words each and covering both translation directions. The corpus will become available as soon as all the copyright restrictions are resolved.

In August 1995 the size of the Norwegian parallel corpus was 1.3 million words of aligned texts. In 1993, when the Norwegian project began, considerable work had already been done on alignment, and almost all methods were based on statistics. ENPC, however, wanted to focus on language-specific information, so one of the Norwegian partners developed the project's alignment algorithm based on "anchor words": words that are likely to be translated in a predictable way and can therefore be used as anchors, or points of reference, to establish a correct alignment [Hofland1995a]. The alignment method has recently been refined by using techniques from the CRATER project [McEnery and Oakes1995]. This involves using cognates in order to align sentences. Cognates are identified by two means: truncating words to the first 6 characters, and Dice's similarity coefficient, which is produced by dividing the number of matching bigrams in two words by the mean of the number of bigrams in the two words. It can be seen as an extension of the method of using anchor words. The method has not been very productive for novels, where cognates account for only 0.5% of the values in the matrix for English-Norwegian pairs, but it is expected to be more useful in non-fiction and technical texts [Hofland1995a, sec. 3].
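
The two cognate tests described above can be illustrated with a minimal Python sketch. It is not the ENPC or CRATER implementation. The function names, the lowercasing, the requirement that both words have at least 6 characters for the truncation test, and the threshold of 0.7 are all assumptions made here for illustration only.

```python
from collections import Counter


def bigrams(word):
    """Return the character bigrams of a word (lowercased; an assumption)."""
    w = word.lower()
    return [w[i:i + 2] for i in range(len(w) - 1)]


def dice(a, b):
    """Dice's coefficient: matching bigrams divided by the mean bigram count."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0  # single-character words have no bigrams
    # Multiset intersection, so a repeated bigram is matched once per occurrence.
    matching = sum((Counter(ba) & Counter(bb)).values())
    return matching / ((len(ba) + len(bb)) / 2)


def same_prefix(a, b, n=6):
    """Truncation test: both words share their first n characters.

    Requiring both words to have at least n characters is an assumption.
    """
    return len(a) >= n and len(b) >= n and a[:n].lower() == b[:n].lower()


def is_cognate(a, b, threshold=0.7, prefix_len=6):
    """Treat a pair as cognate if either test succeeds.

    The threshold value is hypothetical, not taken from the project.
    """
    return same_prefix(a, b, prefix_len) or dice(a, b) >= threshold


if __name__ == "__main__":
    for pair in [("information", "informasjon"), ("telefon", "telephone"),
                 ("night", "nacht")]:
        print(pair, round(dice(*pair), 3), is_cognate(*pair))
```

Dividing by the mean of the two bigram counts is the same as the more common formulation 2m / (n1 + n2), where m is the number of matching bigrams and n1, n2 are the bigram counts of the two words.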

The Finnish corpus consists of approximately 2 million words of parallel texts, which have not yet been aligned. For the alignment of these texts the Finnish partner will be using a locally developed program. So far there has been no work on POS-tagging.

Daniel Ridings
Sun Mar 31 09:05:43 METDST 1996