
Corpora and Multilinguality

 

As mentioned at the beginning of this report, the criteria and principles for corpus composition are presently in a state of flux. In a sense, the points of divergence have to do with what computers can handle. In the sixties and seventies, when computers became a significant factor in corpus-oriented investigations, it was important to choose and balance the composition of a corpus carefully. One million words was a lot back then. A consequence of this was the production of long and elaborate selection criteria, which have been described in the most pointed way as ``making a virtue out of necessity'' (Sinclair). Today three hundred million words can be handled comfortably, and few would be willing to guess what will be possible in five years. The default attribute for today's corpora is ``large'' [Sinclair1994, 6], and we are moving away from regarding corpora as static collections toward regarding them as a stream of language resources passing through our systems, that is, monitor corpora [Sinclair1994, 11].

There are, then, several factors that are taken into consideration in the design and composition of monolingual corpora. A balanced corpus should be selected and weighted so that it offers a manageable model of the linguistic material to be studied. Text typology and intuition play an important role in this process. [Atkins92, 6-9] provide a sizeable list containing 29 different text attributes. Some of the important extra-linguistic factors they name are Mode, Medium, Style, Genre, Topic, Date, Language and Language-Links. As will become apparent from our presentation below, only Language, Language-Links and Genre seem to receive attention when the focus turns to multilingual corpora. There are even instances when the word corpus is applied to a collection of texts simply because each text ``is translated into one or more other languages than the original'' [Sinclair1994, 11]. This could reflect the difficulty of acquiring relevant texts, both original and translated, that live up to all expected criteria. Difficult though this is, one could wish for a sharper line between what is simply a collection of whatever translated texts happen to be available and a balanced corpus. In the following sections we take no position on the definition of ``corpus'' when we apply the term to the various collections. They may or may not be balanced. For the time being we prefer to refer to PEDANT as a text collection. Initially we are allowing its composition to be guided by practical considerations, such as the needs of our immediate surroundings, that is, the faculty, as long as they fit in with our ultimate goals.

Throughout this paper we will distinguish between two types of multilingual corpora: parallel corpora and comparable corpora. Parallel corpora are made up of original texts and their translations. This allows the texts to be aligned and used in applications such as computer-aided translator training and machine translation systems. The alignment can be done at the paragraph, sentence or phrase level, but sentence-based methods, which we describe more closely in section 3.5, are the most common.
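
To make the idea of sentence alignment concrete, the following is a minimal sketch and not the method described in section 3.5. It assumes that a sentence and its translation tend to have similar character lengths, and it uses dynamic programming to choose among a few pairing types (1-1, 1-0, 0-1, 2-1, 1-2). The function name align, the BEADS table and its penalty values are hypothetical choices made for illustration only; published length-based methods use a statistical cost rather than the raw length difference used here.

```python
from typing import List, Tuple

# Pairing types as (source sentences consumed, target sentences consumed).
# The penalties are hypothetical: they only encode the assumption that
# 1-1 pairings are the most common and insertions/deletions the rarest.
BEADS = {(1, 1): 0, (2, 1): 10, (1, 2): 10, (1, 0): 50, (0, 1): 50}


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Pair up sentences of two texts by comparing character lengths."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    # cost[i][j] = cheapest way found to align src[:i] with tgt[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_s = sum(len(s) for s in src[i:ni])
                len_t = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + abs(len_s - len_t) + penalty
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)

    # Follow the back-pointers from the end of both texts to the start.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs


english = ["Good morning.", "How are you today?"]
swedish = ["God morgon.", "Hur mår du idag?"]
for source_side, target_side in align(english, swedish):
    print(source_side, "<->", target_side)
```

Length alone is a crude signal, so a sketch like this is best read as an illustration of what "aligning on a sentence basis" means in practice: the output is a sequence of small groups of source sentences, each paired with the group of target sentences taken to be its translation.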

A comparable corpus includes texts in two or more languages that are not translations of each other but are comparable with respect to selection criteria such as size, domain, genre and, where possible, topic.

The corpora described below display various aspects of such multilingual corpora.



Daniel Ridings
Sun Mar 31 09:05:43 METDST 1996