The PAROLE corpora will meet the requirements for a comparable corpus as far as is realistically possible for the time being. PAROLE will produce Belgian, Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Irish, Italian, Portuguese, Spanish and Swedish corpora, all of which will use common criteria for the composition and classification of text types. These criteria have been described by Ole Norling-Christensen wp411 and will be the starting point for PEDANT.
The research community has not reached a consensus on text typology. There is general agreement concerning the medium or mode of texts and reasonable agreement on genre; the point of controversy is ``topic.'' On the one hand, there is a desire to classify texts using a closed list of keywords; on the other hand, it is argued that the texts produced in the world can never be classified by a limited set of attributes. The feeling is that anything can be thought about and therefore written about, so attempts to force this reality into a limited set of categories are futile. Few would disagree with this, but old habits die hard: those working with corpora have, through the years, become accustomed to assigning topics to their texts, and they reply that some kind of ``broad topic'' can nevertheless be assigned. We sympathize with the position that there is a real need for objective, scientific means of assigning topics, but we are also aware that such criteria are simply not there yet. Therefore, the idea of a ``broad topic'' will be reflected in our classification.
We classify our texts according to medium and genre. Medium, for example, is described as book, newspaper or periodical, while genre, in our case, consists of such categories as discussion, fiction, information, official and instruction. The full documentation of our text types and classification will be the subject of a future report.
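As a rough illustration of the scheme just described, a classification record could be sketched as follows. This is only a minimal sketch under stated assumptions: the names `Medium`, `Genre`, `TextClassification` and `classify` are hypothetical, the enumerations contain only the example values mentioned above rather than the full list of text types, and representing the ``broad topic'' as an optional free-text field is our assumption, not a documented design decision.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Medium(Enum):
    # Only the example values named in the text; the full list is not given here.
    BOOK = "book"
    NEWSPAPER = "newspaper"
    PERIODICAL = "periodical"


class Genre(Enum):
    # Example genre categories named in the text.
    DISCUSSION = "discussion"
    FICTION = "fiction"
    INFORMATION = "information"
    OFFICIAL = "official"
    INSTRUCTION = "instruction"


@dataclass(frozen=True)
class TextClassification:
    medium: Medium
    genre: Genre
    # Assumption: the broad topic is free text, deliberately not a closed list.
    broad_topic: Optional[str] = None


def classify(medium: str, genre: str,
             broad_topic: Optional[str] = None) -> TextClassification:
    """Build a record from plain labels; unknown labels raise ValueError."""
    return TextClassification(
        medium=Medium(medium.strip().lower()),
        genre=Genre(genre.strip().lower()),
        broad_topic=broad_topic,
    )
```

Here medium and genre are closed sets, reflecting the relative agreement on those dimensions, while the topic is left open because no objective criteria for assigning it exist yet. A call such as `classify("newspaper", "information", "sport")` yields a record with a fixed medium and genre and a free-form topic.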