Diabase -- Towards a diachronic BLARK

The need for a basic research infrastructure for language technology is increasingly recognized by the language technology research community and research funding agencies alike. At the core of such an infrastructure we find the so-called BLARK -- Basic LAnguage Resource Kit, a set of language resources and language technology tools deemed essential both to fundamental research in language technology and to the development of useful language technology applications for a language. The BLARK, as normally presented in the literature, reflects a modern standard language variety, which is topic- and genre-neutral, thus abstracting away from all kinds of language variation. However, modern linguistics increasingly recognizes variation as a fundamental and essential characteristic of human language. We thus argue that a BLARK could fruitfully be extended along any of the three axes implicit in this characterization (the social, the topical and the temporal). In our case, it would be extended along the temporal axis, towards a diachronic BLARK for Swedish, which can be used to develop e-science tools in support of historical studies.

Few other research groups in Sweden -- or anywhere, for that matter -- seem interested in the problem of parsing historical language varieties, let alone many such varieties in parallel, and in this respect, we will hopefully be able to make a genuine contribution to the field.

Resources

We are currently extending and merging two lexical resources, SALDO and Dalin. Additionally, we have three major dictionaries of Old Swedish (1225--1526): Söderwall (23,000 entries), Söderwall supplement (21,000 entries), and Schlyter (10,000 entries). Because of overlap, the three resources together contain just under 25,000 distinct headwords (lemmas). We have started work on a morphological component for Old Swedish that covers the regular paradigms, and we have created a smaller lexicon with a couple of thousand entries.

The natural next step after linking SALDO and Dalin would be to add the Old Swedish lexicon to this growing diachronic Swedish lexical and morphological resource. We hope to start on this work within a year or two. Integrating the Old Swedish lexicon in the same way as Dalin's dictionary will probably be more difficult, however, since the distance between Old Swedish and the other two stages of the language is considerable, roughly comparable to that between modern English and Anglo-Saxon (Old English). This certainly holds for the grammar -- morphology and syntax -- of the language, and even more so for the semantic information encoded in the SALDO lexical resource. Working with the lexical semantics of Old Swedish will be difficult but, we hope, rewarding.

Additionally, we are working on a Swedish FrameNet, building in part on the SALDO work and in part on our long experience in corpus linguistics. In this way, we should be able to forge a bridge from the lexical databases which we have already developed, to syntactic analysis systems. The hypothesis is that substantial parts of the frame semantic specifications in the modern Swedish FrameNet will carry over to the lexical items in Dalin's dictionary, using the (semantic) links independently established between SALDO and Dalin, and possibly further to the Old Swedish lexical resources.
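The carry-over hypothesis can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the sense identifiers, the "dalin:" entry names, the frame assignments and the function name are invented here and do not reflect the actual data or tools of the project. The sketch only shows the idea of copying frame labels from modern senses to linked historical entries.

```python
# Hypothetical data: frame labels attached to modern Swedish senses.
# Identifiers and frame assignments are illustrative only.
MODERN_FRAMES = {
    "köpa..1": ["Commerce_buy"],
    "sälja..1": ["Commerce_sell"],
}

# Hypothetical semantic links (modern sense, historical entry).
LINKS = [
    ("köpa..1", "dalin:köpa"),
    ("sälja..1", "dalin:sälja"),
    ("springa..1", "dalin:springa"),  # linked, but no frame known yet
]


def project_frames(frames_by_sense, links):
    """Copy frame labels from each modern sense to its linked entry."""
    projected = {}
    for source, target in links:
        for frame in frames_by_sense.get(source, []):
            labels = projected.setdefault(target, [])
            if frame not in labels:
                labels.append(frame)
    return projected


if __name__ == "__main__":
    for entry, frames in project_frames(MODERN_FRAMES, LINKS).items():
        print(entry, frames)
```

Entries whose modern counterpart has no frame yet are simply left out of the result, so the projected labels are best seen as candidates for manual review rather than finished annotations.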

Availability

The lexical resources that we develop in Språkbanken come bundled with morphological descriptions and the machinery for performing morphological analysis on texts, as well as full-form expansion of lexical entries. Just as the analysis and other facilities of SALDO are now available through various web services, the whole diachronic lexical resource will be made available in the same way, as each component reaches a sufficiently mature stage.
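As a rough illustration of what full-form expansion and lookup-based morphological analysis involve, here is a small sketch. It is not the SALDO machinery or its web service interface: the paradigm name, the tag strings and the function names are hypothetical, and the single paradigm covers only the nominative forms of modern Swedish nouns of the "flicka" type.

```python
from collections import defaultdict

# Hypothetical toy paradigm; name and tags are illustrative only.
PARADIGMS = {
    "nn_a_or": {
        "strip": "a",  # ending removed from the lemma to obtain the stem
        "suffixes": {
            "sg indef": "a",
            "sg def": "an",
            "pl indef": "or",
            "pl def": "orna",
        },
    },
}

# Toy lexicon: lemma -> paradigm name.
LEXICON = {"flicka": "nn_a_or", "gata": "nn_a_or"}


def expand(lemma, paradigm_name):
    """Full-form expansion: generate every inflected form of a lemma."""
    paradigm = PARADIGMS[paradigm_name]
    strip = paradigm["strip"]
    if not lemma.endswith(strip):
        raise ValueError(f"{lemma!r} does not fit paradigm {paradigm_name!r}")
    stem = lemma[: len(lemma) - len(strip)]
    return {tag: stem + suffix for tag, suffix in paradigm["suffixes"].items()}


def build_index(lexicon):
    """Invert the expansion so that word forms map to (lemma, tag) pairs."""
    index = defaultdict(list)
    for lemma, paradigm_name in lexicon.items():
        for tag, form in expand(lemma, paradigm_name).items():
            index[form].append((lemma, tag))
    return dict(index)


def analyze(index, token):
    """Morphological analysis by lookup; unknown tokens get no analyses."""
    return index.get(token.lower(), [])


if __name__ == "__main__":
    index = build_index(LEXICON)
    print(expand("flicka", "nn_a_or"))
    print(analyze(index, "Gatan"))
```

Generating all forms once and analysing by lookup keeps the analysis step trivial. A diachronic resource would need further paradigms and a way to handle spelling variation, neither of which this sketch attempts.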

An important aspect of the BLARK concept is that all resources and tools be interoperable, i.e., common (lossless) data exchange formats and tool APIs are necessary features of a BLARK. Preferably, such formats should adhere to international standards in the field, e.g. those being prepared by ISO TC37/SC4 Language Resources Management, and to best practices, such as those being formulated in the framework of the ESFRI CLARIN initiative. Språkbanken is active in both initiatives.

Publications

© Göteborgs universitet 2009, Box 100, 405 30 Göteborg