The need for a basic research infrastructure for language technology is increasingly recognized by the language technology research community and research funding agencies alike. At the core of such an infrastructure we find the so-called BLARK (Basic LAnguage Resource Kit), a set of language resources and language technology tools deemed essential both to fundamental research in language technology and to the development of useful language technology applications for a language. The BLARK, as normally presented in the literature, reflects a modern standard language variety, which is topic- and genre-neutral, thus abstracting away from all kinds of language variation. However, modern linguistics increasingly recognizes variation as a fundamental and essential characteristic of human language. We thus argue that a BLARK could fruitfully be extended along any of the three axes implicit in this characterization (the social, the topical and the temporal). In our case, the extension is along the temporal axis, towards a diachronic BLARK for Swedish, which can be used to develop e-science tools in support of historical studies.
Few other research groups in Sweden -- or anywhere, for that matter -- seem interested in the problem of parsing historical Swedish language varieties, let alone many such varieties in parallel, and in this respect, we will hopefully be able to make a genuine contribution to the field.
A related project is MAÞiR, in which tools for automatic linguistic analysis of Old Swedish are created.
We are currently extending and merging two lexical resources, SALDO and Dalin. Additionally, we have three major dictionaries of Old Swedish (1225--1526): Söderwall (23,000 entries), Söderwall supplement (21,000 entries), and Schlyter (10,000 entries). Due to overlap, the three resources together contain just under 25,000 different entries/lemmas/headwords. We have started work on a morphological component for Old Swedish that covers the regular paradigms, and we have created a smaller lexicon with a couple of thousand entries.
The natural next step after linking up SALDO and Dalin would be to add the Old Swedish lexicon to this growing diachronic Swedish lexical and morphological resource. Hopefully, we will be able to start on this work within a year or two. Integrating the Old Swedish lexicon in the same way as Dalin's dictionary will probably be more difficult, however, since the distance between Old Swedish and the other two forms of the language is fairly great, something like that between modern English and Anglo-Saxon (Old English). This certainly holds for the grammar -- morphology and syntax -- of the language, and even more so for the semantic information encoded in the SALDO lexical resource. It will be a difficult but hopefully rewarding endeavor to work with the lexical semantics of Old Swedish.
Additionally, we are working on a Swedish FrameNet, building in part on the SALDO work and in part on our long experience in corpus linguistics. In this way, we should be able to forge a bridge from the lexical databases which we have already developed, to syntactic analysis systems. The hypothesis is that substantial parts of the frame semantic specifications in the modern Swedish FrameNet will carry over to the lexical items in Dalin's dictionary, using the (semantic) links independently established between SALDO and Dalin, and possibly further to the Old Swedish lexical resources.
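The hypothesis can be illustrated with a minimal sketch. It assumes that frames are stored per modern sense identifier and that each historical entry is linked to zero or more such identifiers. The function name, the identifier shapes and the sample data below are invented for illustration and do not reflect the actual resource formats.

```python
from typing import Dict, List, Set


def project_frames(
    frames_by_sense: Dict[str, Set[str]],
    links: Dict[str, List[str]],
) -> Dict[str, Set[str]]:
    """Carry frames over from modern senses to historical entries.

    frames_by_sense: modern sense id -> frames it evokes (hypothetical layout)
    links: historical entry id -> linked modern sense ids (hypothetical layout)
    """
    projected: Dict[str, Set[str]] = {}
    for old_entry, senses in links.items():
        frames: Set[str] = set()
        for sense in senses:
            # A linked sense without any frame simply contributes nothing.
            frames |= frames_by_sense.get(sense, set())
        if frames:
            projected[old_entry] = frames
    return projected


# Invented sample data, for illustration only.
FRAMES_BY_SENSE = {"äta..1": {"Ingestion"}}
LINKS = {
    "dalin:äta": ["äta..1"],      # linked to a sense that has a frame
    "dalin:okänd": ["saknas..1"],  # linked to a sense with no frame yet
}

PROJECTED = project_frames(FRAMES_BY_SENSE, LINKS)
```

Entries whose linked senses carry no frame are left out of the result, so that gaps remain visible for manual work instead of being filled with guesses.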
The lexical resources that we develop in Språkbanken come bundled with morphological descriptions and the machinery for performing morphological analysis on texts, as well as full-form expansion of lexical entries. In the same way that the analysis and other facilities of SALDO are now available through various web services, the whole diachronic lexical resource will be made available in this way, as each component reaches a mature enough stage.
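As a rough illustration of what full-form expansion and lookup-based morphological analysis involve, the sketch below expands a lemma through a suffix paradigm and inverts the result into an analysis index. The paradigm name, tag labels and table layout are assumptions made for this example, and a modern Swedish noun is used for simplicity; the actual morphological machinery is not shown here.

```python
from typing import Dict, List, Tuple

# Hypothetical paradigm table: name -> (lemma ending to strip, {tag: suffix}).
PARADIGMS: Dict[str, Tuple[str, Dict[str, str]]] = {
    "noun_decl1_a": ("a", {
        "sg indef nom": "a",
        "sg def nom": "an",
        "pl indef nom": "or",
        "pl def nom": "orna",
    }),
}


def expand(lemma: str, paradigm: str) -> Dict[str, str]:
    """Full-form expansion: generate every inflected form of a lemma."""
    strip, suffixes = PARADIGMS[paradigm]
    if not lemma.endswith(strip):
        raise ValueError(f"{lemma!r} does not fit paradigm {paradigm!r}")
    stem = lemma[: len(lemma) - len(strip)]
    return {tag: stem + suffix for tag, suffix in suffixes.items()}


def build_analyzer(lexicon: Dict[str, str]) -> Dict[str, List[Tuple[str, str]]]:
    """Invert the expansion: word form -> list of (lemma, tag) analyses."""
    index: Dict[str, List[Tuple[str, str]]] = {}
    for lemma, paradigm in lexicon.items():
        for tag, form in expand(lemma, paradigm).items():
            index.setdefault(form, []).append((lemma, tag))
    return index


# Toy lexicon: lemma -> paradigm name.
ANALYZER = build_analyzer({"flicka": "noun_decl1_a"})
```

Looking up `ANALYZER["flickorna"]` then yields the lemma `flicka` with the tag `pl def nom`. A lookup table of this kind only covers forms licensed by the lexicon; spelling variation in historical text would need additional handling.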
An important aspect of the BLARK concept is that all resources and tools be interoperable, i.e., common (lossless) data exchange formats and tool APIs are necessary features of a BLARK. Preferably, such formats should adhere to international standards in the field, e.g. those being prepared by ISO TC37/SC4 Language Resources Management, and to best practices, such as those being formulated in the framework of the ESFRI CLARIN initiative. Språkbanken is active in both initiatives.
- Yvonne Adesam, Malin Ahlberg, Peter Andersson, Gerlof Bouma, Markus Forsberg, Mans Hulden. 2014. Computer-aided Morphology Expansion for Old Swedish.
- Gerlof Bouma, Yvonne Adesam. 2013. Experiments on sentence segmentation in Old Swedish editions.
- Yvonne Adesam, Malin Ahlberg, Gerlof Bouma. 2012. Processing spelling variation in historical text.
- Lars Borin, Markus Forsberg, Christer Ahlberger. 2011. Semantic Search in Literature as an e-Humanities Research Tool: CONPLISIT – Consumption Patterns and Life-Style in 19th Century Swedish Literature. NEALT Proceedings Series (NODALIDA 2011 Conference Proceedings). p 58-65. Riga: NEALT.
- Lars Borin, Dana Dannélls, Markus Forsberg, Maria Toporowska Gronostaj, Dimitrios Kokkinakis. 2010. The Past Meets the Present in the Swedish FrameNet++. 14th EURALEX International Congress.
- Lars Borin, Markus Forsberg, Dimitrios Kokkinakis. 2010. Diabase: Towards a diachronic BLARK in support of historical studies. Proceedings of LREC 2010. p 35-42. Valletta: ELRA.
- Lars Borin, Dimitrios Kokkinakis. 2010. Literary onomastics and language technology. Literary education and digital learning. Methods and technologies for humanities studies, ed. by Willie van Peer, Sonia Zyngier and Vander Viana. Hershey, New York. Information Science Reference. p 53-78.
- Lars Borin, Markus Forsberg. 2008. Something old, something new: A computational morphological description of Old Swedish. LREC 2008 Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008). p 9-16. Marrakech: ELRA.
- Lars Borin, Dimitrios Kokkinakis, Leif-Jöran Olsson. 2007. Naming the past: Named entity and animacy recognition in 19th century Swedish literature. ACL 2007 Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007). p 1-8. Prague: ACL.