Background
Current language-centric AI relies on large language models (LLMs) as its most prominent core component. It is, however, widely acknowledged that today’s most successful and widely used LLMs, even if multilingual in their set-up, are in essence English-dominated, monocultural and US-centric, and that they therefore tend to homogenize languages and reinforce dominant cultural norms and values. Since LLMs are increasingly integrated into everyday tools, from chatbots and search engines to educational platforms and policy-making tools, the reproduction and amplification of societal and cultural bias at scale seriously challenges the mission, adopted in many governmental initiatives, of implementing AI in society in a transparent, customized, and responsible way.
Project description
The project aims to optimize current and future LLMs towards more responsible coverage and functionality that better encompass the linguistic and cultural diversity of the Nordic and Baltic regions, and that are thereby much more inclusive of the societies in which they will be used. This will be done by substantially advancing state-of-the-art methods for the assessment and adjustment/alignment of LLMs. We will develop a new approach for diagnosing linguistic and cultural bias in which we treat machine translation (MT) as a proxy for linguistic (and, more indirectly, cultural) proficiency. In this process, we will take advantage of existing expertise and culture-aware benchmark datasets from our regions, and efficiently create such datasets for sister languages that lack these resources.
The languages addressed in the project are Danish, Faroese, Latvian, Norwegian (both Bokmål and Nynorsk), and Swedish, thereby covering both low- and medium-resourced languages of the region. We will take advantage of the fact that these languages and societies share common characteristics and have comparable datasets available, which will make it possible to transfer knowledge and scale up the adaptation of the models beyond prototypes.