Alumni | Språkbanken Text

Here is a list of our past PhD students, with abstracts and links to their theses.

Morger, Felix. In the minds of stochastic parrots: Benchmarking, evaluating and interpreting large language models. 2024.

Abstract
The arrival of large language models (LLMs) in recent years has changed the landscape of natural language processing (NLP). Their impressive performance on popular benchmarks, ability to solve a range of different tasks and their human-like linguistic interactional abilities, have prompted a debate into whether these are just "stochastic parrots" who are cleverly repeating what humans say without understanding its meaning or whether they are acquiring essential language capabilities, which would be an important stepping stone towards artificial general intelligence. To tackle this question, developing analysis methods to measure and understand the language capabilities of LLMs has become a defining challenge. These include developing benchmarks to reliably measure their performance as well and interpretability methods to gauge their inner-workings. This is especially relevant at a time when these models already are having a considerable impact on our society. An increasing amount users are affected by the technology and calls are made for transparent, regulated and thorough evaluation of AI. In these efforts, it is important to estimate the possibilities and limitations of these analysis methods since they will play an important role in holding technologies in AI accountable. In this compilation thesis, I expound on the components and processes involved in analyzing LLMs. The articles included in this compilation thesis use different approaches for analyzing LLMs, from introducing a multi-task benchmark Superlim for Swedish NLU to investigating LLMs' ability to predict language variation. To this effort I explore what the possibilities and limitations are of popular analysis methods and what implications these have for developing LLMs. I argue that integrating explanatory approaches from empirical linguistic research is important to understand the role of both the data and the linguistic features used when analyzing LLMs. Doing so does not only help guide the development of LLMs, but also bring insights into linguistics.

Alfter, David. Exploring natural language processing for single-word and multi-word lexical complexity from a second language learner perspective. 2021.

Abstract
In this thesis, we investigate how natural language processing (NLP) tools and techniques can be applied to vocabulary aimed at second language learners of Swedish in order to classify vocabulary items into different proficiency levels suitable for learners of different levels. In the first part, we use feature-engineering to represent words as vectors and feed these vectors into machine learning algorithms in order to (1) learn CEFR labels from the input data and (2) predict the CEFR level of unseen words. Our experiments corroborate the finding that feature-based classification models using 'traditional' machine learning still outperform deep learning architectures in the task of deciding how complex a word is. In the second part, we use crowdsourcing as a technique to generate ranked lists of multi-word expressions using both experts and non-experts (i.e. language learners). Our experiment shows that non-expert and expert rankings are highly correlated, suggesting that non-expert intuition can be seen as on-par with expert knowledge, at least in the chosen experimental configuration. The main practical output of this research comes in two forms: prototypes and resources. We have implemented various prototype applications for (1) the automatic prediction of words based on the feature-engineering machine learning method, (2) language learning applications using graded word lists, and (3) an annotation tool for the manual annotation of expressions across a variety of linguistic factors.

Nieto Piña, Luis. Splitting rocks: Learning word sense representations from corpora and lexica. 2019.

Abstract
The representation of written language semantics is a central problem of language technology and a crucial component of many natural language processing applications, from part-of-speech tagging to text summarization. These representations of linguistic units, such as words or sentences, allow computer applications that work with language to process and manipulate the meaning of text. In particular, a family of models has been successfully developed based on automatically learning semantics from large collections of text and embedding them into a vector space, where semantic or lexical similarity is a function of geometric distance. Co-occurrence information of words in context is the main source of data used to learn these representations. Such models have typically been applied to learning representations for word forms, which have been widely applied, and proven to be highly successful, as characterizations of semantics at the word level. However, a word-level approach to meaning representation implies that the different meanings, or senses, of any polysemic word share one single representation. This might be problematic when individual word senses are of interest and explicit access to their specific representations is required. For instance, in cases such as an application that needs to deal with word senses rather than word forms, or when a digital lexicon's sense inventory has to be mapped to a set of learned semantic representations. In this thesis, we present a number of models that try to tackle this problem by automatically learning representations for word senses instead of for words. In particular, we try to achieve this by using two separate sources of information: corpora and lexica for the Swedish language. Throughout the five publications compiled in this thesis, we demonstrate that it is possible to generate word sense representations from these sources of data individually and in conjunction, and we observe that combining them yields superior results in terms of accuracy and sense inventory coverage. Furthermore, in our evaluation of the different representational models proposed here, we showcase the applicability of word sense representations both to downstream natural language processing applications and to the development of existing linguistic resources.

Pilán, Ildikó. Automatic proficiency level prediction for Intelligent Computer-Assisted Language Learning. 2018.

Abstract
With the ever-growing presence of electronic devices in our everyday lives, it is compelling to investigate how technology can contribute to make our language learning process more efficient and enjoyable. A fundamental piece in this puzzle is the ability to measure the complexity of the language that learners are able to deal with and produce at different stages of their progress. In this thesis work, we explore automatic approaches for modeling linguistic complexity at different levels of learning Swedish as a second and foreign language (L2). For these purposes, we employ natural language processing techniques to extract linguistic features and combine them with machine learning methods. We study linguistic complexity in two types of L2 texts: those written by experts for learners and those produced by learners themselves. Moreover, we investigate this type of data-driven analysis for the smaller unit of sentences. Automatic proficiency level prediction has a number of application potentials for the field of Intelligent Computer-Assisted Language Learning, out of which we investigate two directions. Firstly, this can facilitate locating learning materials suitable for L2 learners from corpora, which are valuable and easily accessible examples of authentic language use. We propose a framework for selecting sentences suitable as exercise items which, besides linguistic complexity, encompasses a number of additional criteria such as well-formedness and independence from a larger textual context. An empirical evaluation of the system implemented using these criteria indicated its usefulness in an L2 instructional setting. Secondly, linguistic complexity analysis enables the automatic evaluation of L2 texts which, besides being helpful for preparing learning materials, can also be employed for assessing learners' writing. We show that models trained partly or entirely on reading texts can effectively predict the proficiency level of learner essays, especially if some learner errors are automatically corrected in a pre-processing step. Both the sentence selection and the L2 text evaluation systems have been made freely available on an online learning platform.

Ribeck, Judy. Steg för steg. Naturvetenskapligt ämnesspråk som räknas. 2015.

Abstract
In this work, I present a linguistic investigation of the language of Swedish textbooks in the natural sciences, i.e., biology, physics and chemistry. The textbooks, which are used in secondary and upper secondary school, are examined with respect to traditional readability measures, e.g., LIX, OVIX and nominal ratio. I also extract typical linguistic features of the texts, typicality being determined using a proposed quantitative method, labelled the index principle. This empirical, corpus-based method relies on automatic linguistic annotations produced by language technology tools to calculate what I call index lists, rank-ordered lists of characteristic linguistic features of specific text corpora as compared to reference texts. I produce index lists for typical vocabulary, noun phrase structures and syntactic structures, extracted from a 5.2 million word textbook corpus, compiled as a part of the work presented. As well as being frequent and well dispersed, the linguistic variables selected for the index lists are also characteristic of the text type in question, as is evident when they are compared to a reference corpus, comprising textbooks in the social sciences and mathematics, as well as narrative and academic (university-level) texts. The results show that textbooks in natural science contain a lot of content-specific, technical vocabulary. This characteristic not only distinguishes natural scientific language from everyday language, but also from social scientific language, which on the lexical level has more in common with narrative texts. On the other hand, the textbook language as a whole is structurally distinguishable from narrative texts, as clearly seen, e.g., in its noun phrase complexity. In the transition between secondary and upper secondary school, the scores of almost every readability measure go up, indicating an increase in linguistic demands on the readers. In the upper secondary textbooks the words are longer, the vocabulary more varied, the noun phrases longer and more elaborate, and the most typical syntactic structures more complex. Notably, the linguistic development between the form levels is more marked in the natural-science textbooks, compared to social sciences and mathematics. Nevertheless, the textbook language overall shows a relatively low complexity in comparison to academic language.

Rama, Taraka. Vocabulary lists in computational historical linguistics. 2015.

Abstract
Computational analysis of historical and typological data has made great progress in the last fifteen years. In this thesis, I work with vocabulary lists for addressing some classical problems in historical linguistics such as cognate identification, discriminating related languages from unrelated languages, assigning possible dates to splits in a language family, and providing an internal structure to a language family. I compare the internal structure inferred from vocabulary lists with the family trees given in Ethnologue. I explore the ranking of lexical items in the widely used Swadesh word list and compare my ranking to another quantitative reranking method and short word lists composed for discovering long-distance genetic relationships. I show that the choice of string similarity measures is important for internal classification and for discriminating related from unrelated languages. The dating system presented in this thesis can be used for assigning age estimates to any new language group and overcomes the assumption of a constant rate of lexical replacement assumed by glottochronology. I train and test a linear classifier based on gap-weighted subsequence features for the purpose of cognate identification. An important conclusion from these results is that n-gram approaches can be used for different historical linguistic purposes.

Eklund, Ann-Marie. The Game of Health Search. 2014.

Abstract
Almost two of three Swedes use internet to search for health related information on diseases, treatments and care givers. Mobile devices such as smartphones and tablets are increasingly used to carry out these activities, and it raises the question on how a health information portal should behave to support the needs of today’s and tomorrow’s information seekers. In this thesis we present an analysis of the use of the official health information portals 1177.se and vardguiden.se with a focus on describing the relations between seekers and portals, as expressed by the language of queries and answers. Of special interest is the role of the language as a means to establish and maintain the seekers’ trust in a portal as a complement to doctor’s visits and calls. We present a number of principles of behaviour to which we believe a portal should adhere to be trustworthy in the eyes of the seekers. We also introduce a conceptual framework with a basis in game-theoretic models of rational behaviour, and the use of lingustic error analysis and stylistics, to provide a setting for analysis of information search.

Heimann Mühlenbock, Katarina. I see what you mean. 2013.

Abstract
This thesis aims to identify linguistic factors that affect readability and text comprehension, viewed as a function of text complexity. Features at various linguistic levels suggested in existing literature are evaluated, including the Swedish readability formula LIX. Natural language processing methods and resources are employed to investigate characteristics that go beyond traditional superficial measures. A comparable corpus of eay-to-read and ordinary texts from three genres is investigated, and it is shown how features present at various levels of representation differ quantitatively across text types and genres. The findings are confirmed in significance tests as well as principal component analysis. Three machine learning algorithms are employed and evaluated in order to build a statistical model for text classification. The results demonstrate that a proposed language model for Swedish (SVIT), utilizing a combination of linguistic features, actually predicts text complexity and genre with a higher accuracy than LIX. It is suggested that the SVIT language model should be adopted to assess surface language properties, vocabulary load, sentence structure, idea density levels as well as the personal interests of different texts. Specific target groups of readers may then be provided with materials tailored to their level of proficiency.

Dannélls, Dana. Multilingual text generation from structured formal representations. 2013.

Abstract
This thesis aims to identify the optimal ways in which natural language generation techniques can be brought to bear upon the problem of processing a structured body of information in order to devise a coherent presentation of text content in multiple languages. We investigate how chains of referential expressions are realized in English, Swedish and Hebrew, and suggest several coreference strategies that can be used to generate coherent descriptions about paintings. The suggested strategies focus on the need to produce paragraphsized written natural language descriptions from formal structured representations presented in the Semantic Web. We account for principles of coreference by introducing a new modularized approach to automatically generate chains of referential expressions from ontologies. We demonstrate the feasibility of the approach by implementing a system where a Semantic Web domain ontology serves as the background knowledge representation and where the language-specific coreference strategies are incorporated. The system uses both the principles of discourse structures and coreference strategies to guide the generation process. We show how the system successfully generates coherent, well-formed descriptions in multiple languages.

Friberg Heppin, Karin. Resolving Power of Search Keys in MedEval - A Swedish Medical Test Collection with User Groups: Doctors and Patients. 2010.

Abstract
This thesis describes the making of a Swedish medical text collection, unique in its kind in providing a possibility to choose user group: doctors or patients. The thesis also describes a series of pilot studies which demonstrate what kind of studies can be performed with such a collection. The pilot studies are focused on search key effectivity: What makes a search key good, and what makes a search key bad? The need to bring linguistics and consideration of terminology into the information retrieval research field is demonstrated. Most information retrieval is about finding free text documents. Documents are built of terms, as are topics and search queries. It is important to understand the functions and features of these terms and not treat them like featureless objects. The thesis concludes that terms are not equal, but show very different behavior. The thesis addresses the problem of compounds, which, if used as search keys, will not match corresponding simplex words in the documents, while simplex words as search keys will not match corresponding compounds in the documents. The thesis discusses how compounds can be split to obtain more matches, without lowering the quality of a search. Another important aspect of the thesis is that it considers how different language registers, in this case those of doctors and patients, can be utilized to find documents written with one of the groups in mind. As the test collection contains a large set of documents marked for intended target group, doctors or patients, the language differences can be and are studied. The author comes up with suggestions of how to choose search keys if documents from one category or the other are desired. Information retrieval is a multi-disciplinary research field. It involves computer science, information science, and natural language processing. There is a substantial amount of research behind the algorithms of modern search engines, but even with the best possible search algorithm the result of a search will not be successful without an effective query constructed with effective search keys.

Olsson, Fredrik. Bootstrapping Named Entity Annotation by Means of Active Machine Learning: A method for creating corpora. 2008.

Abstract
This thesis describes the development and in-depth empirical investigation of a method, called BootMark, for bootstrapping the marking up of named entities in textual documents. The reason for working with documents, as opposed to for instance sentences or phrases, is that the BootMark method is concerned with the creation of corpora. The claim made in the thesis is that BootMark requires a human annotator to manually annotate fewer documents in order to produce a named entity recognizer with a given performance, than would be needed if the documents forming the basis for the recognizer were randomly drawn from the same corpus. The intention is then to use the created named entity recognizer as a pre-tagger and thus eventually turn the manual annotation process into one in which the annotator reviews system-suggested annotations rather than creating new ones from scratch. The BootMark method consists of three phases: (1) Manual annotation of a set of documents; (2) Bootstrapping – active machine learning for the purpose of selecting which document to annotate next; (3) The remaining unannotated documents of the original corpus are marked up using pre-tagging with revision.
Five emerging issues are identified, described and empirically investigated in the thesis. Their common denominator is that they all depend on the realization of the named entity recognition task, and as such, require the context of a practical setting in order to be properly addressed. The emerging issues are related to: (1) the characteristics of the named entity recognition task and the base learners used in conjunction with it; (2) the constitution of the set of documents annotated by the human annotator in phase one in order to start the bootstrapping process; (3) the active selection of the documents to annotate in phase two; (4) the monitoring and termination of the active learning carried out in phase two, including a new intrinsic stopping criterion for committee-based active learning; and (5) the applicability of the named entity recognizer created during phase two as a pre-tagger in phase three. The outcomes of the empirical investigations concerning the emerging issues support the claim made in the thesis. The results also suggest that while the recognizer produced in phases one and two is as useful for pre-tagging as a recognizer created from randomly selected documents, the applicability of the recognizer as a pre-tagger is best investigated by conducting a user study involving real annotators working on a real named entity recognition task.

Øvrelid, Lilja. Argument Differentiation: Soft constraints and data-driven models. 2008.

Abstract
The ability to distinguish between different types of arguments is central to syntactic analysis, whether studied from a theoretical or computational point of view. This thesis investigates the inﬂuence and interaction of linguistic properties of syntactic arguments in argument differentiation. Cross-linguistic generalizations regarding these properties often express probabilistic, or soft, constraints, rather than absolute requirements on syntactic structure. In language data, we observe frequency effects in the realization of syntactic arguments. We propose that argument differentiation can be studied using data-driven methods which directly express the relationship between frequency distributions in language data and linguistic categories. The main focus in this thesis is on the formulation and empirical evaluation of linguistically motivated features for data-driven modeling. Based on differential properties of syntactic arguments in Scandinavian language data, we investigate the linguistic factors involved in argument differentiation from two different perspectives. We study automatic acquisition of the lexical semantic category of animacy and show that statistical tendencies in argument differentiation supports automatic classiﬁcation of unseen nouns. The classiﬁcation is furthermore robust, generalizable across machine learning algorithms, as well as scalable to larger data sets. We go on to perform a detailed study of the inﬂuence of a range of different linguistic properties, such as animacy, deﬁniteness and ﬁniteness, on argument disambiguation in data-driven dependency parsing of Swedish. By including features capturing these properties in the representations used by the parser, we are able to improve accuracy signiﬁcantly, and in particular for the analysis of syntactic arguments. The thesis shows how the study of soft constraints and gradience in language can be carried out using data-driven models and argues that these provide a controlled setting where different factors may be evaluated and their inﬂuence quantiﬁed. By focusing on empirical evaluation, we come to a better understanding of the results and implications of the datadriven models and furthermore show how linguistic motivation in turn can lead to improved computational models.