Master Thesis proposals | Språkbanken Text

The topics for Master Theses in the Language Technology Programme below all target the same problem, namely dealing with Swedish as a second language (other languages can also be relevant, though there is no guarantee that we will be able to help with data). This problem can be addressed using various algorithms or methods. The area of L2 Swedish is currently taking off since new data is being collected and annotated, and we welcome your participation and help with various aspects of that. Shared tasks, crowdsourcing experiments, pseudonymization "on-the-fly" etc. are all on the current agenda.

Contact: Elena Volodina < elena . volodina @ svenska . gu . se >

This page is updated from time to time to add/remove topics, so please check it now and then.

TOPICS - SPRING TERM 2024

Error class-specific grammar checking for Swedish. (Elena Volodina, Arianna Masciolini) Using learner essays from the SweLL corpus as training data, you will develop machine learning models to correct one or more specific types of grammatical errors and compare the results with those obtained with the available general-purpose models. The long-term goal is to see whether several error class-specific models can be effectively combined and whether doing so comes with an advantage over general-purpose models, especially in terms of control and predictability of the results.
Bias and fairness in grammatical error correction. (Elena Volodina, Ricardo Muñoz Sánchez) Deep learning models tend to encode human-like biases, such as sexism and racism. These models are then put into high-stakes situations where they have a real impact on people's lives. One such example is tools for second language learning and evaluation. In this topic we want to check which biases are present in grammatical error correction systems and to what extent they affect the systems. Then we would look into how to reduce the impact of these biases on students. If you would like to read more about the topic or about other similar topics on bias and fairness in NLP, check this page.
Automatic analysis of Swedish vocabulary for word-building patterns. (Elena Volodina, and others from SBX) Given approximately 16,000 lexical items manually analyzed for prefixes, suffixes, roots etc., you will experiment with ways to automatically analyze unseen vocabulary for word-building patterns. Variations (other interpretations of the suggested topic) are possible. Dataset: CoDeRooMor (and more available through the Swedish L2 profile)
Synthetic error datasets for Swedish learner language (Elena Volodina and others at SBX) Based on the available error-annotated learner data for Swedish (SweLL-gold), set up a pipeline for generating synthetic error dataset(s). Datasets: SweLL-gold; Swedish L1 corpora
Shared task on Swedish learner language. (Elena Volodina et al.) Focus is on preparation of data, evaluation algorithms and baselines for a shared task (the topic of the task will be discussed, e.g. error correction, automatic essay scoring, L1 identification; etc.). Data available: 500 SweLL-gold essays manually annotated for a number of various aspects including errors. All essays are linked to numerous personal metadata (e.g. mother tongue, age, education, etc) and text-related metadata (genres, topics, etc).
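
For the synthetic error dataset topic above, one very simple way to start is to inject substitutions into correct text. The sketch below is only an illustration under stated assumptions: the confusion pairs, the function name inject_errors and the probability parameter p are invented here and are not taken from SweLL-gold. In a real pipeline the error types and their frequencies could, for instance, be estimated from the error-annotated learner data.

```python
import random

# Hypothetical confusion pairs (invented for illustration): each correct form
# maps to a plausible learner substitution, here the en/ett article choice.
CONFUSIONS = {"en": "ett", "ett": "en"}

def inject_errors(tokens, rng, p=0.2):
    """Replace confusable tokens with probability p.

    Returns the noisy token list and a list of (index, original, replacement)
    edits, so that the output can serve as labeled training data.
    """
    noisy, edits = [], []
    for i, token in enumerate(tokens):
        if token in CONFUSIONS and rng.random() < p:
            replacement = CONFUSIONS[token]
            noisy.append(replacement)
            edits.append((i, token, replacement))
        else:
            noisy.append(token)
    return noisy, edits

# Example: with p=1.0 every confusable token is replaced.
noisy, edits = inject_errors(["jag", "köpte", "ett", "hus"], random.Random(0), p=1.0)
```

Passing the random generator in explicitly keeps the corruption reproducible, which matters when the generated dataset is shared with others.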

TOPICS - SPRING TERM 2023

1. Participation in the MuClaGED shared task - Multi-Class Grammatical Error Detection (incl. preliminary L2 data for Swedish, English, Italian, German, Czech). Focus on one of the languages or many; datasets can be used across languages. -- Optimally, several people could collaborate, joining topics 1 and 3.
- given data for several languages, solve the tasks:
  - binary classification on a sentence level (correct-incorrect)
  - binary classification on a token level (error-no error)
  - classify error type by one of the five "POLMS" error types (punctuation, orthography, lexical, morphology, syntax)
  - classify by the "ADR" editing operation needed to correct an error (addition, deletion, replacement)
- supplement the available data with synthetic datasets (generate yourself and share) to boost the results
2. Prepare data and baselines for a shared task other than the one in (1). The topic of the task will be discussed (e.g. automatic essay scoring, L1 identification, error correction).
- Data available: 500 SweLL-gold essays manually annotated for a number of various aspects including errors.
- All essays are linked to numerous personal metadata (e.g. mother tongue, age, education, etc) and text-related metadata (genres, topics, etc).
3. Synthetic dataset generation --
- experimenting with several ways to generate error datasets to enhance existing data for L2 Swedish
- using SweLL-gold error-annotated data
4. Digital SFI / Swedish for Ukrainians / Swedish for "Outsiders" - building foundations for automatic generation of learning materials / set-up of online courses (several possible topics):
- course authoring tools based on texts (texts w/out copyright issues are available)
- TTR for texts
- vocabulary exercise generation + answer logging
- grammar exercise generation (see also topic 5) + answer logging
- crossword puzzles
- question generation to texts
- etc.
5. Exercise generation -- based on available texts in Swedish, generate exercises for training Swedish grammar, e.g.
- nouns - forms, plural/singular, agreement with adjectives, etc.
- verbs - tenses, aspects, passives, etc.
- word order - verb placement, adverbial placement, questions, etc.
- word formation - affixation, compounding, etc.
- etc.
6. Automatic analysis of Swedish vocabulary for word-building patterns -- Given approximately 16,000 lexical items manually analyzed for prefixes, suffixes, roots etc., you will experiment with ways to automatically analyze unseen vocabulary
- segment words into constituent morphemes
- label morphemes for their types (prefix, suffix, root, binding morpheme)
- label for word-building patterns.
- variations/other interpretations to the suggested topic are possible.
- dataset: CoDeRooMor (and more available through the Swedish L2 profile)
7. Analysis of second language complexity. Based on the SweLL-pilot and SweLL-gold corpora, as well as the Swedish L2 profiles resource, you will work on the various aspects of linguistic complexity (e.g. lexical, morphological, grammatical), potentially including automatic classification of essays into levels of proficiency.
8. Automatic normalization of Swedish learner essays. Focus is on detection of strings that need to be corrected. 500 manually normalized SweLL-gold essays are available for experiments.
9. Pseudonymization in the age of GDPR. Automatic detection, labeling and pseudonymization of personal information in unstructured texts written by language learners. For Swedish, approx. 600 manually pseudonymized essays are available. For English, rule-based approaches are a start, and crowdsourcing can be employed to collect more pseudonymized data.
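
As an illustration of the rule-based starting point mentioned for pseudonymization, a minimal sketch in Python might look as follows. The two patterns, the placeholder format and the function name pseudonymize are assumptions made for this example and do not describe any existing SweLL tooling. Real learner texts would need much broader coverage (names, places, institutions, ages).

```python
import re

# Hypothetical patterns: each pairs a category label with a regex.
# Order matters: earlier patterns are applied first.
PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("PHONE", re.compile(r"\+?\d[\d\s-]{6,}\d")),
]

def pseudonymize(text):
    """Replace each match with a numbered placeholder such as <EMAIL_1>.

    Returns the rewritten text and a list of (placeholder, original) pairs,
    so that the detection step can be inspected or evaluated separately.
    """
    found = []
    for label, pattern in PATTERNS:
        counter = 0

        def replace(match, label=label):
            nonlocal counter
            counter += 1
            placeholder = f"<{label}_{counter}>"
            found.append((placeholder, match.group(0)))
            return placeholder

        text = pattern.sub(replace, text)
    return text, found

# Example with an invented address and phone number.
out, found = pseudonymize("Mail me at anna.berg@example.com or call 031-786 00 00.")
```

Keeping the (placeholder, original) pairs separate from the rewritten text makes it possible to compare detections against manually pseudonymized essays.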

[presentation slides from 2021]

OLDER TOPICS (but still relevant)

1. Text categorization by topics

Goal: Testing/comparing approaches to text categorization/topic modeling based on coursebook texts labeled for topics.

Background: A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. The main purpose of testing approaches to topic modeling in this project is identification of the best-performing approach that can eventually be used for selection of texts for learners by their topic of preference. These models may eventually be embedded into Lärka, an application developed at Språkbanken for learning Swedish as a second language.

Recently, we have compiled COCTAILL, a corpus of coursebooks for learning Swedish as a second language, where each text is labeled with a topic (or a set of topics). This corpus will form the training/testing data for topic modeling experiments.

Problem description: The aims of this work include the following:

to study literature on topic modeling
to test/compare several of the suggested ways for text categorization/topic modeling for (some of?) the topics present in the COCTAILL corpus (total of 28 topics used at 5 proficiency levels)
to apply the developed algorithms to some real-life texts (e.g. from Korp or from the web) to assess their performance.
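
One possible baseline for the comparison described above is a supervised naive Bayes text classifier (note that this is text categorization with known labels, not unsupervised topic modeling). The sketch below uses only the Python standard library. The input format of (topic, text) pairs and the toy topics are assumptions made for illustration and do not reflect the actual COCTAILL format.

```python
import math
from collections import Counter, defaultdict

def train(labeled_texts):
    """labeled_texts: iterable of (topic, text) pairs (hypothetical toy format)."""
    word_counts = defaultdict(Counter)  # topic -> word frequencies
    doc_counts = Counter()              # topic -> number of texts
    for topic, text in labeled_texts:
        doc_counts[topic] += 1
        word_counts[topic].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary

def classify(text, model):
    """Return the topic with the highest log-probability under naive Bayes."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_topic, best_score = None, -math.inf
    for topic in doc_counts:
        score = math.log(doc_counts[topic] / total_docs)  # class prior
        topic_total = sum(word_counts[topic].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # add-one (Laplace) smoothing
            score += math.log(
                (word_counts[topic][word] + 1) / (topic_total + len(vocabulary))
            )
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic

# Toy example with invented topics and texts.
model = train([
    ("food", "bread cheese milk"),
    ("travel", "train ticket airport"),
    ("food", "soup bread"),
])
predicted = classify("bread and milk", model)
```

A baseline of this kind gives a reference point against which more elaborate approaches can be compared on the same train/test split.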

Recommended skills: Python, (maybe R)
Supervisor(s): Elena Volodina, potentially others from Språkbanken

2. Overcoming semantic challenges in selection of distractors for multiple-choice vocabulary exercises

Goal: Find a way to make sure that distractors in multiple-choice activities are genuine (i.e. cannot be used instead of the correct answer) in the context of a sentence/exercise item. This is primarily aimed at Swedish, but other languages are possible candidates as well.

Background: Multiple-choice items for training vocabulary knowledge are a well-documented exercise format. However, when it comes to the automatic generation of this exercise type, selecting genuinely appropriate distractors becomes a complicated problem. For example, if a learner wants to train vocabulary from the topical domain of “Medical services and SOS”, answer options from the same topical domain might be generated as follows:

Parents couldn't afford to buy the necessary _________.

Choices: pincers, medicine, tablets, blood, hospital, nurse (correct answer: medicine)

More than one alternative from the example above can be used to fill the gap (i.e. pincers, medicine, tablets). It is therefore important to be able to select distractors that cannot replace the correct answer, either semantically or collocationally, in the context of the sentence. In the case above, a better set of choices would be: medicine, blood, hospital, nurse, emergency room
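
One crude way to approximate this filtering is to reject any candidate that has been observed in the same local context in a reference corpus. The sketch below assumes a hypothetical set of observed (word, next word) pairs. The function name select_distractors and the toy data are invented for illustration, and absence from a corpus is of course only weak evidence that a word does not fit.

```python
def select_distractors(left_word, correct, candidates, observed_bigrams):
    """Keep candidates that were never observed directly after `left_word`.

    observed_bigrams: set of (word, next_word) pairs from a reference corpus
    (a hypothetical stand-in for real collocation statistics).
    """
    return [
        c for c in candidates
        if c != correct and (left_word, c) not in observed_bigrams
    ]

# Toy data for the gap in "... to buy the necessary _____."
observed = {
    ("necessary", "medicine"),
    ("necessary", "tablets"),
    ("necessary", "pincers"),
}
candidates = ["pincers", "medicine", "tablets", "blood", "hospital", "nurse"]
distractors = select_distractors("necessary", "medicine", candidates, observed)
# distractors == ["blood", "hospital", "nurse"]
```

A fuller approach would look at more context than the preceding word and would also model semantic substitutability, which is the core challenge of this topic.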

Problem description: The aim of this work is thus:

to study the literature on the topic of distractor selection, lexical semantics and context modeling
implement/test some approach(es) for a semantically aware selection of distractors
embed the selection algorithm into Lärka as a web service
evaluate/test on users (language learners, teachers, linguists, etc.)

Recommended skills: Python, interest in Lexical Semantics

Supervisor(s): Elena Volodina, potentially others from Språkbanken/FLOV

3. Classification of learner essays by achieved proficiency level

Goal: Developing an algorithm (web services) for automatic classification of Swedish learner essays by their reached proficiency level.

Background: The suggested approach is to use machine learning for essay classification. The challenge is to identify features that are both grounded in Second Language Acquisition (SLA) research and informative for the task at hand.

The classification will be made in terms of the levels of proficiency according to the Common European Framework of Reference (CEFR), which covers 6 learner levels: A1 (beginner), A2, B1, B2, C1, C2 (near-native). At the moment we have electronic corpora of essays at levels B1, B2, and C1. Essays at A2 are hand-written and haven't yet been digitized and annotated (which presumably can be done in time for the project, if someone picks this topic).

Problem description: The steps for this project would include:

background reading on the topics of SLA, CEFR, essay grading and learner essay classification by levels. See one example of Swedish essay grading (NOT in terms of levels, but in terms of grades, i.e. (Väl/Icke) Godkänd): http://www.ling.su.se/english/nlp/tools/automated-essay-scoring
testing approaches for the best-performing classification
implementation of web service(s) for learner essay classification
(potentially) implementation of Lärka-based user interface where new essays can be tested
(potentially) evaluation of the results with teachers & new essays

Recommended skills: Python, jQuery, interest in machine learning

Supervisor(s): Elena Volodina, potentially others from Språkbanken/FLOV

4. Developing an adaptive diagnostic vocabulary/grammar test for Swedish

Goal: Implement an adaptive diagnostic test for vocabulary and/or grammar for Swedish, based on Second Language Acquisition (SLA) research and frequency statistics available from the COCTAILL corpus.

Background: The currently developed application Lärka, www.spraakbanken.gu.se/larka, is intended for computer-assisted language learning of L2 Swedish. Lärka generates a number of exercises based on corpora available through Korp. Attempts are being made to align generated exercises with CEFR proficiency scales (http://www.coe.int/t/dg4/linguistic/Source/Framework_en.pdf). The actual users, however, may not know their level when they start working with the exercise generator. It is therefore important (and user-friendly) to offer some sort of placement/diagnostic test for those who may need it.

Some examples of existing diagnostic tests for vocabulary are:

CATSS – Computer-Adaptive Test of Size and Strength http://hcc.haifa.ac.il/~blaufer/
Levels tests (by Tom Cobb) http://www.er.uqam.ca/nobel/r21270/levels/
Levels Test (by Paul Nation) http://www.victoria.ac.nz/lals/about/staff/paul-nation

Problem description: The aims of this work are the following:

to study literature on diagnostic testing for different language skills and competences that are relevant for the CEFR;
to find out about other “actors” dealing with CEFR-based tests for Swedish, especially for placement/diagnosis; as a result, to suggest a format for a placement test for one or (better) a range of language skills and competences mentioned in the CEFR
to implement the suggested test(s) in the form of web services that can be embedded into Lärka platform (+ eventually develop the user interface module for that). Here it would be interesting, for example, to see formats where free answers could be provided and scored
evaluate/test on users (language learners, teachers, linguists, etc.)
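
To make the idea of an adaptive test concrete, the sketch below implements a very simple staircase procedure over the six CEFR levels: the next item is one level harder after a correct answer and one level easier after a wrong one. This is only one of many possible designs. The function name run_adaptive_test, the answer_item callback and the scoring rule are assumptions made for illustration, not a description of Lärka.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

def run_adaptive_test(answer_item, start="B1", n_items=6):
    """Simple staircase procedure over CEFR levels.

    answer_item: callable that takes a level and returns True if the learner
    answered an item at that level correctly (hypothetical interface).
    """
    index = LEVELS.index(start)
    history = []
    for _ in range(n_items):
        level = LEVELS[index]
        correct = answer_item(level)
        history.append((level, correct))
        if correct:
            index = min(index + 1, len(LEVELS) - 1)  # step up, capped at C2
        else:
            index = max(index - 1, 0)                # step down, floored at A1
    # Estimate: the highest level at which an item was answered correctly;
    # falls back to the lowest level if nothing was answered correctly.
    passed = [LEVELS.index(lvl) for lvl, ok in history if ok]
    estimate = LEVELS[max(passed)] if passed else LEVELS[0]
    return estimate, history

# Simulated learner who answers correctly up to and including B2.
estimate, history = run_adaptive_test(lambda level: LEVELS.index(level) <= 3)
```

In a real test, answer_item would present an actual vocabulary or grammar item to the learner, and the estimate would normally be based on more items per level.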

Recommended skills: Python, interest in Lexical Semantics

Supervisor(s): Elena Volodina, potentially others from Språkbanken

5. Collocations for learners of Swedish

Goal: Generate a list of collocations, phrasal verbs, set phrases and idioms important for learners of Swedish, linked to proficiency levels, for use in Lärka. Potentially, develop exercises based on the list of ranked multi-word expressions.

Background: The currently developed application Lärka, www.spraakbanken.gu.se/larka, is intended for computer-assisted language learning of L2 Swedish. Lärka generates a number of exercises based on corpora available through Korp, one of them focusing on vocabulary. It has been mentioned on several occasions that we should include multi-word expressions in our exercise generator. This also complies with the CEFR “can-do” statements at different levels of proficiency (http://www.coe.int/t/dg4/linguistic/Source/Framework_en.pdf). It is, however, a non-trivial task to identify the items that should be included in the curriculum, and even more uncertain how the selected items can be assigned to different proficiency levels.

Problem description: The aims of this work are the following:

to study literature on collocations etc. in general and in the L2 context especially, paying special attention to the CEFR guidelines; to make an overview of the practices for training collocations etc. used in other applications and in (online) dictionaries/lexicons
to generate a list of collocations, (primarily) by automatic analysis of COCTAILL, a corpus of coursebook texts used for teaching Swedish. Studying other materials available outside COCTAILL, e.g. books written by Anna Hallström or multi-word expressions in Saldo and Lexin, may also prove beneficial; the challenge, however, would be to define at which level these items should be introduced. To get some inspiration, have a look at English Vocabulary Profile: http://vocabulary.englishprofile.org/staticfiles/about.html (user: englishprofile, password: vocabulary)
(potentially) to implement one or more of the suggested exercise formats as web services + user interface in Lärka
evaluate/test on users (language learners, teachers, linguists, etc)
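
A common first step for extracting collocation candidates automatically is to rank adjacent word pairs by pointwise mutual information (PMI). The sketch below is a minimal illustration using only the Python standard library. The function name rank_bigrams_by_pmi, the min_count threshold and the toy sentences are assumptions for this example. It does not address the harder question of assigning items to proficiency levels.

```python
import math
from collections import Counter

def rank_bigrams_by_pmi(sentences, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    sentences: list of token lists. Pairs seen fewer than `min_count`
    times are dropped, since PMI tends to overrate rare pairs.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_tokens = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scored.append(((w1, w2), math.log2(p_pair / (p_w1 * p_w2))))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy example with invented sentences.
ranked = rank_bigrams_by_pmi([
    ["hon", "tycker", "om", "kaffe"],
    ["han", "tycker", "om", "te"],
    ["hon", "dricker", "kaffe"],
])
```

On this toy input only the pair ("tycker", "om") passes the frequency threshold. On a real corpus the ranked list would still need manual or statistical filtering before it could be used in exercises.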

Recommended skills: Python, interest in Lexical Semantics and Second Language Acquisition

Supervisor(s): Elena Volodina, potentially others from Språkbanken/FLOV