
Master Thesis proposals

The Master's thesis topics in the Language Technology Programme listed below all target the same problem: dealing with Swedish as a second language (other languages can also be relevant, though there is no guarantee that we will be able to help with data). The problem can be addressed using various algorithms or methods. The area of L2 Swedish is currently taking off since new data is being collected and annotated, and we welcome your participation and help with various aspects of that work. Shared tasks, crowdsourcing experiments, "on-the-fly" pseudonymization, etc. are all on the current agenda.

This page is updated from time to time to add/remove topics, so please check it now and then.

 

1. Text categorization by topics

Goal

Testing/comparing approaches to text categorization/topic modeling based on coursebook texts labeled for topics.

Background

A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. The main purpose of testing approaches to topic modeling in this project is identification of the best-performing approach that can eventually be used for selection of texts for learners by their topic of preference. These models may eventually be embedded into Lärka, an application developed at Språkbanken for learning Swedish as a second language.

Recently, we have compiled COCTAILL, a corpus of coursebooks for learning Swedish as a second language, where each text is labeled with a topic (or a set of topics). This corpus will form the training/testing data for topic modeling experiments.


Problem description

The aims of this work include the following:

  • to study literature on topic modeling
  • to test/compare several of the suggested approaches to text categorization/topic modeling for some or all of the topics present in the COCTAILL corpus (28 topics in total, used at 5 proficiency levels)
  • to apply the developed algorithms to some real-life texts (e.g. from Korp or from the web) to assess their performance.
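
As a starting point for the comparison described above, one simple baseline is a multinomial Naive Bayes classifier trained on topic-labeled texts. The sketch below is only an illustration under stated assumptions: the training sentences and the topic labels "health" and "food" are invented, not taken from COCTAILL, and whitespace tokenization stands in for proper lemmatization.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a real system would lemmatize."""
    return text.lower().split()


def train_naive_bayes(labeled_texts):
    """Collect per-topic word counts, per-topic document counts and the vocabulary."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocabulary = set()
    for text, topic in labeled_texts:
        tokens = tokenize(text)
        word_counts[topic].update(tokens)
        doc_counts[topic] += 1
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary


def predict_topic(text, model):
    """Return the topic with the highest log-probability (add-one smoothing)."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_topic, best_score = None, float("-inf")
    for topic in doc_counts:
        score = math.log(doc_counts[topic] / total_docs)
        topic_total = sum(word_counts[topic].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[topic][token]
            score += math.log((count + 1) / (topic_total + len(vocabulary)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


# Invented toy data (assumption): not taken from COCTAILL.
TRAINING_DATA = [
    ("doktorn gav mig medicin på sjukhuset", "health"),
    ("sjuksköterskan tog blodprov", "health"),
    ("vi åt middag på en restaurang", "food"),
    ("jag lagar mat och bakar bröd", "food"),
]

model = train_naive_bayes(TRAINING_DATA)
predicted = predict_topic("hon arbetar på sjukhuset", model)
print(predicted)
```

A baseline of this kind mainly gives the more elaborate topic-modeling approaches something to be compared against; texts labeled with a set of topics would need a multi-label extension.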


Recommended skills:

  • Python, (maybe R)


Supervisor(s)

  • Elena Volodina
  • potentially others from Språkbanken

 

2. Overcoming semantic challenges in selection of distractors for multiple-choice vocabulary exercises

Goal

Find a way to make sure that distractors in multiple-choice activities are genuine (i.e. cannot be used instead of the correct answer) in the context of a sentence/exercise item. This is primarily aimed at Swedish, but other languages are possible candidates as well.


Background

Multiple-choice items for training vocabulary knowledge are a well-documented exercise format. However, when it comes to the automatic generation of this exercise type, selecting genuinely appropriate distractors becomes a complicated problem. For example, if a learner wants to train vocabulary from the topical domain of “Medical services and SOS”, answer options from the same topical domain might be generated as follows:

Parents couldn't afford to buy the necessary _________.

Choices: pincers, medicine, tablets, blood, hospital, nurse (correct answer: medicine)

More than one alternative from the example above can be used to fill the gap (namely pincers, medicine and tablets). It is therefore important to select distractors that cannot replace the correct answer in the context of the sentence, either semantically or collocationally, e.g. in the case above to suggest the choices: medicine, blood, hospital, nurse, emergency room.
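
One possible way to approximate the collocational part of this check is to reject any candidate that is attested together with a key context word in a corpus. The sketch below is hypothetical: the co-occurrence counts are invented, the function name `select_distractors` is made up for illustration, "buy" is chosen by hand as the context word, and "medicine" is taken to be the correct answer, as the suggested choices imply.

```python
def select_distractors(correct, candidates, context_word, cooccurrence, max_count=0):
    """Keep candidates attested with the context word at most max_count times.

    cooccurrence maps (context_word, candidate) pairs to corpus counts;
    pairs missing from the mapping are treated as unattested.
    """
    kept = []
    for candidate in candidates:
        if candidate == correct:
            continue  # the correct answer is never a distractor
        if cooccurrence.get((context_word, candidate), 0) <= max_count:
            kept.append(candidate)
    return kept


# Invented counts (assumption): in practice they would come from a corpus.
COOCCURRENCE = {
    ("buy", "medicine"): 12,
    ("buy", "tablets"): 7,
    ("buy", "pincers"): 3,
    ("buy", "blood"): 0,
}

candidates = ["pincers", "medicine", "tablets", "blood", "hospital", "nurse"]
distractors = select_distractors("medicine", candidates, "buy", COOCCURRENCE)
print(distractors)  # ['blood', 'hospital', 'nurse']
```

A raw count threshold is a crude proxy. It says nothing about semantic fit, which is why the project asks for a study of context modeling rather than a fixed filter.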


Problem description

The aim of this work is thus:

  • to study the literature on the topic of distractor selection, lexical semantics and context modeling
  • to implement/test some approach(es) for a semantically aware selection of distractors
  • to embed the selection algorithm into Lärka as a web service
  • to evaluate/test on users (language learners, teachers, linguists, etc.)


Recommended skills:

  • Python
  • interest in Lexical Semantics


Supervisor(s)

  • Elena Volodina
  • potentially others from Språkbanken/FLOV

 

3. Classification of learner essays by achieved proficiency level

Goal

Developing an algorithm (as web services) for automatic classification of Swedish learner essays by the proficiency level reached.


Background

The suggested approach is to use machine learning for essay classification. The challenge is to identify features that are both grounded in Second Language Acquisition (SLA) research and informative for the task at hand.

The classification will be made in terms of the levels of proficiency according to the Common European Framework of Reference (CEFR), which covers 6 learner levels: A1 (beginner), A2, B1, B2, C1, C2 (near-native). At the moment we have electronic corpora of essays at levels B1, B2, and C1. Essays at A2 are hand-written and have not yet been digitized and annotated (which can presumably be done in time for the project, if someone picks this topic).
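
To make the idea of features concrete, the following minimal sketch computes three surface measures for an essay: mean sentence length, type-token ratio and mean word length. These particular features and the function name `extract_features` are assumptions chosen for illustration, not the feature set of the project; features motivated by SLA research would have to be added on top.

```python
import re


def extract_features(essay):
    """Compute simple surface features for one essay (illustrative only)."""
    # Split on sentence-final punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    tokens = re.findall(r"\w+", essay.lower())
    if not sentences or not tokens:
        return {
            "mean_sentence_length": 0.0,
            "type_token_ratio": 0.0,
            "mean_word_length": 0.0,
        }
    return {
        "mean_sentence_length": len(tokens) / len(sentences),
        "type_token_ratio": len(set(tokens)) / len(tokens),
        "mean_word_length": sum(len(t) for t in tokens) / len(tokens),
    }


# Invented two-sentence example (assumption), not from the essay corpora.
features = extract_features("Jag bor i Göteborg. Jag läser svenska.")
print(features)
```

Feature dictionaries like this one can be fed to any standard classifier, with the CEFR level of each essay as the label.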


Problem description

The steps for this project would include:

  • background reading on SLA, CEFR, essay grading and learner essay classification by levels. See one example of Swedish essay grading (NOT in terms of levels, but in terms of grades, i.e. (Väl/Icke) Godkänd): http://www.ling.su.se/english/nlp/tools/automated-essay-scoring
  • testing approaches for the best-performing classification
  • implementation of web service(s) for learner essay classification
  • (potentially) implementation of Lärka-based user interface where new essays can be tested
  • (potentially) evaluation of the results with teachers & new essays


Recommended skills:

  • Python
  • jQuery
  • interest in machine learning


Supervisor(s)

  • Elena Volodina
  • potentially others from Språkbanken/FLOV

 

4. Developing an adaptive diagnostic vocabulary/grammar test for Swedish

Goal

Implement an adaptive diagnostic test for vocabulary and/or grammar for Swedish, based on Second Language Acquisition (SLA) research and frequency statistics available from the COCTAILL corpus.


Background

The currently developed application Lärka, www.spraakbanken.gu.se/larka, is intended for computer-assisted language learning of L2 Swedish. Lärka generates a number of exercises based on corpora available through Korp. Attempts are being made to align generated exercises with CEFR proficiency scales (http://www.coe.int/t/dg4/linguistic/Source/Framework_en.pdf). The actual users, however, may not know their level when they start working with the exercise generator. It is therefore important (and user-friendly) to offer some sort of placement/diagnostic test for those who may need it.

Some examples of existing diagnostic tests for vocabulary are:

  • CATSS – Computer-Adaptive Test of Size and Strength http://hcc.haifa.ac.il/~blaufer/
  • Levels tests (by Tom Cobb) http://www.er.uqam.ca/nobel/r21270/levels/
  • Levels Test (by Paul Nation) http://www.victoria.ac.nz/lals/about/staff/paul-nation


Problem description

The aims of this work are the following:

  • to study literature on diagnostic testing for different language skills and competences that are relevant for the CEFR
  • to find out about other “actors” dealing with CEFR-based tests for Swedish, especially for placement/diagnosis; as a result, to suggest a format for a placement test for one or (better) a range of language skills and competences mentioned in the CEFR
  • to implement the suggested test(s) in the form of web services that can be embedded into the Lärka platform (and eventually to develop the user interface module for it). Here it would be interesting, for example, to see formats where free answers could be provided and scored
  • to evaluate/test on users (language learners, teachers, linguists, etc.)
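
One simple way to make such a test adaptive is a staircase procedure: the next item is one level harder after a correct answer and one level easier after an error. The sketch below is only an illustration; the callback `answer_item`, the starting level and the number of items are assumptions, and the final estimate (the highest level answered correctly) is deliberately crude.

```python
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def run_adaptive_test(answer_item, num_items=8, start_index=2):
    """Run a simple staircase test over the CEFR levels.

    answer_item(level) -> bool is a hypothetical callback that presents
    one item at the given level and reports whether it was answered correctly.
    Returns the highest level answered correctly, or None if there is none.
    """
    index = start_index
    highest_correct = -1
    for _ in range(num_items):
        if answer_item(CEFR_LEVELS[index]):
            highest_correct = max(highest_correct, index)
            index = min(index + 1, len(CEFR_LEVELS) - 1)  # step up
        else:
            index = max(index - 1, 0)  # step down
    return CEFR_LEVELS[highest_correct] if highest_correct >= 0 else None


def simulated_learner(level):
    """Simulated learner (assumption): answers correctly up to and including B2."""
    return CEFR_LEVELS.index(level) <= CEFR_LEVELS.index("B2")


estimate = run_adaptive_test(simulated_learner)
print(estimate)  # B2
```

Real learners answer inconsistently, so a practical test would need several items per level and a more robust scoring rule than the single best answer used here.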


Recommended skills:

  • Python
  • interest in Lexical Semantics


Supervisor(s)

  • Elena Volodina
  • potentially others from Språkbanken

 

5. Collocations for learners of Swedish

Goal

Generate a list of collocations, phrasal verbs, set phrases and idioms important for learners of Swedish, linked to proficiency levels, for use in Lärka.


Background

The currently developed application Lärka, www.spraakbanken.gu.se/larka, is intended for computer-assisted language learning of L2 Swedish. Lärka generates a number of exercises based on corpora available through Korp, one of them focusing on vocabulary. It has been mentioned on several occasions that we should include multi-word expressions in our exercise generator. This also complies with the CEFR “can-do” statements at different levels of proficiency (http://www.coe.int/t/dg4/linguistic/Source/Framework_en.pdf). It is, however, a non-trivial task to identify the items that should be included in the curriculum, and it is even less clear how the selected items can be assigned to different proficiency levels.


Problem description

The aims of this work are the following:

  • to study literature on collocations etc. in general and in the L2 context especially, paying special attention to the CEFR guidelines; to make an overview of the practices for training collocations etc. used in other applications and in (online) dictionaries/lexicons
  • to generate a list of collocations, (primarily) by automatic analysis of COCTAILL, a corpus of coursebook texts used for teaching Swedish. Studying other materials available outside COCTAILL, e.g. books written by Anna Hallström or multi-word expressions in Saldo and Lexin, may also prove beneficial; the challenge, however, would be to define at which level these items should be introduced. To get some inspiration, have a look at English Vocabulary Profile: http://vocabulary.englishprofile.org/staticfiles/about.html (user: englishprofile, password: vocabulary)
  • (potentially) to implement one or more of the suggested exercise formats as web services + user interface in Lärka
  • to evaluate/test on users (language learners, teachers, linguists, etc.)
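
A common first step in the automatic analysis mentioned above is to rank adjacent word pairs by pointwise mutual information (PMI). The sketch below shows that step under stated assumptions: the token sequence is invented, not taken from COCTAILL, the function name `pmi_bigrams` and the frequency cut-off are illustrative, and linking items to proficiency levels is not covered.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total_unigrams = len(tokens)
    total_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (first, second), count in bigram_counts.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_bigram = count / total_bigrams
        p_first = unigram_counts[first] / total_unigrams
        p_second = unigram_counts[second] / total_unigrams
        scores[(first, second)] = math.log2(p_bigram / (p_first * p_second))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented token sequence (assumption), not taken from COCTAILL.
tokens = "hon tycker om kaffe och han tycker om te och de tycker om mjölk".split()
ranked = pmi_bigrams(tokens)
print(ranked[0][0])  # ('tycker', 'om')
```

PMI favours rare pairs, hence the frequency cut-off. Other association measures could be compared as part of the literature study.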


Recommended skills:

  • Python
  • interest in Lexical Semantics and Second Language Acquisition


Supervisor(s)

  • Elena Volodina
  • potentially others from Språkbanken/FLOV