Lexical resources for Natural Language Processing (NLP), Second Language Acquisition (SLA) and other applied disciplines differ in the lexical unit they use as their main entry. The most widespread choice is the lemma, i.e. the base form of a word, or the lemgram, i.e. the base form plus its part of speech (POS); cf. François et al. (2016) and Kilgarriff et al. (2014). This is possibly because such resources are easy to create with automatic annotation pipelines and because the result resembles a dictionary. Other approaches use flemmas, which group all related polysemous/homonymous forms (e.g. pause, noun, and to pause, verb) and their inflected forms in a single entry (e.g. Stoeckel et al. 2020); senses, where each meaning of a homonymous or polysemous lemgram is listed separately (e.g. Capel 2010, 2012); or word families, where words (tokens, flemmas, lemgrams) sharing the same root/base are grouped together (e.g. Gardner and Davies 2014).
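To make the difference between these units concrete, here is a minimal Python sketch that groups the same handful of tokens as lemgrams, flemmas and word families. The `Token` type, the `group_by` helper and the hand-annotated toy tokens are assumptions made purely for illustration; they are not taken from the SweL2P data or tooling. Sense-based grouping is left out because it would require sense annotation.

```python
from collections import defaultdict
from typing import Callable, Hashable, NamedTuple


class Token(NamedTuple):
    form: str   # surface form as it appears in a text
    lemma: str  # base form
    pos: str    # part of speech
    root: str   # shared root, used for word-family grouping


# Hand-annotated toy tokens (illustrative only, not corpus data).
TOKENS = [
    Token("pause", "pause", "NOUN", "pause"),
    Token("pauses", "pause", "NOUN", "pause"),
    Token("paused", "pause", "VERB", "pause"),
    Token("jul", "jul", "NOUN", "jul"),
    Token("julgranar", "julgran", "NOUN", "jul"),
    Token("julpyntar", "julpynta", "VERB", "jul"),
]


def group_by(tokens: list[Token], key: Callable[[Token], Hashable]) -> dict:
    """Collect the surface forms that end up under each entry key."""
    groups = defaultdict(set)
    for token in tokens:
        groups[key(token)].add(token.form)
    return dict(groups)


# Lemgram: base form + POS, so noun and verb "pause" are separate entries.
lemgrams = group_by(TOKENS, lambda t: (t.lemma, t.pos))
# Flemma: base form regardless of POS, so both "pause" entries merge.
flemmas = group_by(TOKENS, lambda t: t.lemma)
# Word family: everything sharing a root, so all jul-compounds merge.
families = group_by(TOKENS, lambda t: t.root)

for name, groups in [("lemgrams", lemgrams), ("flemmas", flemmas), ("families", families)]:
    print(name, len(groups))
```

The same six tokens produce fewer and fewer entries as the unit gets coarser, which is one way to see why the choice of unit changes what a vocabulary list looks like.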
Needless to say, the choice of the main lexical unit in a vocabulary list has a significant impact on how the list can be applied. The appropriateness of different types of lexical units in different contexts is debated (see e.g. Brown 2018 for a discussion within the context of language learning), and each unit has its upsides and downsides. Without taking sides, the Swedish L2 Profile (SweL2P) makes it possible to explore and apply vocabulary in several ways: using lemgrams, sense-based lemgrams (cf. lexemes) and word families (grouped around a shared root). We hypothesize that the value of the different approaches can best be studied when the same lexical items can be compared across several ways of organizing them.
Using the approaching Christmas as an excuse, we have looked through the about-to-be-launched Swedish Word Family resource to see what it tells us about Christmas (Swedish jul). The Swedish jul family boasts 23 members in the corpora for Swedish as a second language; the excerpt below is taken from the online tool for browsing the Swedish L2 word family resource. Each word in the figure below is labelled with the level of the text in which it was first observed.
We can see that the noun jul appears for the first time at the beginner level (A1), and the family keeps growing until level B2 (upper intermediate), after which we do not observe new lemgrams containing jul any more.
| CEFR level | Textbook corpus | Learner essay corpus |
|---|---|---|
| A2 | julbord, juldagsmorgon, julklapp, jullov, julotta, jultomte, God Jul! | jul, jullov, julgran |
| B2 | julafton, julbock, juldag, julfirande, julgåva, julgran, julhandel, julklappsrim, julkort, julmat, julöl, julpynt, julpynta (verb), jultidning | julpynt |
The root jul is obviously very productive and is used in plenty of compounds, all of them nouns except the verb julpynta and the interjection God Jul! A glance at the list above also reveals that many other existing compounds are missing from it, such as julmust, julgröt, julstök, julpyssel and many, many others.
The size of the family suggests how popular this holiday season is. But is it really so? How does it compare to other (national) holidays? Let’s have a look at the families for (New) Year, Easter, (Mid)summer and Halloween (with the corresponding Swedish holiday, allhelgonadag).
- In the family for år (year), five (5) members refer to New Year: (A2) nyår, nyårsafton, nyårslöfte, Gott Nytt År; (B1) nyårsdag.
- The family for påsk (Easter) contains seven (7) members: (A2) påsk, påsklov; (B1) i påskas, påskafton, påskkärring; (B2) påskägg, påskdag.
- The (Mid)summer family contains five (5) members related to the holiday: (A2) midsommar, midsommarafton, midsommarstång; and (B2) midsommardag, midsommarnatt.
- Halloween is not represented at all in the two corpora used for constructing the Swedish L2 word family resource; however, the more traditionally Swedish holiday, allhelgonadag, is represented with one family member in the helg-family.
The holidays can thus be ordered by family size as follows: Christmas (23) > Easter (7) > New Year (5) = Midsummer (5) > Halloween/allhelgonadag (1). What does this tell us? We may assume that immigrants taking courses in Swedish as a second language learn that traditions around Christmas are far more important than traditions connected to other holidays. Besides, we can observe that the holidays related to religion (Christmas and Easter) come first. This provides some food for thought as to why other holidays are less represented. Linguistically speaking, most vocabulary connected to holidays consists of nominal compounds. Interestingly, the holiday-related vocabulary in the L2 data contains only one verb, julpynta, and a few interjections (compare the presence of påskstäda, julhandla, fira jul and other examples in other data sources).
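The ordering can be reproduced with a few lines of Python. This is only a sketch: the counts are the family sizes reported in this post, while the function names and the alphabetical tie-break are assumptions made for the example.

```python
# Family sizes as reported in this post (holiday-related members only).
FAMILY_SIZES = {
    "Christmas": 23,
    "New Year": 5,
    "Easter": 7,
    "Midsummer": 5,
    "Halloween/allhelgonadag": 1,
}


def rank_by_size(sizes: dict[str, int]) -> list[tuple[str, int]]:
    """Sort holidays by descending family size; ties are broken alphabetically."""
    return sorted(sizes.items(), key=lambda item: (-item[1], item[0]))


def format_ranking(ranked: list[tuple[str, int]]) -> str:
    """Join the ranking with '>' between different sizes and '=' between ties."""
    parts = []
    for i, (name, size) in enumerate(ranked):
        if i > 0:
            parts.append(" = " if ranked[i - 1][1] == size else " > ")
        parts.append(f"{name} ({size})")
    return "".join(parts)


print(format_ranking(rank_by_size(FAMILY_SIZES)))
```

Marking ties explicitly makes it visible that New Year and Midsummer have families of the same size, so their relative order carries no information.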
The Swedish L2 word family resource makes it possible to trace various other trends in the language, including derivational morphology and word-building mechanisms. We expect to release the resource during 2022. Look out for more news about it!
With this – we wish you a Merry Christmas and a Happy New Year!
Elena Volodina, Therese Lindström Tiedemann and Yousuf Ali Mohammed (the Swedish L2 profiles project)