A lexical resource derived from books published by Sweden's largest publisher of easy language texts. The entries are annotated with frequency counts distributed over six reading proficiency levels.
I. IDENTIFYING INFORMATION | |
Title* | NyLLex v 2.0 |
Subtitle | A Novel Resource of Swedish Words Annotated with Reading Proficiency Level |
Created by* | Daniel Holmer (daniel.holmer@liu.se), Evelina Rennes (evelina.rennes@liu.se) |
License(s)* | CC BY 4.0 |
Abstract* | NyLLex is a lexical resource derived from books published by Sweden's largest publisher of easy language texts. The entries are annotated with frequency counts distributed over six reading proficiency levels. |
Funded by* | Vetenskapsrådet (2020-03580) |
Cite as | [1] |
Related datasets | [2], [3] |
II. USAGE | |
Key applications | Text complexity analysis |
Intended task(s)/usage(s) | (1) Lexical analysis of easy language texts. (2) Lexical simplification |
Recommended evaluation measures | - |
Dataset function(s) | - |
Recommended split(s) | - |
III. DATA | |
Primary data* | Words (text) |
Language* | Swedish |
Dataset in numbers* | 14983 entries |
Nature of the content* | Each entry in the resource contains a word, its part-of-speech tag (SUC-style), and a number of frequencies over different readability levels. Multi-word expressions are denoted by multiple words linked by underscores. |
Format* | Comma-separated values (CSV) with the following columns: |
word: a word in its lemma form | |
POS: a part-of-speech tag in the SUC-format | |
level1_freq - level6_freq (six headers): the dispersed frequency of the word in the given reading proficiency level | |
total_freq: the adjusted frequency for the word across all reading proficiency levels | |
n_level1 - n_level6 (six headers): raw frequency of the word in the given reading proficiency level | |
n_total: raw frequency for the word across all reading proficiency levels | |
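The column layout above can be read with a standard CSV parser. The following is a minimal sketch, assuming the header names are spelled exactly as listed (`word`, `POS`, `level1_freq` to `level6_freq`, `total_freq`, `n_level1` to `n_level6`, `n_total`), that the raw counts are stored as integers, and that the other frequencies parse as decimal numbers. The two sample rows and all of their values are invented for illustration and are not taken from NyLLex. The helper names `parse_entries` and `first_attested_level` are hypothetical.

```python
import csv
import io

LEVELS = range(1, 7)  # six reading proficiency levels

# Header names as assumed from the column description above.
HEADER = (["word", "POS"]
          + [f"level{i}_freq" for i in LEVELS] + ["total_freq"]
          + [f"n_level{i}" for i in LEVELS] + ["n_total"])

# Invented sample rows; the values do not come from the real resource.
SAMPLE = "\n".join([
    ",".join(HEADER),
    "hund,NN,1.5,2.0,2.5,3.0,3.5,4.0,16.5,3,4,5,6,7,8,33",
    "till_exempel,AB,0.0,0.0,0.5,1.0,1.5,2.0,5.0,0,0,1,2,3,4,10",
])


def parse_entries(handle):
    """Read entries from an open CSV file or file-like object."""
    entries = []
    for row in csv.DictReader(handle):
        entries.append({
            "word": row["word"],
            # Multi-word expressions are linked by underscores.
            "tokens": row["word"].split("_"),
            "pos": row["POS"],
            "freq": [float(row[f"level{i}_freq"]) for i in LEVELS],
            "total_freq": float(row["total_freq"]),
            "raw": [int(row[f"n_level{i}"]) for i in LEVELS],
            "n_total": int(row["n_total"]),
        })
    return entries


def first_attested_level(entry):
    """Return the lowest level with a non-zero raw count, or None."""
    for level, count in zip(LEVELS, entry["raw"]):
        if count > 0:
            return level
    return None


entries = parse_entries(io.StringIO(SAMPLE))
for entry in entries:
    print(entry["word"], entry["tokens"], first_attested_level(entry))
```

To read the actual file, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to `parse_entries`. The encoding is also an assumption and may need adjusting.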
Data source(s)* | The words are collected from 247 easy language books published by NyponVilja förlag. The books were OCR-scanned from PDF format and preprocessed by the authors. Unfortunately, the book dataset is not publicly available due to copyright reasons. |
Data collection method(s)* | See [1] |
Data selection and filtering* | See [1] |
Data preprocessing* | See [1] |
Data labeling* | See "Format" |
Annotator characteristics | - |
IV. ETHICS AND CAVEATS | |
Ethical considerations | The books contain words that, when taken out of context, can be seen as offensive. The authors have manually removed such entries, but cannot guarantee that the resource is completely devoid of offensive words. |
Things to watch out for | - |
V. ABOUT DOCUMENTATION | |
Data last updated* | 20220909 |
Which changes have been made, compared to the previous version* | This version contains more entries than described in the original paper, for two reasons: 1) The number of books available as source material has increased (from 247 to 280). 2) The method for filtering out bad entries caused by erroneous OCR readings of the source PDFs has been updated. In practice, this means that the number of entries (unique words) in the resource is significantly larger (more than double) in this version, since entries that appear only once in the source material are no longer discarded. However, for the total frequency counts over all entries, the difference between this updated version and the paper version is only around 2%. |
Access to previous versions | - |
This document created* | 20221219, Daniel Holmer (daniel.holmer@liu.se) |
This document last updated* | 20230608, Aleksandrs Berdicevskis (aleksandrs.berdicevskis@gu.se) |
Where to look for further details | See [1] and https://gitlab.liu.se/danho69/nyllex/ |
VI. OTHER | |
Related projects | |
References |
[1] Daniel Holmer and Evelina Rennes. 2022. NyLLex: A Novel Resource of Swedish Words Annotated with Reading Proficiency Level. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1326–1331, Marseille, France. European Language Resources Association. https://aclanthology.org/2022.lrec-1.141.pdf
[2] Thomas François, Elena Volodina, Ildikó Pilán, and Anaïs Tack. 2016. SVALex: a CEFR-graded lexical resource for Swedish foreign and second language learners. In Proceedings of LREC 2016, Slovenia.
[3] Elena Volodina, Ildikó Pilán, Lorena Llozhi, Baptiste Degryse, and Thomas François. 2016. SweLLex: Second language learners’ productive vocabulary. In Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition, pages 76–84, Umeå, Sweden. LiU Electronic Press. |