NyLLex v2

Data citation

Språkbanken (2023). NyLLex v2 (updated: 2023-06-09). [Data set]. Enriched and distributed by Språkbanken. https://doi.org/10.23695/gp75-6148


A lexical resource derived from books published by Sweden's largest publisher of easy language texts. The entries are annotated with frequency counts distributed over six reading proficiency levels.

I. IDENTIFYING INFORMATION
Title*	NyLLex v 2.0
Subtitle	A Novel Resource of Swedish Words Annotated with Reading Proficiency Level
Created by*	Daniel Holmer (daniel.holmer@liu.se), Evelina Rennes (evelina.rennes@liu.se)
License(s)*	CC BY 4.0
Abstract*	NyLLex is a lexical resource derived from books published by Sweden's largest publisher of easy language texts. The entries are annotated with frequency counts distributed over six reading proficiency levels.
Funded by*	Vetenskapsrådet (2020-03580)
Cite as	[1]
Related datasets	[2], [3]

II. USAGE
Key applications	Text complexity analysis
Intended task(s)/usage(s)	(1) Lexical analysis of easy language texts. (2) Lexical simplification
Recommended evaluation measures	-
Dataset function(s)	-
Recommended split(s)	-
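
As an illustration of the text complexity use case, the sketch below assigns each lemma the lowest reading proficiency level at which it is attested, based on the raw per-level counts (the n_level1 to n_level6 columns described under "Format"). This is one possible heuristic, not the authors' method. The names TOY_LEXICON, first_attested_level and level_profile are hypothetical, and the counts are invented for the example.

```python
from collections import Counter

# Invented raw counts per level (n_level1..n_level6); NOT real NyLLex values.
TOY_LEXICON = {
    "hund": [3, 4, 0, 1, 0, 0],
    "resa": [0, 2, 5, 1, 0, 0],
    "demokrati": [0, 0, 0, 0, 2, 3],
}


def first_attested_level(counts):
    """Return the 1-based index of the first level with a nonzero count, or None."""
    for level, n in enumerate(counts, start=1):
        if n > 0:
            return level
    return None


def level_profile(lemmas, lexicon):
    """Count how many lemmas fall at each first-attested level.

    Lemmas missing from the lexicon (or with all-zero counts) are
    counted under the key "unknown".
    """
    profile = Counter()
    for lemma in lemmas:
        counts = lexicon.get(lemma)
        level = first_attested_level(counts) if counts is not None else None
        profile[level if level is not None else "unknown"] += 1
    return profile


profile = level_profile(["hund", "resa", "demokrati", "xyz"], TOY_LEXICON)
```

A real analysis would first need to lemmatize and POS-tag the input text so that tokens can be matched against the word and POS columns of the resource.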

III. DATA
Primary data*	Words (text)
Language*	Swedish
Dataset in numbers*	14983 entries
Nature of the content*	Each entry in the resource contains a word, its part-of-speech tag (SUC-style), and a number of frequencies over different readability levels. Multi-word expressions are denoted by multiple words linked by underscores.
Format*	Comma-separated values (CSV) with the following columns:
	word: a word in its lemma form
	POS: a part-of-speech tag in the SUC-format
	level1_freq - level6_freq (six headers): the dispersed frequency of the word in the given reading proficiency level
	total_freq: the adjusted frequency for the word across all reading proficiency levels
	n_level1 - n_level6 (six headers): raw frequency of the word in the given reading proficiency level
	n_total: raw frequency for the word across all reading proficiency levels
Data source(s)*	The words are collected from 247 easy language books published by NyponVilja förlag. The books were OCR-scanned from PDF format and preprocessed by the authors. The book dataset is not publicly available for copyright reasons.
Data collection method(s)*	See [1]
Data selection and filtering*	See [1]
Data preprocessing*	See [1]
Data labeling*	See "Format"
Annotator characteristics	-
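
The column layout given under "Format" can be read with Python's standard csv module. The sketch below is a minimal example under stated assumptions: the header names are taken to match the "Format" field exactly, the column order is a guess (DictReader does not depend on it), and the two sample rows are invented, not real NyLLex values. The helper name read_entries is hypothetical.

```python
import csv
import io

LEVELS = range(1, 7)  # six reading proficiency levels

# Header names follow the "Format" field; the column order is an assumption.
FIELDNAMES = (
    ["word", "POS"]
    + [f"level{i}_freq" for i in LEVELS]
    + ["total_freq"]
    + [f"n_level{i}" for i in LEVELS]
    + ["n_total"]
)

# Two invented rows for illustration only; NOT real NyLLex values.
SAMPLE = io.StringIO(
    ",".join(FIELDNAMES) + "\n"
    "hund,NN,1.5,2.0,0.0,0.5,0.0,0.0,4.0,3,4,0,1,0,0,8\n"
    "till_exempel,AB,0.0,0.0,1.0,0.0,2.0,0.0,3.0,0,0,2,0,4,0,6\n"
)


def read_entries(handle):
    """Parse rows into dicts with typed frequency fields."""
    entries = []
    for row in csv.DictReader(handle):
        entries.append({
            "word": row["word"],
            "pos": row["POS"],
            # Multi-word expressions are linked by underscores.
            "tokens": row["word"].split("_"),
            "level_freq": [float(row[f"level{i}_freq"]) for i in LEVELS],
            "total_freq": float(row["total_freq"]),
            # int(float(...)) tolerates counts written as "3" or "3.0".
            "n_level": [int(float(row[f"n_level{i}"])) for i in LEVELS],
            "n_total": int(float(row["n_total"])),
        })
    return entries


entries = read_entries(SAMPLE)
multiword = [e["word"] for e in entries if len(e["tokens"]) > 1]
```

To read the distributed file instead of the sample, pass an open file handle, for example open("nyllex_v2.csv", encoding="utf-8", newline=""); the UTF-8 encoding is an assumption.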

IV. ETHICS AND CAVEATS
Ethical considerations	The books contain words that, when taken out of context, can be seen as offensive. The authors have manually removed such entries but cannot guarantee that the resource is completely free of offensive words.
Things to watch out for	-

V. ABOUT DOCUMENTATION
Data last updated*	20220909
Which changes have been made, compared to the previous version*	This version contains more entries than described in the original paper, for two reasons: 1) The number of books available as source material increased (from 247 to 280). 2) The method for filtering out bad entries caused by erroneous OCR readings of the source PDFs was updated. In practice, the number of entries (unique words) in the resource is significantly larger (more than double) in this version, since entries that appear only once in the source material are no longer discarded. However, the total frequency counts across all entries differ by only around 2% between this updated version and the paper version.
Access to previous versions	-
This document created*	20221219, Daniel Holmer (daniel.holmer@liu.se)
This document last updated*	20230608, Aleksandrs Berdicevskis (aleksandrs.berdicevskis@gu.se)
Where to look for further details	See [1] and https://gitlab.liu.se/danho69/nyllex/

VI. OTHER
Related projects
References	[1] Daniel Holmer and Evelina Rennes. 2022. NyLLex: A Novel Resource of Swedish Words Annotated with Reading Proficiency Level. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1326–1331, Marseille, France. European Language Resources Association. https://aclanthology.org/2022.lrec-1.141.pdf
	[2] Thomas François, Elena Volodina, Ildikó Pilán, Anaïs Tack. 2016. SVALex: a CEFR-graded lexical resource for Swedish foreign and second language learners. Proceedings of LREC 2016, Slovenia.
	[3] Elena Volodina, Ildikó Pilán, Lorena Llozhi, Baptiste Degryse, and Thomas François. 2016. SweLLex: Second language learners’ productive vocabulary. In Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition, pages 76–84, Umeå, Sweden. LiU Electronic Press.

Annotation

see description

Caveats

see description

References

Daniel Holmer and Evelina Rennes. 2022. NyLLex: A Novel Resource of Swedish Words Annotated with Reading Proficiency Level. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1326–1331, Marseille, France. European Language Resources Association. https://aclanthology.org/2022.lrec-1.141.pdf

Download

File	Size	Modified	Licence
nyllex_v2.csv lexicon (CSV)	1.46 MB	2023-06-09	CC-BY-4.0
