# --------------------------------------------------------- #
# Description of contents of files:                          #
#   parole_freq_1k                                           #
#   parole_freq_10k                                          #
#   parole_freq_50k                                          #
#   parole_freq_100k                                         #
#   parole_freq_1                                            #
#   parole_freq_gt_1                                         #
# --------------------------------------------------------- #

The files contain different selections from the material described
below.

parole_freq_1k   : the 1000 most common words
parole_freq_10k  : the 10000 ------ " --------
parole_freq_50k  : the 50000 ------ " --------
parole_freq_100k : the 100000 ------ " --------
parole_freq_1    : all words with frequency 1 (see COMMENT below)
parole_freq_gt_1 : all words with frequency greater than 1

All files are sorted in frequency order.

The material consists of "every word" from the PAROLE corpus.
"Every word" means the words remaining after the following
processing. A "word" is a token produced by the tokenization
underlying the PAROLE corpus, as shown at
http://spraakbanken.gu.se/parole/. Unfortunately, the inner
workings of this tokenizer are not documented. Tokens containing
strings of digits, or consisting only of punctuation, are removed
before the frequencies are calculated and are therefore missing.
All words are normalized to lower case.

The PAROLE corpus: circa 20 million running words of modern
(1976-1997) written text material from different sources.
More information (in Swedish): http://spraakbanken.gu.se/parole/

# --------------------------------------------------------- #
# COMMENT regarding parole_freq_1:                           #
# --------------------------------------------------------- #

The majority of tokens with frequency 1 are, of course, "ordinary"
low-frequency words. These include a large number of more or less
ad hoc compounds. There is also a significant number of "oddities"
resulting from poor tokenization. Some examples are the "words"

  "___________________g/km"
  "---dessa"
  "-/"
  "···det"

# --------------------------------------------------------- #
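The PAROLE tokenizer itself is undocumented, but the filtering and
counting steps described above (drop tokens containing digits or
consisting only of punctuation, lower-case the rest, count, sort by
descending frequency, then select top-N / frequency-1 /
frequency-greater-than-1 subsets) can be sketched as follows. This is
a minimal illustration, not the actual PAROLE pipeline; the function
names and the regex definitions of "digits" and "punctuation" are
assumptions.

```python
import re
from collections import Counter

def build_frequency_list(tokens):
    """Hypothetical sketch of the filtering described in this README:
    drop tokens containing digits, drop tokens consisting only of
    punctuation, lower-case the rest, and return (word, frequency)
    pairs sorted by descending frequency."""
    kept = []
    for tok in tokens:
        if re.search(r"\d", tok):   # contains a string of digits -> removed
            continue
        if not re.search(r"\w", tok):  # no letter/word character: punctuation only
            continue
        kept.append(tok.lower())
    counts = Counter(kept)
    # Sort by frequency (descending); ties broken alphabetically for
    # a stable output order (the README does not specify tie-breaking).
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

def select(freq_list, top_n=None, freq_eq=None, freq_gt=None):
    """Mimic the file selections: the N most common words
    (parole_freq_1k etc.), frequency == 1 (parole_freq_1),
    or frequency > 1 (parole_freq_gt_1)."""
    if top_n is not None:
        return freq_list[:top_n]
    if freq_eq is not None:
        return [(w, f) for w, f in freq_list if f == freq_eq]
    if freq_gt is not None:
        return [(w, f) for w, f in freq_list if f > freq_gt]
    return freq_list
```

Note that oddities like "---dessa" survive this kind of filter (they
contain at least one word character), which matches the COMMENT above
about such tokens appearing in parole_freq_1.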