
What do the corpus statistic files contain?

Downloadable statistics files are available for most of our corpora. During a transitional period two formats are available, but the new format will eventually replace the old one. Both formats list word forms sorted by frequency, but they contain different columns.

Some files are very large, so it is usually better to download a file than to view it directly in your browser. To download a file, right-click its link and choose to save it.
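If you would rather fetch a file from a script than from the browser, the following minimal sketch streams it to disk in chunks so that the whole file never has to fit in memory. The function name download_to_disk and the chunk size are illustrative choices, not part of any official tool, and you must supply the file's URL yourself.

```python
import shutil
import urllib.request
from pathlib import Path


def download_to_disk(url: str, target: Path, chunk_size: int = 1 << 20) -> Path:
    """Copy the resource at `url` to `target`, one chunk at a time.

    `download_to_disk` is a hypothetical helper written for this example.
    """
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        # copyfileobj reads at most `chunk_size` bytes per step,
        # so memory use stays small even for very large files.
        shutil.copyfileobj(response, out, chunk_size)
    return target
```

Call it with the link address of the statistics file you want and a local path, for example download_to_disk(url, Path("stats.txt")).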

Columns in the old format

  1. word form (all word forms occurring in the corpus)
  2. part-of-speech (tagset can be found here)
  3. lemgram (if found)
  4. +/- indicating whether a compound analysis was found
  5. raw frequency (total number of occurrences)
  6. relative frequency (number of occurrences per one million words)
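
As a rough illustration of how the six columns fit together, the sketch below parses one line of an old-format file and recomputes the relative frequency from a raw count. It assumes that the columns are separated by tab characters and that a missing lemgram appears as an empty field. Neither assumption is confirmed on this page, so check a real file first. The names OldFormatRow, parse_old_row and relative_frequency are made up for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OldFormatRow:
    word_form: str
    pos: str
    lemgram: Optional[str]  # None when no lemgram was found
    has_compound: bool      # True for "+", False for "-"
    raw_freq: int           # total number of occurrences
    rel_freq: float         # occurrences per one million words


def parse_old_row(line: str) -> OldFormatRow:
    # Assumption: exactly six tab-separated columns per line.
    word, pos, lemgram, compound, raw, rel = line.rstrip("\n").split("\t")
    return OldFormatRow(
        word_form=word,
        pos=pos,
        lemgram=lemgram or None,  # assumption: empty field means "not found"
        has_compound=(compound == "+"),
        raw_freq=int(raw),
        rel_freq=float(rel),
    )


def relative_frequency(raw_freq: int, corpus_size: int) -> float:
    # Occurrences per one million words, as described for column 6.
    return raw_freq * 1_000_000 / corpus_size
```

For instance, a word form that occurs 1,500 times in a corpus of 50,000 words has a relative frequency of 30,000 per million.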

Columns in the new format

The following columns are present in all files in the new format (but some files may contain additional columns):

  1. word form (all word forms occurring in the corpus)
  2. part-of-speech (tagset can be found here)
  3. baseform (if found)
  4. SALDO sense (if found)
  5. lemgram (if found)
  6. compound analysis (for words that lack a SALDO sense but for which a compound analysis was found)
  7. raw frequency (total number of occurrences)
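
The new format has no relative-frequency column, but the value can be derived from the raw counts. The sketch below reads new-format rows, keeps any additional columns, and computes occurrences per million. It rests on several assumptions that this page does not confirm: columns are tab-separated, additional columns come after the seven listed above, and the sum of all raw frequencies in the file equals the corpus size. The names NewFormatRow, read_new_format and per_million are hypothetical.

```python
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class NewFormatRow(NamedTuple):
    word_form: str
    pos: str
    baseform: str
    saldo_sense: str
    lemgram: str
    compound: str
    raw_freq: int
    extra: Tuple[str, ...]  # additional columns, if the file has any


def read_new_format(lines: Iterable[str]) -> Iterator[NewFormatRow]:
    for line in lines:
        fields = line.rstrip("\n").split("\t")  # assumption: tab-separated
        # Skip lines that are too short or whose seventh field is not a
        # number (for example a header row, should the file contain one).
        if len(fields) < 7 or not fields[6].isdigit():
            continue
        yield NewFormatRow(*fields[:6], int(fields[6]), tuple(fields[7:]))


def per_million(rows: List[NewFormatRow]) -> List[Tuple[str, float]]:
    # Assumption: every token in the corpus is counted in exactly one row.
    total = sum(row.raw_freq for row in rows)
    return [(row.word_form, row.raw_freq * 1_000_000 / total) for row in rows]
```

Because read_new_format is a generator, it can be fed an open file object directly. per_million needs all rows at once, since the total must be known before any value can be computed.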

The statistics data is licensed under the CC BY 4.0 license.