Files in this directory are made available under the following conditions:

# --------------------------------------------------------- #
# ----                     license                     ---- #
# --------------------------------------------------------- #
#
# Copyright (c) 2004 Språkbanken, Göteborgs universitet
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this resource and associated documentation files (the
# "Resource"), to deal in the Resource without restriction, including
# without limitation the rights to use, copy, modify, merge, publish,
# distribute, sublicense, and/or sell copies of the Resource, and to
# permit persons to whom the Resource is furnished to do so, subject to
# the following conditions:
#
# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Resource.
#
# THE RESOURCE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
# OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
# IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
# CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
# TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
# RESOURCE OR THE USE OR OTHER DEALINGS IN THE RESOURCE.
# --------------------------------------------------------- #

Files in this directory :

    license.txt
    README.txt                          # this file

The actual resources:

    p00-01_statlist_sc_utf8.tsv.zip
    p00-01_statlist_sc_utf8.xml.zip

The material in this directory presents frequency listings from the
corpus "p00-01". The corpus consists of 51.749.999 tokens (roughly
51.7 million) collected from three of the four major Swedish newspapers
during the periods given below.

    running words   short name   full name
    21.733.352      gp           Göteborgs-Posten
    10.269.642      sds          Sydsvenska Dagbladet Snällposten
    19.747.005      svd          Svenska Dagbladet

collection periods :

             period                     articles
    sds      2000-02-07 - 2000-03-14        9007
             2000-05-15 - 2000-06-19        8688
             2000-08-21 - 2000-09-25        8371
             2000-11-27 - 2001-01-04        8386
    gp       2001-01-02 - 2001-10-31       81137
    svd      2000-01-03 - 2000-12-31       52844
    total                                    16K

The listings are :

    standard case (not normalized)
    unicode (utf8)

For your convenience the files come in two different file formats:

    tsv (tab separated values)
    xml (extensible markup language)

In the UNIX environment the most frequent compression format is tgz.
In Windows environments, however, the PKZIP format is almost universally
adopted. Therefore Windows files are made available solely as zip
archives.

The combination of the above is reflected in the filenames. Thus

    p00-01_statlist_sc_utf8.xml.zip

is a zip archive containing an xml file in utf8 encoding showing a
standard case version of a list of words from the corpus p00-01.
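
The filename convention above can be taken apart mechanically. The
following Python sketch is not part of the distribution; the names
ListingName, parse_listing_name and iter_tsv_rows are hypothetical
helpers introduced here for illustration. It assumes that each zip
archive holds a single listing file, and because the column layout of
the tsv file is not described in this README, rows are returned as
plain lists of fields without interpretation.

```python
import csv
import io
import zipfile
from typing import Iterator, List, NamedTuple


class ListingName(NamedTuple):
    """Parts of a listing filename (hypothetical helper type)."""
    corpus: str       # e.g. "p00-01"
    case: str         # "sc" = standard case
    encoding: str     # e.g. "utf8"
    file_format: str  # "tsv" or "xml"
    archive: str      # "zip"


def parse_listing_name(filename: str) -> ListingName:
    """Split '<corpus>_statlist_<case>_<encoding>.<format>.<archive>'."""
    stem, file_format, archive = filename.rsplit(".", 2)
    corpus, kind, case, encoding = stem.split("_")
    if kind != "statlist":
        raise ValueError(f"not a frequency listing: {filename!r}")
    return ListingName(corpus, case, encoding, file_format, archive)


def iter_tsv_rows(archive, encoding: str = "utf-8") -> Iterator[List[str]]:
    """Yield the tab separated fields of each line in a zipped tsv file.

    Assumption: the listing is the first (and only) member of the archive.
    `archive` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        member = zf.namelist()[0]
        with zf.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            # QUOTE_NONE: treat quote characters in word forms as plain text.
            yield from csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
```

For example, parse_listing_name("p00-01_statlist_sc_utf8.xml.zip")
yields corpus "p00-01", case "sc", encoding "utf8", format "xml" and
archive type "zip". iter_tsv_rows is meant for the tsv variant only;
the xml variant would need an XML parser instead.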