The Swedish Culturomics Gigaword Corpus

One billion Swedish words from 1950 and onwards. Code to extract data from the corpus, as well as usage instructions, can be downloaded from https://svn.spraakdata.gu.se/sb-arkiv/tools/gigaword/

Please cite the dataset as follows:
Stian Rødven Eide, Nina Tahmasebi, Lars Borin. 2016. The Swedish Culturomics Gigaword Corpus: A One Billion Word Swedish Reference Dataset for NLP.

Sentences per year for each genre (a dash means no data for that year and genre)

Year   fiction  government       news     science  socialmedia
1950         -     420,413          -           -            -
1960         -     424,920          -           -            -
1965         -           -     53,624           -            -
1970         -     459,867          -           -            -
1976         -           -     89,175           -            -
1977   499,030           -          -           -            -
1980         -     534,194          -           -            -
1981   307,597           -          -           -            -
1987    97,398           -    364,226           -            -
1990         -     551,988          -           -            -
1991   330,127           -          -           -            -
1992         -           -          -      44,538            -
1994         -     391,882  1,538,748           -            -
1995         -           -    514,797           -            -
1996         -           -    449,148     118,542            -
1997         -           -    980,230     125,096            -
1998         -           -    804,178     121,895        1,638
1999   194,699           -          -     113,568       40,099
2000         -           -          -     109,289       12,945
2001         -           -  1,393,257     115,012       20,006
2002         -      41,066  2,610,740     110,830      191,234
2003         -           -  2,095,700      96,778       16,382
2004         -           -  2,094,251     103,881      487,447
2005         -           -  3,013,787      85,023      985,094
2006         -      50,684  2,634,386           -      408,425
2007         -           -  2,530,808     523,102    1,638,311
2008         -           -  2,607,657           -      754,801
2009         -           -  2,795,855           -      605,194
2010         -           -  2,635,687           -      790,148
2011         -           -  2,973,928           -      957,017
2012         -           -  2,681,277     673,820    1,589,999
2013         -           -  2,501,426           -      594,982
2014         -           -          -           -      590,146
2015         -           -          -  12,293,254      187,253

Files

Each file contains a scrambled version of the corpus.

File                  Size       Modified    Licence
gigaword-1950-59.tar  92.69 MB   2016-06-07  CC BY 4.0
gigaword-1960-69.tar  107.78 MB  2016-06-07  CC BY 4.0
gigaword-1970-79.tar  175.03 MB  2016-06-07  CC BY 4.0
gigaword-1980-89.tar  217.9 MB   2016-06-07  CC BY 4.0
gigaword-1990-99.tar  1.05 GB    2016-06-07  CC BY 4.0
gigaword-2000-09.tar  5.48 GB    2016-06-07  CC BY 4.0
gigaword-2010-15.tar  4.32 GB    2016-06-07  CC BY 4.0
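
Before running the official extraction code, it can be useful to check what a downloaded archive contains. The following is a minimal Python sketch using only the standard library. It assumes an archive such as gigaword-1950-59.tar has already been downloaded to the working directory; the helper name list_archive_members is hypothetical, and the internal layout of the archives is not described on this page, so the sketch only lists member names and sizes rather than parsing anything.

```python
import tarfile
from pathlib import Path


def list_archive_members(archive_path, limit=10):
    """Return up to `limit` (name, size in bytes) pairs for regular files in a tar archive."""
    members = []
    # Mode "r" lets tarfile detect compression transparently, if there is any.
    with tarfile.open(archive_path, "r") as archive:
        for member in archive:
            if len(members) >= limit:
                break
            if member.isfile():
                members.append((member.name, member.size))
    return members


if __name__ == "__main__":
    # Assumption: the archive sits in the current directory under its published name.
    path = Path("gigaword-1950-59.tar")
    if path.exists():
        for name, size in list_archive_members(path):
            print(f"{size:>12}  {name}")
    else:
        print(f"{path} not found; download it first.")
```

Listing members reads only the tar headers, so it stays reasonably fast even for the multi-gigabyte archives.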

Type

  • Corpus

Language

Swedish

Size

Sentences: 59,736,642
Tokens: 1,015,635,151
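
As a quick sanity check on these figures, the two totals imply an average sentence length of roughly 17 tokens. A small Python sketch of that arithmetic (the function name is illustrative only):

```python
# Totals as stated in the Size section above.
SENTENCES = 59_736_642
TOKENS = 1_015_635_151


def mean_tokens_per_sentence(tokens, sentences):
    """Average number of tokens per sentence."""
    return tokens / sentences


if __name__ == "__main__":
    print(f"{mean_tokens_per_sentence(TOKENS, SENTENCES):.2f} tokens per sentence")
```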

Contact

Språkbanken
sb-info@svenska.gu.se