The Swedish Culturomics Gigaword Corpus

Data citation

Rødven-Eide, Stian (2016). The Swedish Culturomics Gigaword Corpus (updated: 2016-06-07). [Data set]. Enriched and distributed by Språkbanken. https://doi.org/10.23695/3wmv-1z09

One billion Swedish words from 1950 onwards. Code for extracting data from the corpus, along with usage instructions, can be downloaded from https://svn.spraakbanken.gu.se/sb-arkiv/tools/gigaword/

Please cite the dataset using the following reference:
Stian Rødven Eide, Nina Tahmasebi, Lars Borin. 2016. The Swedish Culturomics Gigaword Corpus: A One Billion Word Swedish Reference Dataset for NLP.

Sentences per year for each genre
Year	fiction	government	news	science	socialmedia
1950	-	420 413	-	-	-
1960	-	424 920	-	-	-
1965	-	-	53 624	-	-
1970	-	459 867	-	-	-
1976	-	-	89 175	-	-
1977	499 030	-	-	-	-
1980	-	534 194	-	-	-
1981	307 597	-	-	-	-
1987	97 398	-	364 226	-	-
1990	-	551 988	-	-	-
1991	330 127	-	-	-	-
1992	-	-	-	44 538	-
1994	-	391 882	1 538 748	-	-
1995	-	-	514 797	-	-
1996	-	-	449 148	118 542	-
1997	-	-	980 230	125 096	-
1998	-	-	804 178	121 895	1 638
1999	194 699	-	-	113 568	40 099
2000	-	-	-	109 289	12 945
2001	-	-	1 393 257	115 012	20 006
2002	-	41 066	2 610 740	110 830	191 234
2003	-	-	2 095 700	96 778	16 382
2004	-	-	2 094 251	103 881	487 447
2005	-	-	3 013 787	85 023	985 094
2006	-	50 684	2 634 386	-	408 425
2007	-	-	2 530 808	523 102	1 638 311
2008	-	-	2 607 657	-	754 801
2009	-	-	2 795 855	-	605 194
2010	-	-	2 635 687	-	790 148
2011	-	-	2 973 928	-	957 017
2012	-	-	2 681 277	673 820	1 589 999
2013	-	-	2 501 426	-	594 982
2014	-	-	-	-	590 146
2015	-	-	-	12 293 254	187 253
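
The counts in the table above use spaces as thousands separators and "-" for years without data, so they need light cleaning before they can be summed. The following is a minimal sketch, assuming the table has been saved as tab-separated text; the function names `parse_count` and `totals_per_genre` are illustrative and not part of the corpus tools. The first column is treated as the year regardless of its header.

```python
import csv
import io


def parse_count(cell: str) -> int:
    """Convert a cell such as '1 538 748' or '-' to an integer (0 for '-')."""
    cell = cell.strip()
    if cell in ("", "-"):
        return 0
    # Drop the whitespace used as a thousands separator.
    return int("".join(cell.split()))


def totals_per_genre(table_text: str) -> dict[str, int]:
    """Sum the sentence counts in each genre column of the tab-separated table."""
    rows = csv.reader(io.StringIO(table_text), delimiter="\t")
    header = next(rows)
    genres = header[1:]  # the first column holds the year
    totals = {genre: 0 for genre in genres}
    for row in rows:
        if not row:
            continue  # skip blank lines
        for genre, cell in zip(genres, row[1:]):
            totals[genre] += parse_count(cell)
    return totals


# Two rows copied from the table, used here only as sample input.
sample = (
    "Year\tfiction\tgovernment\tnews\tscience\tsocialmedia\n"
    "1987\t97 398\t-\t364 226\t-\t-\n"
    "1994\t-\t391 882\t1 538 748\t-\t-\n"
)
print(totals_per_genre(sample))
```

Passing the full table text instead of `sample` should give the total number of sentences per genre across all listed years.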

Download

File	Size	Modified	Licence
gigaword-1950-59.tar corpus (XML, scrambled)	92.69 MB	2016-06-07	CC-BY-4.0
gigaword-1960-69.tar corpus (XML, scrambled)	107.78 MB	2016-06-07	CC-BY-4.0
gigaword-1970-79.tar corpus (XML, scrambled)	175.03 MB	2016-06-07	CC-BY-4.0
gigaword-1980-89.tar corpus (XML, scrambled)	217.9 MB	2016-06-07	CC-BY-4.0
gigaword-1990-99.tar corpus (XML, scrambled)	1.05 GB	2016-06-07	CC-BY-4.0
gigaword-2000-09.tar corpus (XML, scrambled)	5.48 GB	2016-06-07	CC-BY-4.0
gigaword-2010-15.tar corpus (XML, scrambled)	4.32 GB	2016-06-07	CC-BY-4.0
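
Before unpacking several gigabytes, it can be useful to inspect what an archive contains. The sketch below uses only Python's standard `tarfile` module to list the regular files in one downloaded archive. It assumes the file has already been downloaded to the working directory; the internal layout and file naming of the archives are not described on this page, so the code makes no assumptions about them. The helper name `list_members` is illustrative, and this is not a replacement for the extraction code distributed by Språkbanken.

```python
import tarfile
from pathlib import Path


def list_members(archive_path: str) -> list[tuple[str, int]]:
    """Return (name, size in bytes) for every regular file in a tar archive."""
    # mode "r:*" lets tarfile detect compression, if any, automatically.
    with tarfile.open(archive_path, mode="r:*") as archive:
        return [(m.name, m.size) for m in archive.getmembers() if m.isfile()]


# Assumption: the archive sits in the current working directory.
archive = Path("gigaword-1950-59.tar")
if archive.exists():
    for name, size in list_members(str(archive)):
        print(f"{size:>12}  {name}")
else:
    print(f"{archive} not found; download it first")
```

Listing members reads only the tar headers, so it is much cheaper than extracting the larger archives in full.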
