Stockholm Umeå Corpus Version 2.0, SUC 2.0

This is the second edition of the Stockholm-Umeå Corpus (SUC 2.0). The SGML-annotated corpus is distributed in three formats. Version 2.0c has morphosyntactic descriptions in SUC format, while version 2.0d has them in PAROLE format. The third format has SUC morphosyntactic descriptions and elaborate structural markup, but its files lack bibliographic headers. These files are SGML-conformant but not TEI-conformant documents, and bibliographic information must be provided in a separate file.

The first edition of the complete Stockholm-Umeå Corpus (SUC 1.0) was distributed in 1997. A subset of the annotated SUC corpus of approximately 300 000 words (swe01), created October 31, 1992, was distributed in 1994 as part of the ACL European Corpus Initiative.

The SUC-format version of the corpus (2.0c) is 59690705 bytes in size. A punctuation token is defined as a token that is assigned part of speech F in PAROLE format, or MAD, MID or PAD in SUC format. A word is defined as a token that is not a punctuation token.
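
The word and punctuation definitions above can be illustrated with a minimal Python sketch. It is not part of the SUC distribution: the function names, the (token, tag) pair representation, and the sample sentence are hypothetical, and the PAROLE check assumes that the part of speech is the first character of the tag.

```python
# Illustrative only: these names are not part of the SUC distribution.
SUC_PUNCTUATION_TAGS = {"MAD", "MID", "PAD"}


def is_punctuation(tag, tag_format="suc"):
    """Return True if a part-of-speech tag marks a punctuation token."""
    if tag_format == "suc":
        return tag in SUC_PUNCTUATION_TAGS
    if tag_format == "parole":
        # Assumption: the part of speech is the first character of the tag.
        return tag.startswith("F")
    raise ValueError("unknown tag format: " + tag_format)


def count_words(tagged_tokens, tag_format="suc"):
    """Count the tokens that are not punctuation tokens."""
    return sum(
        1 for _token, tag in tagged_tokens
        if not is_punctuation(tag, tag_format)
    )


# Hypothetical sample sentence with SUC part-of-speech tags.
sample = [("Det", "PN"), ("regnar", "VB"), (".", "MAD")]
print(count_words(sample))
```

For the sample, the two non-punctuation tokens are counted as words and the final full stop is not.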

Manual of the Stockholm Umeå Corpus version 2.0

Description of the content of the SUC 2.0 distribution, including the unfinished documentation by Gunnel Källgren:

Tag formats explained:


© 2002 Dept of Linguistics, Stockholm University, and Dept of Linguistics, Umeå University

Responsibility Responsible
Principal Eva Ejerhed, Umeå University (UmU)
Funder HSFR (Swedish Council for Research in the Humanities and Social Sciences)
Funder STU/NUTEK (The Swedish National Board for Industrial and Technical Development)
Funder The Faculty of Humanities, Umeå University
Project management at SU and UmU respectively Gunnel Källgren, Eva Ejerhed
Compilation of corpus and text type taxonomy Gunnel Källgren
Creation of the SUC tagset for morphosyntactic descriptions Eva Ejerhed
Data acquisition and legal agreements Gunnel Källgren
English translation of legal agreements Teresa Bjelkhagen
Bibliography for SUC texts Britt Hartmann
Programming (SU) Gunnar Eriksson, Sune Magnberg
Selection of text samples and creation of raw text Gunnel Källgren, Britt Hartmann
Preprocessing texts for manual annotation, assigning lexical analyses to word tokens Eva Ejerhed and Magnus Åström in collaboration with Fred Karlsson, University of Helsinki
Programming library for SUC format in SUC 1.0, corpus production tools Magnus Åström
Manual annotation, morphosyntactic descriptions SU: Janne Lindberg, Cecilia Lyckow, Ulrika Kvist, Svensson-Lindberg
UmU: Joana Arnesson, Eva Ejerhed, Ola Wennstedt, Anna-Lena Wiklund
Post-processing manually annotated text Eva Ejerhed, Joana Arnesson, Anna-Lena Wiklund, Åström, Fredrick Backman, Rolf Sandberg
Preparing SUC 1.0 for distribution Eva Ejerhed, Magnus Åström, Fredrick Backman, Arnholm, in collaboration with Daniel Ridings and Pernilla Danielsson, Gothenburg University
Construction of the SGML (TEI) tag set for SUC 2.0 Gunnar Eriksson, Gunnel Källgren
DTD for manual SGML-markup of SUC 2.0 Gunnar Eriksson
Manual SGML markup Maria Arnstad, Harald Berthelsen, Christina Ericsson, Malin Ericson, Tove Gerholm, Sofia Gustafson-Capkova, Sara Rydin
Management of hard copies Britt Hartmann
Project management in Stockholm 1999-2001 Benny Brodda, Sofia Gustafson-Capkova
Artistic design of CD-cover for SUC 2.0 Ulrika Kvist Darnell
Preparation and compiling in Corpus Workbench Sofie Johansson Kokkinakis, Språkbanken, University of Gothenburg
Web-interface and search routines Torgny Rasmark
