SUC II - Description of the material

Stockholm Umeå Corpus Version 2.0, SUC 2.0

This is the second edition of the Stockholm-Umeå Corpus (SUC 2.0). The SGML-annotated corpus is distributed in three formats. Version 2.0c has morphosyntactic descriptions in SUC format, while version 2.0d has them in PAROLE format. The third format has SUC morphosyntactic descriptions and elaborate structural markup, but its files lack bibliographic headers. They are SGML-conformant but not TEI-conformant documents, and bibliographic information must be provided in a separate file.

The first edition of the complete Stockholm-Umeå Corpus (SUC 1.0) was distributed in 1997. A subset of the annotated SUC corpus of approximately 300 000 words (swe01), created on October 31, 1992, was distributed in 1994 as part of the ACL European Corpus Initiative.

The SUC-format version of the corpus (2.0c) is 59690705 bytes in size. A punctuation token is defined as a token that is assigned part of speech F in PAROLE format, or MAD, MID or PAD in SUC format. A word is defined as a token that is not a punctuation token.
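The word/punctuation definition above can be illustrated with a minimal Python sketch. It is not part of the SUC distribution. The function names and the (token, tag) pair layout are hypothetical, and the sketch assumes that the part of speech is the first whitespace-separated field of a SUC tag and the first character of a PAROLE tag.

```python
# Punctuation parts of speech, as given in the definition above.
SUC_PUNCT_POS = {"MAD", "MID", "PAD"}
PAROLE_PUNCT_POS = "F"


def is_punctuation(tag, tag_format="suc"):
    """Return True if the tag marks a punctuation token.

    Assumption: a SUC tag starts with the part of speech as its first
    whitespace-separated field; a PAROLE tag starts with a one-letter
    part of speech.
    """
    tag = tag.strip()
    if not tag:
        return False
    if tag_format == "suc":
        return tag.split()[0] in SUC_PUNCT_POS
    if tag_format == "parole":
        return tag[0] == PAROLE_PUNCT_POS
    raise ValueError("unknown tag format: %r" % tag_format)


def count_words(tagged_tokens, tag_format="suc"):
    """Count tokens that are words, i.e. not punctuation tokens."""
    return sum(
        1 for _token, tag in tagged_tokens
        if not is_punctuation(tag, tag_format)
    )
```

For example, calling `count_words` on a list of three (token, tag) pairs in which only the last carries the tag MAD counts the first two tokens as words.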

Manual of the Stockholm Umeå Corpus version 2.0

Description of the content of the SUC 2.0 distribution, including the unfinished documentation by Gunnel Källgren:
SUC2.0-manual.pdf

Tag formats explained:

PAROLE, SUC

Responsibility	Responsible
Principal	Eva Ejerhed, Umeå University (UmU)
Funder	HSFR (Swedish Council for Research in the Humanities and Social Sciences)
Funder	STU/NUTEK (The Swedish National Board for Industrial and Technical Development)
Funder	The Faculty of Humanities, Umeå University
Project management at SU and UmU respectively	Gunnel Källgren, Eva Ejerhed
Compilation of corpus and text type taxonomy	Gunnel Källgren
Creation of the SUC tagset for morphosyntactic descriptions	Eva Ejerhed
Data acquisition and legal agreements	Gunnel Källgren
English translation of legal agreements	Teresa Bjelkhagen
Bibliography for SUC texts	Britt Hartmann
Programming (SU)	Gunnar Eriksson, Sune Magnberg
Selection of text samples and creation of raw text	Gunnel Källgren, Britt Hartmann
Preprocessing texts for manual annotation, assigning lexical analyses to word tokens	Eva Ejerhed and Magnus Åström in collaboration with Fred Karlsson, University of Helsinki
Programming library for SUC format in SUC 1.0, corpus production tools	Magnus Åström
Manual annotation, morphosyntactic descriptions	SU: Janne Lindberg, Cecilia Lyckow, Ulrika Kvist, Svensson-Lindberg; UmU: Joana Arnesson, Eva Ejerhed, Ola Wennstedt, Anna-Lena Wiklund
Post-processing manually annotated text	Eva Ejerhed, Joana Arnesson, Anna-Lena Wiklund, Åström, Fredrick Backman, Rolf Sandberg
Preparing SUC 1.0 for distribution	Eva Ejerhed, Magnus Åström, Fredrick Backman, Arnholm, in collaboration with Daniel Ridings and Pernilla Danielsson, Gothenburg University
Construction of the SGML (TEI) tag set for SUC 2.0	Gunnar Eriksson, Gunnel Källgren
DTD for manual SGML-markup of SUC 2.0	Gunnar Eriksson
Manual SGML markup	Maria Arnstad, Harald Berthelsen, Christina Ericsson, Malin Ericson, Tove Gerholm, Sofia Gustafson-Capkova, Sara Rydin
Management of hard copies	Britt Hartmann
Project management in Stockholm 1999-2001	Benny Brodda, Sofia Gustafson-Capkova
Artistic design of CD-cover for SUC 2.0	Ulrika Kvist Darnell
Preparation and compiling in Corpus Workbench	Sofie Johansson Kokkinakis, Språkbanken, University of Gothenburg
Web-interface and search routines	Torgny Rasmark
