What are corpus sentence sets?

For many of Språkbanken's corpora, downloadable sentence sets are available. A sentence set is a collection of sentences from a corpus, automatically annotated with, e.g., part of speech and syntactic structure. For copyright reasons the sentences have been scrambled into a random order, so that the original texts cannot be recreated.

This is approximately what Sparv's "standard" XML format for sentence sets looks like:

<text>
 <paragraph>
   <sentence>
     <token _tail="\s">text</token>
   </sentence>
 </paragraph>
</text>

Notes:

  • Not all corpora have paragraphs.
  • Older corpora (those annotated with Sparv versions older than 4.0) have <w> tags instead of <token>, and these never contain _tail attributes.
  • The _tail attribute on <token> records the whitespace (spaces "\s", tabs "\t", or newlines "\n") that follows the token in the source material.
  • In addition to the tags and attributes shown above, the data can contain any number of additional tags and attributes, depending on what the input looked like and which annotations Sparv added.
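
To illustrate the notes above, here is a minimal Python sketch that reads a sentence set in this format and rebuilds the running text of each sentence. It is not part of Sparv; the names decode_tail, sentence_text and SAMPLE are made up for this example. It assumes that a _tail value is a sequence of the codes "\s", "\t" and "\n", that a <token> without _tail is followed by no whitespace, and that the <w> tokens of older corpora can simply be joined with single spaces.

```python
import re
import xml.etree.ElementTree as ET

# Whitespace codes mentioned in the notes; the mapping to real
# characters is an assumption made for this example.
TAIL_CODES = {"\\s": " ", "\\t": "\t", "\\n": "\n"}


def decode_tail(tail):
    """Turn an encoded _tail value (e.g. a backslash followed by 's') into whitespace."""
    return re.sub(r"\\[stn]", lambda m: TAIL_CODES[m.group(0)], tail)


def sentence_text(sentence):
    """Rebuild the surface text of one <sentence> element."""
    # Accept both <token> (Sparv 4.0 and later) and <w> (older corpora).
    tokens = [el for el in sentence.iter() if el.tag in ("token", "w")]
    has_tail = any("_tail" in el.attrib for el in tokens)
    parts = []
    for el in tokens:
        parts.append(el.text or "")
        if has_tail:
            # Assumption: a missing _tail means no whitespace follows.
            parts.append(decode_tail(el.get("_tail", "")))
        else:
            # Assumption: without _tail information, join tokens with a space.
            parts.append(" ")
    return "".join(parts).strip()


# Hypothetical sample data in the format shown above.
SAMPLE = r"""<text>
  <paragraph>
    <sentence>
      <token _tail="\s">Hello</token>
      <token>world</token>
      <token _tail="\n">!</token>
    </sentence>
  </paragraph>
</text>"""

root = ET.fromstring(SAMPLE)
# iter("sentence") also works for corpora that have no <paragraph> level.
for sentence in root.iter("sentence"):
    print(sentence_text(sentence))
```

Additional tags and attributes in the data are ignored by this sketch, since it only looks at <sentence>, <token> and <w> elements and the _tail attribute.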