This document describes the XML format you should use when importing text using the corpus pipeline.
The input text files must meet the following criteria:
<text>, containing nothing but raw text.
Any metadata about the text as a whole or parts of the text (e.g. chapters, sentences, words) must exist in the form of attributes to the elements containing the text. For example:
<text title="Title metadata here"> This is the corpus text. </text>
All text, except for element attributes, will be part of the corpus text, and will not be treated as metadata. The following example is not a valid way to set title metadata for a text, and the string "Title metadata here" will be treated as corpus text:
<text> <title>Title metadata here</title> This is the corpus text. </text>
One exception to the above is when there is a header in the file, such as:
<text> <header> <title>Title metadata here</title> </header> This is the corpus text. </text>
In this case it is possible to treat the whole
header element (not necessarily named
header) as not belonging to the text, and instead extract metadata from it. The
metadata will be tied to all text contained by the root element. In this case the root element is
text, and the result will be the same as in the first example.
The following is not a valid example where you can extract header metadata, due to the header being disconnected from the text:
<header> <title>Title metadata here</title> </header> <text> This is the corpus text. </text>
However, had both
text been contained within another element, header metadata extraction would have been possible.
When using the corpus pipeline you configure your Makefile to match your XML format, and tell it what elements and attributes to parse. The variables used for this is the following:
xml_elements you list (in lowercase, separated by whitespace) the elements of interest, optionally with their attributes (on the form
xml_annotations is a list of the same size, with every item corresponding to an item in
xml_elements. This list contains the names of
the annotation files that will be created during parsing. For example:
xml_elements = text:title text:author s xml_annotations = text.title text.author sentence
The above will parse elements such as
<text title="X" author="Y"> and save title and author as separate annotations. It will also parse the
<s>, not saving any attribute data but only the boundaries of the element, marking the beginning and end of every sentence (provided
that our example text has sentence segmentation marked with
In this way you can use existing annotations of a material during the import, rather than letting the pipeline create new annotations for you.
By saving the
<s> element as the annotation named
sentence, the pipeline will not try to create that annotation itself. There are several
hardcoded annotation names, all of which can be overridden like
Positional: token.pos, token.msd, token.baseform, token.lemgram, token.saldo, token.prefix, token.suffix Structural: token, sentence, paragraph
For positional annotations (annotations for single tokens), use the prefix "token." as shown above.
You can combine several elements into one annotation, if for example sentence segmentation is marked up with
<s> in some places and
<x> in some.
Combining them is done by simply joining both elements with a "+":
xml_elements = s+x xml_annotations = sentence
For files containing headers like in the previously shown example, three variables has to be set in the Makefile:
xml_header xml_headers xml_header_annotations
The first one should be set to the name of the header element, in lowercase. This tells the parser that this element is a header, and none of its text content will be part of the corpus text.
xml_header = header
Alternatively you can give the full path to the header, joining elements by ".":
xml_header = text.header
The second and third variable is similar to the
xml_annotations pair. The first one is a list of headers to parse,
and the second a list of corresponding annotation files that will be created:
xml_headers must contain the full path to the element in question, in lowercase. As with
xml_elements you can specify attributes using a colon, but it is
also possible to parse the text contained by the element, by using the keyword "TEXT" instead of an attribute:
xml_headers = text.header.title:TEXT xml_header_annotations = text.title
Any periods in element names have to be escaped using backslash.
The following is an example input file, with existing sentence segmentation and tokenization. Every token is annotated with its baseform.
<text title="Test"> <s> <w bf="it">It</w> <w bf="be">was</w> <w bf="an">an</w> <w bf="elephant">elephant</w> <w bf=".">.</w> </s> </text>
The configuration of the Makefile will look like this:
xml_elements = text:title s w w:bf xml_annotations = text.title sentence token token.baseform