We are committed to SGML and the Text Encoding Initiative [Sperberg-McQueen and Burnard 1994] for the annotation of our corpus. This decision was initially motivated by a polite wish to follow international standards, but is now based on the firm conviction that this is exactly the way we want to do it. We have made changes to the TEI, but they conform to the procedures outlined in Chapter 29 of the ``Green Book'' [Sperberg-McQueen and Burnard 1994, 737-744]. A consequence of our commitment to SGML and TEI is that we do not regard the mark-up in our corpus as merely providing a means for document exchange; it is the only format we use. There is, in other words, no difference between ``local'' format and ``exchange'' format.
SGML and TEI have often been criticized for being verbose and for causing extra, expensive effort on the part of those who acquire marked-up texts. Its verbosity, it has been argued, takes a lot of disk space and makes it difficult to work with texts on the screen. The same verbosity is also blamed for the expensive efforts some sites claim that they must make in order to get back to ``the original text.'' SGML-tagged material does take a lot more disk space: information does not come for free. Disk prices, however, have fallen drastically over the last 10 years. In a raw state the mark-up does look terrible on the screen, but this is the age of computers and we should let programs work with our data. When we do want to present it on the screen, all we have to do is get one of the many browsers on the market that understand the mark-up. The last point, that it is expensive to remove the mark-up, is probably a conclusion drawn by someone who has not really tried. The following one-line program in AWK will print only the text data from a text that has been tagged and parsed:
/^-/ {print}

Whether the result is the ``original text'' is a philosophical question rather than a technical one.
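For readers who prefer to see the same idea spelled out, here is a minimal Python sketch of the AWK filter. It assumes that the ``parsed'' form is the line-oriented ESIS output produced by parsers such as sgmls or nsgmls, in which character data lines begin with a hyphen; the function names and the sample input are hypothetical. Unlike the AWK one-liner, the sketch strips the leading hyphen and decodes the ESIS escape sequences.

```python
import re

# ESIS escape sequences inside data lines (assumed sgmls-style output):
# \\ (backslash), \n (record end), \| (SDATA delimiter), \nnn (octal code).
_ESCAPE = re.compile(r"\\(\\|n|\||[0-7]{3})")


def _unescape(data):
    """Decode the escape sequences of one ESIS data line."""
    def repl(match):
        token = match.group(1)
        if token == "\\":
            return "\\"
        if token == "n":
            return "\n"
        if token == "|":
            return ""  # drop SDATA delimiters, keep the enclosed text
        return chr(int(token, 8))
    return _ESCAPE.sub(repl, data)


def text_only(esis_lines):
    """Return the character data of every line that starts with '-'."""
    return [
        _unescape(line[1:].rstrip("\n"))
        for line in esis_lines
        if line.startswith("-")
    ]


# Hypothetical parser output for one paragraph.
sample = ["AID TOKEN MADSE.1", "(P", "-EUROPEISKA R\\305DET", ")P"]
print(" ".join(text_only(sample)))
```

Attribute lines (`A...`) and element boundaries (`(P`, `)P`) are simply skipped, which is all the AWK pattern does as well.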
The full details of our annotation will be dealt with in a future report. For the present we will gloss over the aspects which keep closely to the standard, since such details can be culled from the ``Green Book.'' This will not, however, prevent us from taking one or two detours to the dark side of the moon in an attempt to clarify why we feel so strongly about our choice. It should be said at this point that we have been strengthened in our convictions by the excellent package, consisting of an API and tools, produced by Henry Thompson, Steve Finch and David McKelvie [nsl0.8beta] for the MULTEXT project.
Our texts are stored as a TEI corpus. PEDANT contains first a corpus header describing our project, followed by the individual texts, each with their own header information. Conceptually PEDANT is one single document. The basic structure is the following:
<!DOCTYPE corpus.2 SYSTEM "mar2.dtd" [ ]>
<corpus.2>
  <teiHeader type=corpus> ... </teiHeader>
  <tei.2>
    <teiHeader type=text> ... </teiHeader>
    <text> ... </text>
  </tei.2>
  <tei.2>
    <teiHeader type=text> ... </teiHeader>
    <text> ... </text>
  </tei.2>
</corpus.2>
It should be noted that the corpus is the base for our work. All annotation, such as POS information, lemma information, etc., is entered into the corpus. Each text, each paragraph, each word and every token is given its own unique identification number. Those who find the clutter of SGML distressing will not (yet) be consoled by what they see. An example of a Swedish text that has been annotated down to the word level can be seen in figure 4.
<TEXT ID=MADSE LANG=SE>
<BODY>
<P ID=MADSE.1>
  <w id=w1 ana=a0pnsnd lemma=europeisk><c id=c1.t1>EUROPEISKA</c></w>
  <w id=w2 ana=ncnsn0d lemma='råd'><c id=c1.t2>RÅDET</c></w>
  <w id=w3 ana=sps lemma=i><c id=c1.t3>I</c></w>
  <w id=w4 ana=np00000 lemma=madrid><c id=c1.t4>MADRID</c></w>
</P>
<P ID=MADSE.2>
  <w id=w5 ana=mc0000 lemma=femton type=num><c id=c2.t1>15</c></w>
  <w id=w6 ana=cc lemma=och><c id=c2.t2>OCH</c></w>
  <w id=w7 ana=mc0000 lemma=sexton type=num><c id=c2.t3>16</c></w>
  <w id=w8 ana=np0000 lemma=december><c id=c2.t4>DECEMBER</c></w>
  <w id=w9 ana=mc0000 lemma=nittonhundranittiofem type=num><c id=c2.t5>1995</c></w>
</P>
. . .
<P ID=MADSE.6>
  <w id=w15 ana=sps lemma=vid><c id=c6.t1>Vid</c></w>
  <w id=w16 ana=pd2nsg0d0 lemma=sin><c id=c6.t2>sitt</c></w>
  <w id=w17 ana=ncnsn0i lemma='möte'><c id=c6.t3>möte</c></w>
  <w id=w18 ana=sps lemma=i><c id=c6.t4>i</c></w>
  <w id=w19 ana=np00000 lemma=madrid><c id=c6.t5>Madrid</c></w>
  <w id=w20 ana=df0us0 lemma=den><c id=c6.t6>den</c></w>
  <w id=w21 ana=mc0000 lemma=femton type=num><c id=c6.t7>15</c></w>
  <w id=w22 ana=cc lemma=och><c id=c6.t8>och</c></w>
  <w id=w23 ana=mc0000 lemma=sexton type=num><c id=c6.t9>16</c></w>
  <w id=w24 ana=np0000 lemma=december><c id=c6.t10>december</c></w>
  <w id=w25 ana=mc0000 lemma=nittonhundranittiofem type=num><c id=c6.t11>1995</c></w>
  <w id=w26 ana=va0p lemma=ha><c id=c6.t12>har</c></w>
  <w id=w27 ana=a0pnsnd lemma=europeisk><c id=c6.t13>Europeiska</c></w>
  <w id=w28 ana=ncnsn0d lemma='råd'><c id=c6.t14>rådet</c></w>
  <w id=w29 ana=vm0s lemma=anta><c id=c6.t15>antagit</c></w>
  <w id=w30 ana=ncnpn0i lemma=beslut><c id=c6.t16>beslut</c></w>
  <w id=w31 ana=sps lemma=om><c id=c6.t17>om</c></w>
  <w id=w32 ana=ncusn0d lemma='sysselsättning'><c id=c6.t18>sysselsättningen</c></w>
  <w id=w33 ana=fi type=punct><c id=c6.t19>,</c></w>
  <w id=w34 ana=df0us0 lemma=den><c id=c6.t20>den</c></w>
  <w id=w35 ana=a0p0pnd lemma=gemensam><c id=c6.t21>gemensamma</c></w>
  <w id=w36 ana=ncusn0d lemma=valuta><c id=c6.t22>valutan</c></w>
  <w id=w37 ana=fi type=punct><c id=c6.t23>,</c></w>

Figure 4: This is a sample of our base text, that is a text with high-density annotation at the word level.
In figure 4 one can see that the text has received identification number MADSE with an attribute identifying it as being a text in Swedish, LANG=SE. The paragraphs, words and tokens have also been assigned identification numbers and at times attributes, such as type=num or type=punct.
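To make the structure of the word-level annotation concrete, the following Python sketch pulls the identifier, morphosyntactic tag, lemma and token out of each <w> element. It is only an illustration: a real application would use an SGML parser or the NSL tools, whereas this sketch relies on regular expressions and assumes that every <w> contains exactly one <c>, as in the sample. The names `Word` and `parse_words` are hypothetical.

```python
import re
from typing import List, NamedTuple


class Word(NamedTuple):
    wid: str
    ana: str
    lemma: str
    token: str


# An attribute value may be quoted ('råd') or unquoted (europeisk).
ATTR = re.compile(r"""(\w+)=(?:'([^']*)'|"([^"]*)"|([^\s>]+))""")
# One <w> element wrapping one <c> element (assumption taken from the sample).
W_ELEM = re.compile(r"<w\b([^>]*)>\s*<c\b[^>]*>(.*?)</c>\s*</w>", re.I | re.S)


def parse_words(sgml: str) -> List[Word]:
    """Extract (id, ana, lemma, token) for every <w> element in the text."""
    words = []
    for attrs, token in W_ELEM.findall(sgml):
        values = {}
        for m in ATTR.finditer(attrs):
            # Exactly one of the three value groups matched.
            value = next(g for g in m.groups()[1:] if g is not None)
            values[m.group(1).lower()] = value
        words.append(Word(values.get("id", ""), values.get("ana", ""),
                          values.get("lemma", ""), token))
    return words


sample = (
    "<w id=w2 ana=ncnsn0d lemma='råd'><c id=c1.t2>RÅDET</c></w> "
    "<w id=w33 ana=fi type=punct><c id=c6.t19>,</c></w>"
)
for word in parse_words(sample):
    print(word)
```

Punctuation tokens carry no lemma attribute in the sample, so the sketch returns an empty string for them.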
The observant reader will notice that there are no ``s-units'' in the mark-up in figure 4, although our alignment program works with ``s-units'', not words. The mark-up could be added at this level, but we are not yet sure whether it should be. We are at this point following the data architecture as conceived by the MULTEXT project, where all low-density annotation is expressed indirectly in terms of links. Low-density annotation in our work is everything above the word level, with the exception of <p> elements. As mentioned above, in section 3.5, our alignment method is based on Gale and Church [GaleChurch93] and has little use for words (at this point). Therefore we have a separate annotation document for s-units. Such a document, pointing back into the same text as in figure 4, can be seen in figure 5.
<TEXT ID=PSE1 LANG=SE>
<BODY>
<P ID=MADSE.1><S TO='id (W4)' FROM='id (W1)' DOC=dd ID=MADSE.1.1></S></P>
<P ID=MADSE.2><S TO='id (W9)' FROM='id (W5)' ID=MADSE.2.1></S></P>
<P ID=MADSE.3><S TO='id (W11)' FROM='id (W10)' ID=MADSE.3.1></S></P>
<P ID=MADSE.4><S TO='id (W13)' FROM='id (W12)' ID=MADSE.4.1></S></P>
<P ID=MADSE.5><S FROM='id (W14)' ID=MADSE.5.1></S></P>
<P ID=MADSE.6><S TO='id (W47)' FROM='id (W15)' ID=MADSE.6.1></S></P>

Figure 5: The same text that is found in figure 4. The sentences contain hyperlinks back to the individual words.
The <text>, <body> and <p> elements are the same in both figures. The DOC attribute on the <s> element in the first <p> element names an SGML entity that points to the file containing the data in figure 4. The first <s> element is made up of the words from id=w1 to id=w4. The second <s> element consists of the words with ids from w5 to w9, and so on. We are presently working on adding one more attribute to each sentence, n=x, where x is the length of the sentence. When that has been done, the document illustrated in figure 5 will be all that is needed for the initial alignment at the ``sentence'' level. The real advantage of this is that we will end up with an alignment program that understands SGML and can consequently be applied even outside our local environment by others who follow the same standard.
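The way an <s> element addresses a span of words can be sketched in a few lines of Python. The sketch reads the FROM and TO links of each <s> element and returns the first and last word number together with the number of words covered. Two assumptions are made that the text does not state: word ids are taken to be the letter W followed by consecutive integers, and the span size is counted in words, whereas the planned n attribute may well count something else (for instance characters). The function name `sentence_spans` is hypothetical.

```python
import re

S_ELEM = re.compile(r"<S\b([^>]*)>", re.I)
LINK = re.compile(r"""\b(FROM|TO)=['"]id \((\w+)\)['"]""", re.I)
S_ID = re.compile(r"\bID=([^\s>]+)", re.I)


def sentence_spans(link_doc):
    """Map each s-unit id to (first word no., last word no., word count)."""
    spans = {}
    for attrs in S_ELEM.findall(link_doc):
        links = {name.upper(): target for name, target in LINK.findall(attrs)}
        sid = S_ID.search(attrs)
        if sid is None or "FROM" not in links:
            continue  # not a linking <s> element
        first = int(links["FROM"][1:])  # strip the leading 'W'
        # An <s> without TO covers a single word.
        last = int(links.get("TO", links["FROM"])[1:])
        spans[sid.group(1)] = (first, last, last - first + 1)
    return spans


sample = (
    "<P ID=MADSE.1><S TO='id (W4)' FROM='id (W1)' DOC=dd ID=MADSE.1.1></S></P>"
    "<P ID=MADSE.5><S FROM='id (W14)' ID=MADSE.5.1></S></P>"
)
print(sentence_spans(sample))
```

A character-based length could be obtained in the same way by summing the token lengths of the words in each span once the links have been resolved against the base document.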
It should be added that the text illustrated in figure 5 can easily be expanded into a document containing all the mark-up and data, resolving the hyperlinks in each <s> element so that the actual words are included, by running it through one of the programs, nsgml, developed by the MULTEXT project [Thompson et al. 1995, 9-10]. To illustrate this point we will use the next level of annotation, that of translation pairs.
We take this method a step further. We do not want to include alignment information in our base corpus document. Using the same technique as described concerning sentences we create another document consisting of nested <SEG> elements. The ``outer'' <SEG> element contains an alignment pair, two ``inner'' <SEG>'s, one for each language. These ``inner'' elements each contain hyperlinks pointing back to the sentences described in figure 5. The alignment information for the same section of text dealt with in this paper can be seen in figure 6.
<P ID=MADSE.1>
  <SEG ID=1 DOC=se>
    <SEG FROM='id (MADSE.1.1)' DOC=se ID=Se1></SEG>
    <SEG FROM='id (MADEN.1.1)' DOC=en ID=En1></SEG>
  </SEG>
</P>
<P ID=MADSE.2>
  <SEG ID=2 DOC=se>
    <SEG FROM='id (MADSE.2.1)' DOC=se ID=Se2></SEG>
    <SEG FROM='id (MADEN.2.1)' DOC=en ID=En2></SEG>
  </SEG>
</P>
<P ID=MADSE.3>
  <SEG ID=3 DOC=se>
    <SEG FROM='id (MADSE.3.1)' DOC=se ID=Se3></SEG>
    <SEG FROM='id (MADEN.3.1)' DOC=en ID=En3></SEG>
  </SEG>
</P>
. . .
<P ID=MADSE.237>
  <SEG ID=306 DOC=se>
    <SEG FROM='id (MADSE.237.1)' DOC=se ID=Se306></SEG>
    <SEG FROM='id (MADEN.237.1)' TO='id (MADEN.237.2)' DOC=en ID=En306></SEG>
  </SEG>
</P>
. . .
<P ID=MADSE.249>
  <SEG ID=321 DOC=se>
    <SEG FROM='id (MADSE.249.1)' TO='id (MADSE.249.2)' DOC=se ID=Se321></SEG>
    <SEG FROM='id (MADEN.249.1)' DOC=en ID=En321></SEG>
  </SEG>
</P>

Figure 6: This is an alignment of the Swedish and English declarations from the Conference of Ministers held in Madrid in December 1995. The excerpts with the <p> ids MADSE.237 and MADSE.249 illustrate 1-2 and 2-1 alignments respectively.
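The alignment type of each pair can be read directly off the links, without resolving them. The Python sketch below is an illustration under stated assumptions: inner <SEG> elements are recognised by their FROM attribute, they are assumed to occur in Swedish/English order, and a FROM/TO range is assumed to stay within one paragraph so that the final number of the s-unit id can be used for counting. The function names are hypothetical.

```python
import re

# Inner <SEG> elements are the ones that carry a FROM link.
INNER_SEG = re.compile(r"<SEG\b([^>]*\bFROM=[^>]*)>", re.I)
REF = re.compile(r"""\b(FROM|TO)=['"]id \(([\w.]+)\)['"]""", re.I)


def seg_size(attrs):
    """Number of s-units covered by one inner <SEG> element."""
    refs = {name.upper(): target for name, target in REF.findall(attrs)}
    start = refs["FROM"]
    end = refs.get("TO", start)  # no TO means a single s-unit
    # The last component of an id such as MADEN.237.2 is the s-unit number.
    return int(end.rsplit(".", 1)[1]) - int(start.rsplit(".", 1)[1]) + 1


def alignment_types(align_doc):
    """Return the alignment type of each pair, e.g. ['1-1', '1-2']."""
    sizes = [seg_size(attrs) for attrs in INNER_SEG.findall(align_doc)]
    # Inner segments come in Swedish/English pairs (assumption).
    return ["%d-%d" % pair for pair in zip(sizes[0::2], sizes[1::2])]


sample = (
    "<SEG ID=306 DOC=se>"
    "<SEG FROM='id (MADSE.237.1)' DOC=se ID=Se306></SEG>"
    "<SEG FROM='id (MADEN.237.1)' TO='id (MADEN.237.2)' DOC=en ID=En306></SEG>"
    "</SEG>"
    "<SEG ID=321 DOC=se>"
    "<SEG FROM='id (MADSE.249.1)' TO='id (MADSE.249.2)' DOC=se ID=Se321></SEG>"
    "<SEG FROM='id (MADEN.249.1)' DOC=en ID=En321></SEG>"
    "</SEG>"
)
print(alignment_types(sample))
```

The outer <SEG> elements have no FROM attribute and are therefore skipped by the pattern.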
Once again, it must be stressed that the hyperlinks in figure 6 can easily be expanded to contain the sentences, which in turn contain the words with all the rich attributes, such as POS, lemma, etc., that are found in the base corpus. This is not the place to go into great detail, but all this information can be searched and then processed by tools delivered with NSL or by tools developed with the NSL API. The expanded versions need never be saved, since they can be created on the fly. It is enough to pipe the expansions into other tools and save only the desired results to disk.
For the sake of illustration, if the links in figure 6 point to a document encoded down to the level of s-unit only, the resolution of paragraph MADSE.237 can be seen in figure 7.
<P ID=MADSE.237>
  <SEG ID=S306>
    <SEG ID=SE306>
      <S ID=MADSE.237.1>Europeiska rådet framhåller det nära förestående undertecknandet av den gemensamma deklarationen om den politiska dialogen mellan Europeiska unionen och Chile, vilken utgör ett viktigt steg mot snabba förhandlingar om ett nytt avtal som i slutändan syftar till en association av politisk och ekonomisk art.</S>
    </SEG>
    <SEG ID=EN306>
      <S ID=MADEN.237.1>It emphasizes that the Joint Declaration on Political Dialogue between the European Union and Chile is to be signed shortly.</S>
      <S ID=MADEN.237.2>This marks an important step towards the early negotiation of a new agreement directed ultimately at political and economic association.</S>
    </SEG>
  </SEG>
</P>

Figure 7: This is an expansion of a segment in figure 6 demonstrating a 1-2 relationship.
The linking is done by an external entity named in the DOC= attribute:
<!ENTITY se SYSTEM "/Pedant/sgml/test/pedant.nsg" CDATA SGML>

The document pointed to is our base document. All other annotation is layered onto it. So if we are interested in POS information or doing lemma searches we would change the external entity to:
<!ENTITY se SYSTEM "/Pedant/sgml/test/pedant.pos.nsg" CDATA SGML>

We are now pointing at a document with high-density annotation, but containing the same textual information. The resolved links for the paragraph <P ID=MADSE.1> would appear as follows:
<P ID=MADSE.1>
  <SEG ID=S1>
    <SEG ID=SE1>
      <S ID=MADSE.1.1>
        <W ID=W1 ANA=a0pnsnd LEMMA=europeisk><C ID=C1.T1>EUROPEISKA</C></W>
        <W ID=W2 ANA=ncnsn0d LEMMA='råd'><C ID=C1.T2>RÅDET</C></W>
        <W ID=W3 ANA=sps LEMMA=i><C ID=C1.T3>I</C></W>
        <W ID=W4 ANA=np00000 LEMMA=madrid><C ID=C1.T4>MADRID</C></W>
      </S>
    </SEG>
    <SEG ID=EN1>
      <S ID=MADEN.1.1>
        <C ID=C379.T1>MADRID</C>
        <C ID=C379.T2>EUROPEAN</C>
        <C ID=C379.T3>COUNCIL</C>
      </S>
    </SEG>
  </SEG>
</P>

Figure 8: This is an expansion of the first s-unit in figure 6.
The query routines are described by Thompson, Finch and McKelvie [nsl0.8beta, 20-24]. They are aware of the tree structure of the SGML document. This is more easily seen by example. If one were interested in finding all sentences in which the lemma ``europeisk'' occurs, the query would look as follows:
$ sggrep ".*/TEXT/BODY/P/S" "S/W[LEMMA=europeisk]/C" "" <aligned-links.sgm

That means ``search from the top of the SGML-tree, down to TEXT, BODY, P, S and display all S-elements that contain words with the lemma europeisk.'' Instead of LEMMA= we could have searched for ANA= in order to find all sentences, and their translation pairs, that contain a specified morphosyntactic tag, for example:
$ sggrep ".*/TEXT/BODY/P/S" "S/W[ANA=np00000]/C" "" <aligned-links.sgm

This search, which is not particularly useful, will find and present all sentences, together with their translation pairs, that contain proper nouns. The results of these searches are in normalized SGML (nSGML) and can be piped into other nSGML-aware tools that have been developed using the API provided by the MULTEXT project.
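The effect of such a query can be imitated on a small scale with the following Python sketch. It is not a reimplementation of sggrep and does not use the NSL API: it merely scans an already expanded document, such as the one in figure 8, for <S> elements that contain a <W> element with a given attribute value. It returns the matching sentences only, not their translation pairs. The function name `sentences_with` is hypothetical.

```python
import re

# One <S> element with its content; <SEG ...> does not match because of \b.
S_BLOCK = re.compile(r"<S\b[^>]*>.*?</S>", re.I | re.S)


def sentences_with(expanded, attr, value):
    """Return every <S> element containing a <W> whose attr equals value."""
    # Tag and attribute names are matched case-insensitively, the value is not.
    word = re.compile(
        r"(?i:<W\b[^>]*\b%s=)['\"]?%s['\"]?(?=[\s>])"
        % (re.escape(attr), re.escape(value))
    )
    return [s for s in S_BLOCK.findall(expanded) if word.search(s)]


sample = (
    "<SEG ID=SE1><S ID=MADSE.1.1>"
    "<W ID=W1 ANA=a0pnsnd LEMMA=europeisk><C ID=C1.T1>EUROPEISKA</C></W>"
    "<W ID=W4 ANA=np00000 LEMMA=madrid><C ID=C1.T4>MADRID</C></W>"
    "</S></SEG>"
    "<SEG ID=EN1><S ID=MADEN.1.1><C ID=C379.T1>MADRID</C></S></SEG>"
)
for hit in sentences_with(sample, "LEMMA", "europeisk"):
    print(hit)
```

Calling `sentences_with(sample, "ANA", "np00000")` corresponds to the second query above, restricted in the same way to the sentences themselves.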