QPL: Querying linguistic corpora with Prolog

Gerlof Bouma, University of Gothenburg

This page is a work in progress, but it will contain Prolog code and conversion scripts for querying linguistically annotated corpora. The general ideas are explained and illustrated in a pair of papers, Bouma (2010a, LAW) and Bouma (2010b, KONVENS). For now, the code is organized by the corpus it is meant to be used with, although the packages overlap considerably.

Note that the corpora themselves are (generally) not included: you will have to acquire those yourself and convert them to Prolog with the conversion scripts provided. Also, the size and type of annotated data mentioned below refer to the data made available for querying with Prolog.
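The conversion scripts themselves are not reproduced on this page. As a rough illustration of what converting a treebank to Prolog can involve, the following Python sketch flattens one parse tree into Prolog facts. Everything in it is an assumption made for illustration: the nested-tuple input format, the function names `prolog_atom` and `tree_to_facts`, and the `node/3`, `word/4` and `edge/4` predicates are hypothetical and are not taken from the QPL code.

```python
# Hypothetical sketch: the input format and the predicate names below are
# illustrative assumptions, not the layout used by the QPL conversion scripts.

def prolog_atom(text):
    """Quote text as a Prolog atom, escaping backslashes and single quotes."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def tree_to_facts(sent_id, tree):
    """Flatten a nested tree into a list of Prolog fact strings.

    A terminal is (pos_tag, word); a non-terminal is
    (category, [(edge_label, child), ...]).
    """
    facts = []
    counter = 0

    def visit(node):
        nonlocal counter
        node_id = counter          # number nodes in depth-first order
        counter += 1
        label, content = node
        if isinstance(content, str):
            # Terminal: record the word form and its part-of-speech tag.
            facts.append(
                f"word({sent_id}, {node_id}, "
                f"{prolog_atom(content)}, {prolog_atom(label)})."
            )
        else:
            # Non-terminal: record the category, then one labelled edge per child.
            facts.append(f"node({sent_id}, {node_id}, {prolog_atom(label)}).")
            for edge_label, child in content:
                child_id = visit(child)
                facts.append(
                    f"edge({sent_id}, {node_id}, {child_id}, "
                    f"{prolog_atom(edge_label)})."
                )
        return node_id

    visit(tree)
    return facts


# A two-word toy sentence with subject (SB) and head (HD) edges.
example = ("S", [("SB", ("PPER", "Sie")), ("HD", ("VVFIN", "kommt"))])
for fact in tree_to_facts(1, example):
    print(fact)
```

Written to a file, facts of this kind could be loaded into a Prolog system and queried with ordinary goals. The fact layout actually used by the code on this page may well differ.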

There is also the beginning of a bibliography of papers using Prolog in similar ways.

Code

Spoken Dutch Corpus CGN

Language: Dutch
Mod./genre: Spoken, mixed genres
Size: ∼1 mln tokens in 130k segments with syntactic annotation in v2.0
Annotation: Syntax (discontinuous phrase structure and edge labels).
Distribution: at Centrale voor Taal- en Spraaktechnologie.
Note: Includes code to (almost) re-create the data used in Bouma (2008).
Code: [broken link] [as gzipped tar archive]

TIGER Corpus

Language: German
Mod./genre: Newspaper text
Size: 50k sentences in v2.1
Annotation: Syntax (discontinuous phrase structure and edge labels).
Distribution: at IMS/Uni Stuttgart.
Code: [as gzipped tar archive]

Tüba-D/Z Corpus

Language: German
Mod./genre: Newspaper text
Size: 45k sentences in v5
Annotation: Syntax (topological fields, phrase structure and edge labels); anaphora.
Distribution: at Uni Tübingen.
Note: This is the code used in Bouma (2010a, 2010b).
Code: [as gzipped tar archive]

Four languages from Europarl v3

Language: Dutch, English, German & Swedish
Mod./genre: (Translated?) Minutes of parliamentary sessions.
Size: Up to 1.5 mln sentences per language.
Annotation: Syntax (dependency structure).
Distribution: at statmt.org for the underlying corpus. The re-tokenization used here is available from my site.
Note: This is the corpus and code used in Bouma et al. (2010). The re-tokenized and parsed corpus itself is also available here.
Code & data: [broken link] [as bzip2-ed tar archive, 800 MB]

Bibliography

I'm also compiling a bibliography of papers that describe, or significantly mention the use of, Prolog as a corpus querying and transformation tool.

Bouma, Gerlof. 2008. Starting a sentence in Dutch: A corpus study of subject- and object-fronting. Groningen Dissertations in Linguistics 66, Center for Language and Cognition, University of Groningen.

Bouma, Gerlof. 2010a. Syntactic tree queries in Prolog. In: Proceedings of the Fourth Linguistic Annotation Workshop, ACL 2010, pp212–216, Uppsala. [pdf in ACL anthology]

Bouma, Gerlof. 2010b. Querying Linguistic Corpora with Prolog. In: Pinkal, Rehbein, Schulte im Walde & Storrer (eds), Semantic Approaches in Natural Language Processing: Proceedings of the Conference on Natural Language Processing 2010 (KONVENS 2010), Saarbrücken, Universaar. [final draft with a slightly spacier layout]

There's a small bug in the anaphora annotation transformation code in the paper. It is fixed in the online code.

Bouma, Gerlof, Lilja Øvrelid & Jonas Kuhn. 2010. Towards a Large Parallel Corpus of Cleft Constructions. In: Proceedings of LREC 2010, pp3585–3592, Malta. [abstract & link to pdf]

Lally, Adam and Paul Fodor. 2011. Natural Language Processing With Prolog in the IBM Watson System. ALP Newsletter, March 2011.

See the notes at the bottom -- a longer paper is supposedly in the works.

Schneiker, Christian, Dietmar Seipel & Werner Wegstein. 2009. Schema and Variation: Digitizing Printed Dictionaries. In: Proceedings of the Third Linguistic Annotation Workshop, ACL-IJCNLP 2009, pp82–89, Singapore. [pdf in ACL anthology]

Seeker, Wolfgang & Jonas Kuhn. 2012. Making Ellipses Explicit in Dependency Conversion for a German Treebank. In: Proceedings of the 8th International Conference on Language Resources and Evaluation, pp3132–3139, Istanbul, Turkey. [pdf at LREC]

Witt, Andreas. 2005. Multiple hierarchies: New aspects of an old solution. In: Dipper, Götze & Stede (eds), Heterogeneity in Focus: Creating and Using Linguistic Databases (ISIS 2), pp55–86, Potsdam: Universitätsverlag Potsdam. [pdf of ISIS 2]

Zarrieß, Sina, Aoife Cahill & Jonas Kuhn. 2012. To what extent does sentence-internal realisation reflect discourse context? A study on word order. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012). [pdf in ACL anthology]

© Göteborgs universitet 2009, Box 100, 405 30 Göteborg