Grammar Data Mining

Automatic Extraction of Linguistic Features from Descriptive Grammars

Canceled

The workshop has been canceled and will be rescheduled in the near future.

Description

This Workshop/Shared Task seeks to transform a large set of digitized publications describing the grammars of the world's languages into structured databases that will enable comparison of different languages at unprecedented breadth and depth. There are some 6,500 languages in the world, and information about their grammatical characteristics is available in book form for over 4,000 of them. Until recently, information has been extracted from grammars exclusively by hand. Manual collection is bounded by human capacity and can therefore cover only a relatively small number of languages and characteristics, at a substantial cost in time.

It is now becoming practical to use NLP tools for tasks of this kind. At a minimum, a computer may infer some characteristics of the language described simply by counting the words used in a grammatical description: for example, a high frequency of the term 'suffix' likely indicates that the language being described uses many suffixes. Beyond that, there are less straightforward or more detailed characteristics traditionally of interest to linguists, such as where the verb is placed in the sentence (beginning, middle, end), the existence and use of participles, possessive constructions, evidentiality and so on. Any techniques from the NLP toolbox, such as tf-idf weighting, tagging, parsing and vector spaces, may be used in combination and as input to more sophisticated machine learning approaches.

In this shared task we provide a subset of the World Atlas of Language Structures (WALS, http://wals.info) along with the digitized sources from which the features were drawn. Sources are provided in raw text form. The task is to infer WALS datapoints from the raw text of the digitized grammatical descriptions.
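The word-counting idea mentioned above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names `term_counts` and `guess_affix_position` are hypothetical, the sample sentence is invented, and a raw count comparison is a crude stand-in for whatever weighting a real system would use.

```python
import re
from collections import Counter


def term_counts(text, terms):
    """Count occurrences of each term, including its '-es' plural (suffix/suffixes)."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    # Counter returns 0 for tokens that never occur, so missing terms are safe.
    return {term: tokens[term] + tokens[term + "es"] for term in terms}


def guess_affix_position(text):
    """Return 'prefix' or 'suffix', whichever is mentioned more often (hypothetical heuristic)."""
    counts = term_counts(text, ["prefix", "suffix"])
    if counts["prefix"] == counts["suffix"]:
        return "undecided"
    return max(counts, key=counts.get)


# Invented sample text standing in for the raw text of a grammar.
sample = (
    "Possession is marked by a suffix. "
    "Verbal suffixes mark tense; one prefix marks negation."
)
print(term_counts(sample, ["prefix", "suffix"]))
print(guess_affix_position(sample))
```

On OCR text of varying quality, counts like these are noisy, so they are probably better treated as input features for a classifier than as final answers.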

Task

The task is to provide the Value for an unseen Language-Feature-Source triple. No language-specific data source external to the training data (such as the classification of a language or other sources for a language) may be used. However, other open, generic linguistic data sources (such as the raw text of the corresponding WALS chapter or a list of linguistic terms) may be used.

Data

10,000 datapoints spanning 191 languages and 100 features, along with their values and source(s), are given as training data in the following form:

Languages	ISO 639-3	Feature	Value	Source
Macushi	mbc	31A Sex-based and Non-sex-based Gender Systems	Non-sex-based	Abbott-1991[105-106]
Macushi	mbc	57A Position of Pronominal Possessive Affixes	Possessive prefixes	Abbott-1991[85,101]; Williams-1932[61]; Carson-1982[104-106]
East Oromo	hae	118A Predicative Adjectives	Mixed	Owens-1985
East Oromo	hae	9A The Velar Nasal	No velar nasal	Owens-1985[10]

Features and values are defined as per WALS (http://wals.info). Sources are semicolon-separated and optionally indicate pages or page ranges in square brackets. Each source maps uniquely to an entry with bibliographical details in a BibTeX file and to a full text of the source in question. The full text is an OCR of a scan of the original source (of varying quality) and contains no formatting. OCR errors are present, especially for IPA or non-ASCII-script text in a vernacular. A total of 443 source texts are supplied. The training data is available for download (http://stp.lingfil.uu.se/~harald/grammar-data-mining.zip).
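A training row in the form shown above could be parsed roughly as follows. This is a sketch that assumes the five columns are tab-separated, as in the sample table; the actual file layout inside the archive is not described here, and the helper names (`parse_pages`, `parse_sources`, `parse_row`) are hypothetical. Page numbers are kept as strings, since it is not stated whether they are always Arabic numerals.

```python
import re

# A source key optionally followed by a page specification in square brackets.
SOURCE_RE = re.compile(r"^([^\[\]]+?)\s*(?:\[([^\]]*)\])?$")


def parse_pages(spec):
    """Turn '85,101' or '104-106' into a list of (start, end) string pairs."""
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start, _, end = part.partition("-")
        ranges.append((start.strip(), (end or start).strip()))
    return ranges


def parse_sources(field):
    """Split a semicolon-separated Source field into (key, page_ranges) pairs."""
    sources = []
    for item in field.split(";"):
        item = item.strip()
        match = SOURCE_RE.match(item) if item else None
        if match is None:
            continue
        key, pages = match.group(1), match.group(2)
        sources.append((key, parse_pages(pages) if pages else []))
    return sources


def parse_row(line):
    """Parse one row, assuming exactly five tab-separated columns."""
    language, iso, feature, value, source = line.rstrip("\n").split("\t")
    return {
        "language": language,
        "iso": iso,
        "feature": feature,
        "value": value,
        "sources": parse_sources(source),
    }


row = parse_row(
    "Macushi\tmbc\t57A Position of Pronominal Possessive Affixes\t"
    "Possessive prefixes\tAbbott-1991[85,101]; Williams-1932[61]; Carson-1982[104-106]"
)
print(row["sources"])
```

The page ranges, where present, may help narrow the search to the relevant passage, although mapping printed page numbers onto unformatted OCR text is a separate problem that this sketch does not address.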

Submission Instructions and Dates

Authors should submit a paper of up to 8 pages conforming to the RANLP style guidelines (see http://lml.bas.bg/ranlp2019/submissions.php) describing their technical solution to the task. The submission should contain a link to a runnable version of the authors' solution (e.g. on github.com). Given a language code, the feature of interest, and the source document, the system should output the feature Value (and nothing else), as exemplified below:

>>>python grammar-data-mining.py "hae" "118A Predicative Adjectives" "Owens-1985"
Mixed
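A minimal skeleton for such a command-line entry point might look like the sketch below. It is a placeholder under stated assumptions, not a proposed solution: the `predict` function and the `TRAINING_ROWS` constant are hypothetical, the embedded rows are just the four sample rows from the Data section, and the prediction is a majority-class guess that ignores the source text entirely.

```python
import sys
from collections import Counter

# Assumption: the four sample rows from the Data section stand in for the
# full training file, which a real system would load instead.
TRAINING_ROWS = [
    ("mbc", "31A Sex-based and Non-sex-based Gender Systems", "Non-sex-based"),
    ("mbc", "57A Position of Pronominal Possessive Affixes", "Possessive prefixes"),
    ("hae", "118A Predicative Adjectives", "Mixed"),
    ("hae", "9A The Velar Nasal", "No velar nasal"),
]


def predict(iso, feature, source, rows=TRAINING_ROWS):
    """Return the most frequent training value for the feature.

    iso and source are accepted to match the required interface but are
    ignored by this placeholder; a real system would read the source text.
    """
    values = Counter(value for _, feat, value in rows if feat == feature)
    if not values:
        # Placeholder for features absent from the rows; not a WALS value.
        return "unknown"
    return values.most_common(1)[0][0]


# Print the Value and nothing else when called with exactly three arguments.
if __name__ == "__main__" and len(sys.argv) == 4:
    print(predict(sys.argv[1], sys.argv[2], sys.argv[3]))
```

Saved as grammar-data-mining.py and invoked as in the example above, this prints a single Value; replacing the body of `predict` with an actual model leaves the interface unchanged.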

Submission is electronic, using the Softconf submission system for the Grammar Data Mining Workshop at https://www.softconf.com/ranlp2019/GDM/
Papers must be written in English.
Submitted papers will be peer-reviewed by three experts from a related field.
At least one author of each accepted paper is required to register for the RANLP 2019 conference, attend the workshop, and present the paper.

Paper Submission	Date
Workshop paper submission deadline:	28 July 2019
Workshop paper acceptance notification:	7 August 2019
Workshop paper camera-ready version:	20 August 2019
Workshop:	5-6 September 2019

Evaluation

Each submission will be evaluated against a test set of 1,000 random datapoints drawn from the same source as the training data set. Other aspects (such as running time) will not be evaluated.
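The scoring metric is not specified here. For local development, participants may wish to hold out part of the training data and score their predictions against it; the sketch below assumes plain accuracy with exact string matching, which is an assumption and not necessarily the official measure. The function name `accuracy` and the example lists are illustrative only.

```python
def accuracy(predicted, gold):
    """Fraction of predictions that exactly match the gold Value (after trimming whitespace)."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p.strip() == g.strip() for p, g in zip(predicted, gold))
    return correct / len(gold)


# Illustrative held-out values and predictions; one of the four is wrong.
gold = ["Mixed", "No velar nasal", "Possessive prefixes", "Non-sex-based"]
predicted = ["Mixed", "No velar nasal", "Possessive suffixes", "Non-sex-based"]
print(accuracy(predicted, gold))  # 0.75
```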

Program Committee

Erich Round, Senior Lecturer, School of Languages and Cultures, The University of Queensland, Australia
Guillaume Segerer, Researcher in Linguistics (CNRS, LLACAN), France
Harald Hammarström, Department of Linguistics and Philology, Uppsala University, Sweden
Markus Forsberg, Researcher, Språkbanken, University of Gothenburg, Sweden
Sebastian Nordhoff, Language Science Press, Unter den Linden 6, 10099 Berlin, Germany
Søren Wichmann, Researcher, Leiden University Centre for Linguistics, Netherlands
Shafqat Mumtaz Virk, Researcher, Språkbanken, University of Gothenburg, Sweden
Zeljko Agic, Associate Professor in Computer Science, IT University of Copenhagen, Denmark

Organizing Committee

Name	Link	Email
Harald Hammarström	http://stp.lingfil.uu.se/~harald/	harald.hammarstrom@lingfil.uu.se
Markus Forsberg	https://spraakbanken.gu.se/swe/personal/markus	markus.forsberg@svenska.gu.se
Shafqat Mumtaz Virk	https://spraakbanken.gu.se/swe/personal/shafqat	shafqat.virk@svenska.gu.se
Søren Wichmann	https://soerenwichmann.com/	wichmannsoeren@gmail.com
Guillaume Segerer	http://guillaumesegerer.fr	guillaume.segerer@cnrs.fr
Erich Round	https://languages-cultures.uq.edu.au/profile/1160/erich-round	e.round@uq.edu.au
Sebastian Nordhoff	http://langsci-press.org/about	sebastian.nordhoff@langsci-press.org
