Canceled:

The workshop has been canceled and will be rescheduled in the near future.

Description

This Workshop/Shared Task seeks to transform a large set of digitized publications describing the grammars of the world's languages into structured databases that will enable comparison of languages at an unprecedented breadth and depth. There are some 6,500 languages in the world, and information about their grammatical characteristics is available in book form for over 4,000 of them. Until recently, information has been extracted from grammars exclusively by hand. Manual collection is bounded by the limits of human capacity: it requires a substantial time investment and can therefore cover only a relatively small number of languages and characteristics.

We are now entering a phase where it is practical to use NLP tools for tasks of this kind. At a minimum, a computer may infer some characteristics of the language described simply by counting the words used in a grammatical description; e.g., a high frequency of the term 'suffix' likely indicates that the language being described uses a lot of suffixes. Beyond this, there are less straightforward or more detailed characteristics traditionally of interest to linguists, such as where the verb is placed in the sentence (beginning, middle, end), the existence and use of participles, possessive constructions, evidentiality and so on. Any techniques from the NLP toolbox, such as tf-idf weighting, tagging, parsing and vector spaces, may be used in combination and as input to more sophisticated Machine Learning approaches.

In this shared task we provide a subset of the World Atlas of Language Structures (WALS, http://wals.info) along with the digitized sources from which the features were drawn. Sources are provided in raw text form. The task is to infer WALS datapoints from the raw text data of the digitized grammatical descriptions.
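
The word-counting idea mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `stem_rate`, the sample text, and the decision to count every token that begins with a given stem (so that 'suffixes' counts towards 'suffix') are hypothetical choices, not part of the task definition.

```python
import re

def stem_rate(text, stems):
    """Return, for each stem, the number of tokens starting with it per 1,000 tokens."""
    # Crude tokenisation: lower-case runs of ASCII letters only (an assumption;
    # OCR noise and non-ASCII material are simply ignored here).
    tokens = re.findall(r"[a-z]+", text.lower())
    total = max(len(tokens), 1)  # avoid division by zero on empty input
    return {
        stem: 1000 * sum(tok.startswith(stem) for tok in tokens) / total
        for stem in stems
    }

# Hypothetical snippet standing in for the raw text of a grammar.
sample = (
    "Suffixes mark tense and number. The plural suffix is -ta. "
    "A single prefix marks negation."
)
rates = stem_rate(sample, ["suffix", "prefix"])

# A higher rate for 'suffix' than for 'prefix' is weak evidence that the
# language described is predominantly suffixing.
print(rates["suffix"] > rates["prefix"])  # True
```

Rates of this kind could serve as one input feature among many; on their own they say nothing about why a term is frequent in a description.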

Task

The task is to provide the Value for an unseen Language-Feature-Source triple. No language-specific data source external to the training data (such as the classification of a language, other sources for a language etc.) may be used. However, other open, generic linguistic data sources may be used (such as the raw text of the corresponding WALS chapter, a list of linguistic terms etc.).

Data

10,000 datapoints spanning 191 languages and 100 features, along with their values and source(s), are given as training data in the following form:

Language    ISO 639-3  Feature                                         Value                Source
Macushi     mbc        31A Sex-based and Non-sex-based Gender Systems  Non-sex-based        Abbott-1991[105-106]
Macushi     mbc        57A Position of Pronominal Possessive Affixes   Possessive prefixes  Abbott-1991[85,101]; Williams-1932[61]; Carson-1982[104-106]
East Oromo  hae        118A Predicative Adjectives                     Mixed                Owens-1985
East Oromo  hae        9A The Velar Nasal                              No velar nasal       Owens-1985[10]

Features and values are defined as per WALS (http://wals.info). Sources are semicolon-separated and optionally indicate a page range in square brackets. Each source maps uniquely to an entry with bibliographical details in a BibTeX file and to the full text of the source in question. The full text is an OCR of a scan of the original source (of varying quality) and contains no formatting. OCR errors are present, especially for IPA or non-ASCII-script text in a vernacular. A total of 443 source texts are supplied. The training data can be downloaded at http://stp.lingfil.uu.se/~harald/grammar-data-mining.zip.
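
The Source field format described above (semicolon-separated keys, each optionally followed by pages in square brackets) can be split along the following lines. This is a minimal sketch: the function name `parse_sources` and the regular expression are assumptions made for illustration, and page specifications are kept as plain strings because the full range of notations used in the data is not described here.

```python
import re

# One entry: a key such as "Abbott-1991", optionally followed by "[...]".
SOURCE_RE = re.compile(r"^(?P<key>[^\[\]]+?)(?:\[(?P<pages>[^\]]*)\])?$")

def parse_sources(field):
    """Split a Source field into (key, [page_spec, ...]) pairs."""
    parsed = []
    for item in field.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        match = SOURCE_RE.match(item)
        if match is None:
            raise ValueError(f"unrecognised source entry: {item!r}")
        pages = [
            part.strip()
            for part in (match.group("pages") or "").split(",")
            if part.strip()
        ]
        parsed.append((match.group("key"), pages))
    return parsed

print(parse_sources("Abbott-1991[85,101]; Williams-1932[61]; Carson-1982[104-106]"))
# [('Abbott-1991', ['85', '101']), ('Williams-1932', ['61']), ('Carson-1982', ['104-106'])]
print(parse_sources("Owens-1985"))
# [('Owens-1985', [])]
```

Each key returned this way would then be looked up in the BibTeX file and matched to its full-text file; how the files are named is not specified here, so that step is left out.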

Submission Instructions and Dates

Authors should submit a paper of up to 8 pages conforming to the RANLP style guidelines (see http://lml.bas.bg/ranlp2019/submissions.php) describing their technical solution to the specific task. The submission should contain a link to a runnable version (e.g. on github.com) of the authors' solution. When run, this system should output a Value (and nothing else): given a language code, the feature of interest, and the source document, it should output the feature value, as exemplified below:


$ python grammar-data-mining.py "hae" "118A Predicative Adjectives" "Owens-1985"
Mixed

  • Submission is electronic, using the Softconf submission system for the Grammar Data Mining Workshop
    at https://www.softconf.com/ranlp2019/GDM/
  • Papers must be written in English.
  • Submitted papers will be peer-reviewed by three experts from a related field.
  • At least one author of each accepted paper is required to register for the RANLP 2019 conference, attend the workshop,
    and present the paper.

Paper Submission Date
Workshop paper submission deadline: 28 July 2019
Workshop paper acceptance notification: 7 August 2019
Workshop paper camera-ready version: 20 August 2019
Workshop: 5-6 September 2019

Evaluation

Each submission will be evaluated against a test set of 1000 random datapoints drawn from the same origin as the training data set. Other aspects (such as running time) will not be evaluated.
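
The scoring metric is not stated above. As a sketch under the assumption that predictions are compared to gold values by exact string match, a system could be checked on held-out training datapoints as follows; the function name `exact_match_accuracy` and the example predictions are hypothetical.

```python
def exact_match_accuracy(gold, predicted):
    """Fraction of datapoints whose predicted value equals the gold value exactly."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    # Surrounding whitespace is ignored; everything else must match.
    hits = sum(g.strip() == p.strip() for g, p in zip(gold, predicted))
    return hits / len(gold)

# Gold values taken from the sample rows; predictions are made up.
gold = ["Mixed", "No velar nasal", "Possessive prefixes", "Non-sex-based"]
predicted = ["Mixed", "No velar nasal", "Possessive suffixes", "Non-sex-based"]
print(exact_match_accuracy(gold, predicted))  # 0.75
```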

Program Committee

  • Erich Round, Senior Lecturer, School of Languages and Cultures, The University of Queensland, Australia
  • Guillaume Segerer, Researcher in Linguistics (CNRS, LLACAN), France
  • Harald Hammarström, Department of Linguistics and Philology, Uppsala University, Sweden
  • Markus Forsberg, Researcher, Språkbanken, University of Gothenburg, Sweden
  • Sebastian Nordhoff, Language Science Press, Unter den Linden 6, 10099 Berlin, Germany
  • Søren Wichmann, Researcher, Leiden University Centre for Linguistics, Netherlands
  • Shafqat Mumtaz Virk, Researcher, Språkbanken, University of Gothenburg, Sweden
  • Zeljko Agic, Associate Professor in Computer Science, IT University of Copenhagen, Denmark

Organizing Committee

Name                 Link                                                           Email
Harald Hammarström   http://stp.lingfil.uu.se/~harald/                              harald.hammarstrom@lingfil.uu.se
Markus Forsberg      https://spraakbanken.gu.se/swe/personal/markus                 markus.forsberg@svenska.gu.se
Shafqat Mumtaz Virk  https://spraakbanken.gu.se/swe/personal/shafqat                shafqat.virk@svenska.gu.se
Søren Wichmann       https://soerenwichmann.com/                                    wichmannsoeren@gmail.com
Guillaume Segerer    http://guillaumesegerer.fr                                     guillaume.segerer@cnrs.fr
Erich Round          https://languages-cultures.uq.edu.au/profile/1160/erich-round  e.round@uq.edu.au
Sebastian Nordhoff   http://langsci-press.org/about                                 sebastian.nordhoff@langsci-press.org