Canceled: The workshop has been canceled and will be rescheduled in the near future.
This Workshop/Shared Task seeks to transform a large set of digitized publications describing the grammars of the world's languages into structured databases that will enable comparison of different languages at an unprecedented breadth and depth. There are some 6,500 languages in the world, and information about their grammatical characteristics is available in book form for over 4,000 of them. Until recently, information has been extracted from grammars exclusively by hand. This procedure is naturally bounded by the limits of human capacity, and can therefore cover only a relatively small number of languages and characteristics, at a substantial investment of time.
We are now entering a phase where it is practical to use NLP tools for a number of similar tasks. At a minimum, a computer may infer some characteristics of the language described simply by counting the words used in a grammatical description; e.g., a high frequency of the term 'suffix' likely indicates that the language being described uses a lot of suffixes. Beyond this, there are less straightforward or more detailed characteristics traditionally of interest to linguists, such as where the verb is placed in the sentence (beginning, middle, end), the existence and use of participles, possessive constructions, evidentiality, and so on. Any techniques from the NLP toolbox, such as tf-idf weighting, tagging, parsing, and vector spaces, may be used in combination and as input to more sophisticated machine learning approaches.
In this shared task we provide a subset of the World Atlas of Language Structures (WALS, http://wals.info) along with the digitized sources from which the features were drawn. Sources are provided in raw text form. The task is to infer WALS datapoints from the raw text data of the digitized grammatical descriptions.
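The word-counting idea mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `term_rates`, the prefix-matching rule (so that 'suffixes' counts towards 'suffix'), and the sample text are all hypothetical and not part of the task data or of any organizer-provided code.

```python
import re
from collections import Counter


def term_rates(text, terms):
    """Return occurrences per 1,000 tokens for each linguistic term.

    Assumption: a token counts towards a term if it starts with that
    term, so inflected forms such as 'suffixes' are included.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty input
    counts = Counter()
    for token in tokens:
        for term in terms:
            if token.startswith(term):
                counts[term] += 1
    return {term: 1000 * counts[term] / total for term in terms}


# Made-up snippet standing in for the raw text of a grammar.
sample = (
    "The language marks tense with a suffix. "
    "Possessive suffixes attach to nouns. No prefix is attested."
)

print(term_rates(sample, ["suffix", "prefix"]))
```

Normalizing by document length matters here because the source texts vary greatly in size; raw counts alone would favor longer grammars.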
The task is to provide the Value for an unseen Language-Feature-Source triple. No language-specific data source external to the training data (such as the classification of a language, other sources for a language, etc.) may be used. However, other open, generic linguistic data sources may be used (such as the raw text of the corresponding WALS chapter, a list of linguistic terms, etc.).
10,000 datapoints spanning 191 languages and 100 features, along with their values and source(s), are given as training data in the following form:
|Macushi||mbc||31A Sex-based and Non-sex-based Gender Systems||Non-sex-based||Abbott-1991[105-106]|
|Macushi||mbc||57A Position of Pronominal Possessive Affixes||Possessive prefixes||
|East Oromo||hae||118A Predicative Adjectives||Mixed||Owens-1985|
|East Oromo||hae||9A The Velar Nasal||No velar nasal||Owens-1985|
Features and values are defined as per WALS (http://wals.info). Sources are semicolon-separated and optionally indicate a page range in square brackets. Each source maps uniquely to an entry with bibliographical details in a BibTeX file and to the full text of the source in question. The full text is an OCR of a scan of the original source (of varying quality) and contains no formatting. OCR errors are present, especially for IPA or non-ASCII-script text in a vernacular. A total of 443 source texts are supplied. The training data can be downloaded at http://stp.lingfil.uu.se/~harald/grammar-data-mining.zip.
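As a rough illustration of the record layout described above, the following Python sketch parses one row into its parts. It assumes the rows look exactly as rendered on this page (fields separated by `||`); the actual file layout inside the archive may differ. The names `Source`, `Datapoint`, `parse_sources`, and `parse_row` are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class Source(NamedTuple):
    key: str              # e.g. "Abbott-1991", the bibliography key
    pages: Optional[str]  # e.g. "105-106", or None if no range is given


class Datapoint(NamedTuple):
    language: str
    code: str
    feature_id: str
    feature_name: str
    value: str
    sources: List[Source]


# A source key, optionally followed by a page range in square brackets.
SOURCE_RE = re.compile(r"^(?P<key>[^\[\]]+?)(?:\[(?P<pages>[^\]]*)\])?$")


def parse_sources(field: str) -> List[Source]:
    """Split a semicolon-separated source field into Source entries."""
    sources = []
    for part in field.split(";"):
        part = part.strip()
        match = SOURCE_RE.match(part) if part else None
        if match:
            sources.append(Source(match.group("key").strip(), match.group("pages")))
    return sources


def parse_row(row: str) -> Datapoint:
    """Parse one '|a||b||c||d||e|' row; a missing source yields an empty list."""
    fields = [f.strip("|").strip() for f in row.strip().split("||")]
    fields += [""] * (5 - len(fields))  # pad rows that lack a source field
    language, code, feature, value, source_field = fields[:5]
    feature_id, _, feature_name = feature.partition(" ")
    return Datapoint(language, code, feature_id, feature_name, value,
                     parse_sources(source_field))


row = "|Macushi||mbc||31A Sex-based and Non-sex-based Gender Systems||Non-sex-based||Abbott-1991[105-106]|"
print(parse_row(row))
```

Keeping the page range separate from the source key makes it possible to look up the bibliography entry by key while optionally narrowing the search within the full text to the cited pages.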
Submission Instructions and Dates
Authors should submit a paper of up to 8 pages conforming to the RANLP style guidelines (see http://lml.bas.bg/ranlp2019/submissions.php) describing their technical solution to the specific task. The submission should contain a link to a runnable version (e.g. on github.com) of the authors' solution. When run, the system should output a Value and nothing else: given a language code, the feature of interest, and the source document, it should output the feature value.
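One possible shape for such a runnable is sketched below in Python. The argument order, the function names `predict_value` and `run`, and the placeholder return value are all assumptions made for illustration; the call does not prescribe a command-line interface, and a real system would replace the stub with an actual classifier.

```python
import argparse


def predict_value(language_code, feature, source_text):
    """Hypothetical stub: a real system would infer the WALS value here."""
    return "UNKNOWN"  # placeholder only, not a valid WALS value


def run(argv):
    """Parse arguments, read the source text, print exactly one Value."""
    parser = argparse.ArgumentParser(description="Predict a WALS feature value.")
    parser.add_argument("language_code", help="language code, e.g. mbc")
    parser.add_argument("feature", help="WALS feature, e.g. '9A The Velar Nasal'")
    parser.add_argument("source_path", help="path to the raw-text source document")
    args = parser.parse_args(argv)

    # OCR output may contain broken byte sequences, so decode leniently.
    with open(args.source_path, encoding="utf-8", errors="replace") as handle:
        source_text = handle.read()

    # Print the Value and nothing else, as the instructions require.
    print(predict_value(args.language_code, args.feature, source_text))
    return 0
```

Calling `sys.exit(run(sys.argv[1:]))` under the usual `if __name__ == "__main__":` guard would turn this into a script. Any logging or diagnostics should go to standard error so that standard output carries only the Value.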
- Submission is electronic, using the Softconf submission system for the Grammar Data Mining Workshop.
- Papers must be written in English.
- Submitted papers will be peer-reviewed by three experts from a related field.
- At least one author of each accepted paper is required to register for the RANLP 2019 conference, attend the workshop, and present the paper.
|Workshop paper submission deadline:||28 July 2019|
|Workshop paper acceptance notification:||7 August 2019|
|Workshop paper camera-ready version:||20 August 2019|
|Workshop:||5-6 September 2019|
Each submission will be evaluated against a test set of 1,000 randomly selected datapoints drawn from the same source as the training data set. Other aspects (such as running time) will not be evaluated.
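The scoring metric is not spelled out here. As a minimal sketch, assuming the score is plain exact-match accuracy over Language-Feature-Source triples, a local evaluation on held-out training data might look like the following. The function name `accuracy` and the predicted values are made up for illustration.

```python
def accuracy(gold, predicted):
    """Fraction of gold triples whose predicted Value matches exactly.

    Both arguments map (language_code, feature_id, source_key) -> Value.
    Triples missing from `predicted` count as wrong.
    """
    if not gold:
        raise ValueError("gold set is empty")
    correct = sum(1 for key, value in gold.items() if predicted.get(key) == value)
    return correct / len(gold)


# Gold values taken from the example rows; predictions are invented.
gold = {
    ("hae", "118A", "Owens-1985"): "Mixed",
    ("hae", "9A", "Owens-1985"): "No velar nasal",
}
predicted = {
    ("hae", "118A", "Owens-1985"): "Mixed",
    ("hae", "9A", "Owens-1985"): "Initial velar nasal",
}

print(accuracy(gold, predicted))
```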
- Erich Round, Senior Lecturer, School of Languages and Cultures, The University of Queensland, Australia
- Guillaume Segerer, Researcher in Linguistics (CNRS, LLACAN), France
- Harald Hammarström, Department of Linguistics and Philology, Uppsala University, Sweden
- Markus Forsberg, Researcher, Språkbanken, University of Gothenburg, Sweden
- Sebastian Nordhoff, Language Science Press, Unter den Linden 6, 10099 Berlin, Germany
- Søren Wichmann, Researcher, Leiden University Centre for Linguistics, Netherlands
- Shafqat Mumtaz Virk, Researcher, Språkbanken, University of Gothenburg, Sweden
- Zeljko Agic, Associate Professor in Computer Science, IT University of Copenhagen, Denmark
|Shafqat Mumtaz Virk||https://email@example.com|