Digital areal linguistics

The goal of this project is to create a database of comparable lexical items in a number of representative languages spoken in the Himalayan region in India and to use this database for investigating the Himalayas as a linguistic area.

The project is a collaboration with the Intercontinental Dictionary Series (IDS) project (MPI Leipzig, Germany), an international initiative that collects comparable basic vocabulary for a large number of languages in a database that can serve many purposes. The database will enable the project to investigate questions relating to the Himalayan region as a linguistic micro-area.

The results of this project will contribute to methodological development in digital documentation and linguistic typology research. It will add Himalayan languages to the IDS. It will also contribute to our knowledge of the languages of this region, of the Himalayas as a linguistic area, and of areal-typological linguistics in general.

The project is financially supported by the Swedish Research Council for the period 2010-2014 (VR dnr 2009-1448).

Project outcomes

Word lists for South Asian languages

One goal of the project was to produce basic vocabulary lists based on the concept sets from the Intercontinental Dictionary Series (IDS) project, plus the additions defined in the Loanword Typology (LWT) project (both conducted at the former Department of Linguistics, Max Planck Institute for Evolutionary Anthropology in Leipzig).

The languages are shown with their names and ISO 639-3 codes in parentheses. If a language has no ISO 639-3 code, this is noted.

How to cite:

If you use these word lists for any purpose, cite the following reference (follow the BibTeX link for full bibliographical information):

Lars Borin, Bernard Comrie, Anju Saxena (2013): The Intercontinental Dictionary Series – a rich and principled database for language comparison, in Approaches to Measuring Linguistic Differences / edited by Lars Borin, Anju Saxena, pages 285-302 BibTeX




A (revised) Swedish IDS/LWT (ISO 639-3 code: swe) list is available for download in LMF format here. It can also be searched online through Karp.


Creative Commons license
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publications


  1. Taraka Rama, Lars Borin (2015): Comparative evaluation of string similarity measures for automatic language classification, in Sequences in Language and Text BibTeX
  2. Taraka Rama, Lars Borin (2014): N-Gram Approaches to the Historical Dynamics of Basic Vocabulary, in Journal of Quantitative Linguistics, volume 21, issue 1, pages 50-64 BibTeX
  3. Lars Borin, Anju Saxena, Taraka Rama, Bernard Comrie (2014): Linguistic landscaping of South Asia using digital language resources: Genetic vs. areal linguistics, in Proceedings of LREC, May 26-31, 2014, Reykjavik, Iceland, pages 3137-3144 BibTeX
  4. Lars Borin, Anju Saxena (2013): Approaches to Measuring Linguistic Differences BibTeX
  5. Lars Borin (2013): The why and how of measuring linguistic differences, in Approaches to Measuring Linguistic Differences / edited by Lars Borin, Anju Saxena, pages 3-26 BibTeX
  6. Anju Saxena, Lars Borin (2013): Carving Tibeto-Kanauri by its joints: Using basic vocabulary lists for genetic grouping of languages, in Approaches to Measuring Linguistic Differences, pages 175-198 BibTeX
  8. Taraka Rama, Sudheer Kolachina (2013): Distance-based Phylogenetic Inference Algorithms in the Subgrouping of Dravidian Languages, in Approaches to Measuring Linguistic Differences / edited by Lars Borin, Anju Saxena. BibTeX
  9. Taraka Rama (2013): Phonotactic diversity predicts the time depth of the world's language families, in PLoS ONE, volume 8, issue 5, pages e63238 BibTeX
  10. Taraka Rama, Kolachina Prasanth, Sudheer Kolachina (2013): Two methods for automatic cognate identification, in Proceedings of the 5th Conference on Quantitative Investigations in Theoretical Linguistics QITL-5. University of Leuven, 12-14 September 2013, issue 5, pages 76-80 BibTeX
  11. Taraka Rama, Kolachina Prasanth (2012): How good are typological distances for determining genealogical relationships among languages?, in Proceedings of the 24th International Conference on Computational Linguistics BibTeX
  12. Taraka Rama, Lars Borin (2012): Properties of phoneme N-grams across the world’s language families, in Proceedings of the Fourth Swedish Language Technology Conference (SLTC) BibTeX
  13. Lars Borin (2012): Core vocabulary: A useful but mystical concept in some kinds of linguistics, in Shall we play the festschrift game? Essays on the Occasion of Lauri Carlson's 60th Birthday, pages 53-65 BibTeX
  14. Søren Wichmann, Eric Holman, Taraka Rama, Robert S. Walker (2011): Correlates of reticulation in linguistic phylogenies, in Language Dynamics and Change, volume 1, issue 2, pages 205-240 BibTeX
  15. Taraka Rama (2012): N-gram approaches to the historical dynamics of basic vocabulary, in Preproceedings of Computational approaches to the study of dialectal and typological variation BibTeX
  16. Taraka Rama, Lars Borin (2011): Estimating Language Relationships from a Parallel Corpus. A Study of the Europarl Corpus, in NEALT Proceedings Series (NODALIDA 2011 Conference Proceedings), volume 11, pages 161-167 BibTeX
  17. Anju Saxena, Lars Borin (2011): Dialect Classification in the Himalayas: a Computational Approach, in NEALT Proceedings Series (NODALIDA 2011 Conference Proceedings), volume 11, pages 307-310 BibTeX
  18. Sudheer Kolachina, Taraka Rama, Lakshmi Bai (2011): Maximum parsimony method in the subgrouping of Dravidian languages, in Quantitative Investigations in Theoretical Linguistics, volume 4, pages 52-56 BibTeX
  19. Søren Wichmann, Taraka Rama, Eric Holman (2011): Phonological diversity, word length, and population sizes across languages: The ASJP evidence, in Linguistic Typology, volume 15, issue 2, pages 177-197 BibTeX
  20. Saxena, Anju 2011. Towards empirical classification of Kinnauri varieties. P. K. Austin et al. (eds), Conference on Language Documentation and Linguistic Theory 3. London: SOAS.
  21. Saxena, Anju 2010. A survey of the linguistic situation in Kinnaur. Some preliminary observations. The 43rd International Conference on Sino-Tibetan Languages and Linguistics, 15-18.

Workshops and presentations


  1. Workshop on using standardized word lists in linguistic data collection
    Venue: University of Gothenburg. 26 October 2010.
    Co-organizers: Anju Saxena & Lars Borin
    (about 50 participants; participants from Scandinavian countries)
  2. Workshop on comparing approaches to measuring linguistic differences
    Venue: University of Gothenburg. 24–25 October 2011.
    Co-organizers: Anju Saxena, Lars Borin & K. Taraka Rama
    (about 70 participants; participants from Belgium, Canada, Denmark, Finland, France, Germany, the Netherlands, Sweden and Thailand)
  3. Gabmap tutorial organized by John Nerbonne (Rijksuniversiteit Groningen)
    Venue: University of Gothenburg. 26 October 2011 (co-sponsored event)
  4. International workshop on linguistic microareas in South Asia
    Venue: Uppsala University. 5–6 May 2014 (co-sponsored event)
    Organizer: Anju Saxena
  5. International workshop on multilingual databases
    Venue: Språkbanken, University of Gothenburg. 17 October 2014.
    Organizer: Lars Borin

Invited presentations

  1. Bernard Comrie delivered an invited lecture at the Department of Swedish, University of Gothenburg, with the title “Ditransitive constructions: a typological view” (presentation of joint work by Andrej Malchukov, Martin Haspelmath and Bernard Comrie), October 2010
    (co-sponsored event)
  2. Anju Saxena delivered a plenary lecture at the Third Conference on Language Documentation and Linguistic Theory (LDLT3) with the title “Towards empirical classification of Kinnauri varieties”.
    Venue: School of Oriental and African Studies, London. 19–20 November 2011
  3. Lars Borin delivered an invited talk with the title “For better or for worse? Going beyond short word lists in computational studies of language diversity” at the Language diversity congress (Computational issues in studying language diversity: Storage, analysis and inference).
    Venue: Rijksuniversiteit Groningen, The Netherlands, 18–20 July 2013

Project duration

Project members

  • Lars Borin (PI)
  • Taraka Rama (PhD student)
  • Anju Saxena (Researcher)
    Uppsala University
  • Bernard Comrie (Advisor)
    University of California at Santa Barbara


  • Vetenskapsrådet (2009-1448)

Research topics

  • language technology
  • areal linguistics
  • linguistic typology
  • computational linguistics
  • lexicography

Project type

  • Research project
  • Externally funded