
Taraka Rama

PhD Student

Natural Language Processing


I have been enrolled at GSLT as a PhD student since September 16, 2010. My interests are in computational historical linguistics, the creation of language resources, and unsupervised methods in NLP. I am also affiliated with the Centre for Language Technology in Göteborg.

I was a research assistant in the Digital Areal Linguistics project from 2010 to 2014.

I did my Master's in Technology in CL at IIIT-Hyderabad in 2009 and my Bachelor's in Technology in ICT at DA-IICT.

My Google Scholar page has the full list of publications.
Here is my CV.

I am writing my thesis and am very much interested in taking up a research position by September 2015. I am broadly interested in language evolution: collecting comparative data for larger families, applying computationally intensive methods to model language change, and investigating questions about language prehistory and contact in South Asia.

I contribute code to the Quantitative historical linguistics project. Clone it here.

A must-read for any PhD student


with Søren Wichmann. Jackknifing the black sheep: ASJP classification performance and Austronesian. For the proceedings of the symposium "Let's talk about trees", National Museum of Ethnology, Osaka, Febr. 9-10, 2013. pdf


with Lars Borin. Properties of phoneme N-grams across the world’s language families. pdf

Publications before 2010

A Computational Model of the Phonetic Space and Its Applications. In process

Anil Kumar Singh, Sethuramalingam Subramaniam and Taraka Rama. 2010 Transliteration as Alignment vs. Transliteration as Generation for the Purpose of Crosslingual Information Retrieval. Traitement Automatique des Langues, Special Issue on Multilingualism and NLP. Vol. 51, Number 2. 2010. [pdf][Bibtex]

Taraka Rama, Sudheer Kolachina and Lakshmi Bai B. 2009 Quantitative methods for Phylogenetic Inference in Historical Linguistics: An experimental case study of South Central Dravidian. Indian Linguistics, Vol. 70, 2009.[pdf]

Karthik Gali, Sriram Venkatapathy and Taraka Rama. 2009 From Factorial to Quadratic Time Complexity for Sentence Realization using Nearest Neighbour Algorithm. STIL 2009, Brazil

Taraka Rama, Anil Kumar Singh. 2009 From Bag of Languages to Family Trees from Noisy Corpus. RANLP 2009, Borovets, Bulgaria.[pdf]

Taraka Rama, Karthik Gali. 2009 Modeling Transliteration as a Phrase Based Statistical Machine Translation Problem, NEWS 2009, ACL-IJCNLP 2009, Singapore [pdf]

Taraka Rama, Anil Kumar Singh and Sudheer Kolachina. 2009 Modeling Letter to Phoneme Conversion as a Phrase Based Statistical Machine Translation Problem with Minimum Error Rate Training. NAACL HLT 2009 Student Research Workshop, Boulder, Colorado, USA. [pdf]

Taraka Rama, Karthik Gali and Avinesh PVS. 2008 Does Syntactic Knowledge help English-Hindi SMT? Proceedings of the NLP Tools contest, ICON 2008. [pdf]

I maintain this page on references to Computational historical linguistics.

I worked on cognate identification and phylogeny for my Master's thesis.
The contributions of this thesis are as follows:

  • Applied SMT techniques to the transliteration shared task (IJCNLP 2009) for the language pair Hindi-English.
  • Applied SMT techniques to the Letter to Phoneme conversion task for English, German, and French.
  • Applied phylogenetic techniques such as Maximum Parsimony and Bayesian techniques to the character data of South-Central Dravidian languages.
  • Developed a phonetically well-formulated model for Indian languages and applied it to the corpora for computing the distance between 9 Indian languages. The model can be applied to any phonetic transcription system.
  • The work on Dravidian languages laid the foundation for providing data from the Etymological dictionary of Dravidian languages.

My first supervisor is Lars Borin.
My second supervisor is Søren Wichmann.
My third supervisor is Markus Forsberg.

I spent the spring of 2012 at the Max Planck Institute for Evolutionary Anthropology, Leipzig.

I spent two months (March and April of 2013) as a Guest Researcher at Alpha-Informatica.

There is an interesting open-source NLP toolkit for South Asian languages: Sanchay.

Here is the thesis. I tried to link recent work in computational historical linguistics with work done in traditional historical linguistics. If you ever cite it, please cite it using the BibTeX entry.


Taraka Rama 2015. Automatic cognate identification with gap-weighted string subsequences. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1227-1231 [pdf] BibTeX

Taraka Rama, Lars Borin 2015. Comparative evaluation of string similarity measures for automatic language classification. Sequences in Language and Text. De Gruyter Mouton. BibTeX

Taraka Rama, Lars Borin 2014. N-Gram Approaches to the Historical Dynamics of Basic Vocabulary. Journal of Quantitative Linguistics. [pdf] BibTeX

Lars Borin, Anju Saxena, Taraka Rama, Bernard Comrie 2014. Linguistic landscaping of South Asia using digital language resources: Genetic vs. areal linguistics. Proceedings of LREC 2014. 3137-3144 [pdf] BibTeX

Taraka Rama 2014. Vocabulary lists in computational historical linguistics. University of Gothenburg. [pdf] BibTeX

Taraka Rama, Prasanth Kolachina, Sudheer Kolachina 2013. Two methods for automatic cognate identification. Quantitative Investigations in Theoretical Linguistics. 76 [pdf] BibTeX

Taraka Rama, Sudheer Kolachina 2013. Distance-based Phylogenetic Inference Algorithms in the Subgrouping of Dravidian Languages. Approaches to Measuring Linguistic Differences. De Gruyter Mouton. BibTeX

Taraka Rama 2013. Phonotactic diversity predicts the time depth of the world's language families. PLoS ONE. [pdf] BibTeX

Taraka Rama, Prasanth Kolachina 2012. How good are typological distances for determining genealogical relationships among languages? Proceedings of the 24th International Conference on Computational Linguistics [pdf] BibTeX

Taraka Rama 2012. N-gram approaches to the historical dynamics of basic vocabulary. Preproceedings of Computational approaches to the study of dialectal and typological variation. [pdf] BibTeX

Søren Wichmann, Eric Holman, Taraka Rama, Robert S. Walker 2012. Correlates of reticulation in linguistic phylogenies. Language Dynamics and Change. [pdf] BibTeX

Taraka Rama, Lars Borin 2012. Properties of phoneme N-grams across the world’s language families. Proceedings of the Fourth Swedish Language Technology Conference (SLTC). BibTeX

Søren Wichmann, Taraka Rama, Eric Holman 2011. Phonological diversity, word length, and population sizes across languages: The ASJP evidence. Linguistic Typology. [pdf] BibTeX

Sudheer Kolachina, Taraka Rama, Lakshmi Bai 2011. Maximum parsimony method in the subgrouping of Dravidian languages. Quantitative Investigations in Theoretical Linguistics. [pdf] BibTeX

Taraka Rama, Lars Borin 2011. Estimating Language Relationships from a Parallel Corpus. A Study of the Europarl Corpus. NEALT Proceedings Series (NODALIDA 2011 Conference Proceedings). 161-167 [pdf] BibTeX

All publications as BibTeX

I never got the chance to teach. I hope I will be able to teach sometime.

Taraka Rama. Presentation at Licentiate Seminar. January 2014.[pdf]

Taraka Rama. Comparative study of string similarity and vector similarity measures for Bulgarian dialect classification. Poster at Language Diversity Congress, Computational Issues in Studying Language Diversity: Storage, Analysis and Inference. July 2013.

Taraka Rama, Prasanth Kolachina. How good are typological distances for determining genealogical relationships among languages? Poster at COLING 2012, IIT Mumbai. [pdf]

Taraka Rama. N-gram approaches to the historical dynamics of basic vocabulary. ESSLLI workshop on computational approaches to typological and dialectological variation. 2012 [pdf]

Taraka Rama (Joint work with Søren Wichmann and Eric W. Holman). Correlates of reticulation in linguistic phylogenies. CLT fall seminar. 2012 [pdf]

Taraka Rama, Lars Borin. Properties of phoneme N-grams across the world’s language families. Fourth Swedish Language Technology Conference, University of Lund. 2012 [pdf]

First year at GSLT. 2011 [pdf]

Taraka Rama, Sudheer Kolachina. Distance-based algorithms in the subgrouping of Dravidian languages. Workshop on comparing approaches to measuring linguistic differences. 2011 [pdf]

Søren Wichmann, Taraka Rama, Eric W. Holman. Phonological diversity, Mean word length and Population Sizes across the world's languages. CLT Retreat 2011 [pdf]

Sudheer Kolachina, Taraka Rama. Revisiting Unchanged Cognates as criterion in Linguistic Subgrouping. ICHL, Osaka 2011.[pdf]

Taraka Rama, Lars Borin. Estimating language distances from parallel corpus. A study of the Europarl corpus. NODALIDA 2011, Latvia. [pdf]

Sudheer Kolachina, Taraka Rama, Lakshmi Bai. Maximum Parsimony for subgrouping in Dravidian. QITL, Berlin 2011. [pdf]

Taraka Rama. Explorations in Phoneme N-grams for Automatic Language Classification. CLT Seminar, March 2011. [pdf]

Contact information

Taraka Rama

Språkbanken, Göteborgs universitet, Box 200
405 30 Göteborg

Visiting address:
Lennart Torstenssonsgatan 8

+46 (0)31 786 4533
