Linguistics is entering the age of big data and e-science. It is now possible to formulate new research questions, and to address old research questions from a new angle or verify established results, on the basis of exhaustive collections of data rather than small samples. In this project we will study an old linguistic research question using big data: the South Asian linguistic area hypothesis. South Asia is regularly mentioned in the literature as a classic linguistic area, but systematic investigations of this claim are lacking, and the need for a more thorough study has been stressed repeatedly. Carrying out such a study is our primary empirical objective.
Grierson’s Linguistic Survey of India (LSI; 1903-1927) remains the most complete source on South Asian languages. Its 21 tomes (9500 pages) cover 723 South Asian linguistic varieties. Comparable lexical and grammatical information is provided for 268 varieties representing the four major South Asian language families. We will work with an extensive data set extracted from a digitized version of the LSI and develop computational methods for conducting large-scale quantitative comparative studies of lexicon and grammar in order to establish typological profiles of the four major families. These profiles will give us a firm empirical basis for evaluating the South Asian areal hypothesis.