
Word2Vec Study on Kubhist

This page summarizes a study in which we investigate the Word2Vec model [W2V] on Kubhist [Kubhist], a Swedish historical newspaper archive spanning 1749-1925. We consider this a small feasibility study on neural embeddings for the Kubhist material and, assuming the results are of reasonable quality, a starting point for automatic word sense change detection on the basis of sense-differentiated word embeddings. We make use of 10 words: nyhet 'news', politik 'politics', telefon 'telephone', telegraf 'telegraph', kvinna 'woman', man 'man', glad 'happy', retorik 'rhetoric', resa 'travel' and musik 'music'.

We train a separate W2V model for each year (1749-1925) of the Kubhist dataset. Because vectors from models trained on different corpora cannot be compared directly (they would first need to be projected onto the same space), we instead compare the words that are closest to a vector. That is, for each word w from the list above, we print the 10 words whose vectors are closest to the vector of w in a given year. When all years are processed, we have a table for each word w, where each line corresponds to a year and contains the 10 closest words, provided the model could produce any results. Certain years have no words because no vector could be found for w, i.e., there was too little evidence for w in the data for that year. Each line is of the form: "1782 0.0 fattigdom koka möbel utgift lampa dryck grön springa utbekommande tillåtes" where the first number is the year y (1782 in this case), the second is the Jaccard similarity between the top-10 words for year y and those for the preceding year, and the remaining entries are the 10 top words.
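The line format described above can be illustrated with a minimal Python sketch. It is not the code used in the study: the function names jaccard and build_table are hypothetical, the nearest-neighbour lists are assumed to be already computed, years without a vector are assumed to be stored as empty lists, and a similarity of 0.0 is assumed whenever both lists are empty. The 1782 entry reuses the example line above; the 1783 entry is toy data made up for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two word lists."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty lists give 0.0
    return len(set_a & set_b) / len(union)


def build_table(neighbours_by_year):
    """Return one line per year: year, Jaccard vs. preceding year, top words."""
    lines = []
    previous = []  # top words of the preceding year (empty before the first year)
    for year in sorted(neighbours_by_year):
        words = neighbours_by_year[year]
        score = jaccard(words, previous)
        lines.append(" ".join([str(year), str(round(score, 3)), *words]))
        previous = words
    return lines


# Toy input: 1781 has no vector, 1782 is the example from the text,
# 1783 is invented (five shared words, five placeholder words).
toy = {
    1781: [],
    1782: ["fattigdom", "koka", "möbel", "utgift", "lampa",
           "dryck", "grön", "springa", "utbekommande", "tillåtes"],
    1783: ["fattigdom", "koka", "möbel", "utgift", "lampa",
           "w1", "w2", "w3", "w4", "w5"],
}

table = build_table(toy)
for line in table:
    print(line)
```

In this sketch, a year with no vector yields a line with only the year and the score, and the year after it gets a similarity of 0.0 because nothing overlaps with an empty list. Whether the study handled such gaps the same way is not stated here.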

[W2V] Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. CoRR abs/1301.3781 (2013)
[Kubhist] https://spraakbanken.gu.se/korp/?mode=kubhist

 

Attachment: W2VStudyOnKubhist.zip