Recently, we have seen a surge of methods that claim to capture meaning from textual corpora. But is that possible? Can text really reveal meaning, and if so, can current NLP methods detect it? Can our methods, as is sometimes claimed, understand? Perhaps the larger question is this: can we bring meaning to words using only the information stored in text? This question is essential for any Artificial Intelligence (AI) system that uses text as its basis.
Let us take the example of fear. To us, fear is an emotion associated with certain specific characteristics: anxiety, stress, increased heart rate, sweating, and so on. When we write down the feeling, it becomes a word, fear, a linguistic form that refers to a non-linguistic referent (i.e., the real-world phenomenon that we feel as, among other things, an increase in our heart rate). The word also carries with it a meaning, the link between the form and the referent. Now, can meaning be assigned to the word fear by someone (machine or human) who has not experienced fear?
When we say that animals can smell our fear, it is not the word they smell, nor is it our emotion. Instead, they smell the bodily consequences of our fear: when we are afraid, we smell different, our heart rate increases, and we show other physical signs of distress. These very tangible signals can be measured, monitored, and used in all sorts of applications. Recently, AI researchers at Microsoft showed that an algorithm for self-driving cars that had (partial) access to a human's fear in different situations could decrease its error rate by 25%. By putting sensors on a driver's finger, the algorithm could detect how scared the human driver was and use this to slow down where other, physical cues were insufficient. These results were obtained using a total of 80 minutes of driving by four different drivers in a simulator. Imagine what would happen if they had access to hundreds of hours of sensory input.
In NLP, much of the recent breakthrough has been made using neural embedding techniques, in which the meaning representations of words are a central component. These are either learned directly on the text or pretrained and transferred from other domains where much larger-scale data is typically available. [efn_note] In fact, using pre-trained contextual embeddings, for example BERT models, is becoming the de facto standard. "Average" or "static" word embedding techniques are having difficulty measuring up. However, the comparison is not entirely fair, since few have access to the amounts of data used to train the BERT models. [/efn_note] Typically, the vectors are built from textual corpora with little or no linguistic analysis (such as lemmatization or dependency information). Still, for many of our standard tasks, they seem to perform extremely well. In particular fields, of which machine translation is perhaps the most prominent, performance has improved many times over.
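To make the contrast between static and contextual word vectors a little more concrete, here is a minimal sketch (not part of the original discussion), assuming the Hugging Face transformers library and the publicly available bert-base-uncased checkpoint; the example sentences and the similarity check are purely illustrative.

```python
# Minimal sketch (illustrative only): the same surface form, "fear", receives
# different contextual vectors from a pretrained BERT model depending on the
# sentence it appears in, whereas a static embedding assigns one vector per word.
# Assumes the Hugging Face `transformers` library, the `bert-base-uncased`
# checkpoint, and that "fear" is a single token in BERT's vocabulary.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_word(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual vector BERT assigns to `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # shape: (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v1 = embed_word("she could not hide her fear of the dark", "fear")
v2 = embed_word("i fear we are already too late", "fear")
# A cosine similarity below 1.0 shows the two occurrences get different vectors.
print(torch.cosine_similarity(v1, v2, dim=0).item())
```

A static model such as word2vec would instead return one and the same vector for fear in both sentences, which is the limitation the footnote above alludes to.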
Much of the debate around neural embedding spaces concerns whether or not they can indeed model the meaning of words. With access to only the text, how much can be learned about our language? When we are fluent in a language, form and meaning are blended, and it can be difficult to tease them apart. Things are just how they are, and we do not need more than the words to understand most of what we read. A word, or a phrase, carries with it the cultural and social context in which it was produced, and as long as we share those, we need only the words. Have you ever tried reading a historical newspaper text? (You can try here.) Interpretation is not straightforward. Does a word mean A or B in a specific context? Often it is almost impossible to understand the meaning of a word from a single sentence. Outside the social and cultural context, the difficulty of understanding increases many times over and the text becomes insufficient; we need additional clues. Is happy a strong word? Is awesome? The exact same thing happens with modern texts; it is just that we have access to the modern context without quite being aware of it. And we do not always have the modern context either, because of synchronic variation; see "Variation is Key" by Claire Bowern.[efn_note] https://www.aclweb.org/anthology/W19-4706/ [/efn_note]
In the process of understanding, we cannot tease apart the form and the worldly information we have stored in our minds. For that very reason, because text encodes only the form and not the social and cultural contexts needed to understand it, text alone is not sufficient for building coherent embedding spaces that fully capture meaning. So how do we overcome this difficulty? To borrow the words of Emily Bender (in fact, much of this blog entry was inspired by a talk given by Prof. Bender), we need to give our models training data in the form of text AND some grounding. What if we let the models smell our fear, or feel our heart rate? Would they then be able to capture the strength of fear, or what it means to be awesome or extremely cool? Perhaps, though, this is not sufficient for all classes of words. Can we differentiate running from fear when both increase our heart rate in similar ways? Perhaps we need an EEG? We probably cannot learn the meaning of a table from a human's emotional responses, but we could couple our models with many more sensory cues. Facial expressions would help us learn which things we like or dislike, which things make us happy or sad, and which excite us. What do we need to learn language as a whole? To teach our models to actually understand, or at the very least, to model our understanding? And when can we say that we have succeeded? To me, these are the million-dollar questions.