Position statements for Sustainable language representations for a changing world, 31 May 2021

Here are the submitted position statements for the NoDaLiDa workshop Sustainable language representations for a changing world, 31 May 2021. We have tried to match them with the three workshop topics (societal, technical, and legal challenges), but it's not always that easy, so don't be alarmed if one statement happens to be in the wrong place.

Christina Tånnander and Björn Westling: Language models at the Swedish Agency for Accessible Media
Marina Santini, Evelina Rennes, Daniel Holmer and Arne Jönsson: Human-in-the-Loop: Where Does Text Complexity Lie?
Nikolai Ilinykh and Simon Dobnik: Taking BERT for a walk: on the necessity of grounding, multi-modality and embodiment for impactful NLP
Jenny Kunz: Data Transparency and Interpretability of Language Representations
Riley Capshaw, Eva Blomqvist, Marina Santini and Marjan Alirezaie: BERT is as Gentle as a Sledgehammer: Too Powerful or Too Blunt? It Depends on the Benchmark

Language models at the Swedish Agency for Accessible Media

Christina Tånnander and Björn Westling (Swedish Agency for Accessible Media / Myndigheten för tillgängliga medier, MTM)

The Swedish Agency for Accessible Media (MTM) is a governmental agency with the mission to provide people with vision impairments or reading difficulties with text in accessible formats, for example talking books (with human or synthetic voices) or Braille. We operate according to an exception in the Swedish copyright law, which states that MTM is allowed to duplicate editions of books in accessible formats for its target groups.

MTM’s use of TTS differs from that of most other users and producers of synthetic speech. While other actors mainly offer a service where the user can convert text to speech, MTM produces the synthetic talking book as a product, delivered to the users in a ready-to-read package. Most of the texts MTM adapts are protected by Swedish copyright law, and under that law MTM is obliged to convert the books without changing the content.

Using language models

MTM uses text-to-speech synthesis (TTS) to produce more than 100 daily newspapers and 1 000 talking books per year, and has an annual production of about 600 books in Braille. In the production of synthetic talking books, a range of language models and other language representations are used, for example part-of-speech tagging, compound decomposition and generation, grapheme-to-phoneme conversion, pronunciation lexicons, frequency lists, and text-to-speech conversion. Language representations are also involved in other areas, for example in tools for detecting OCR errors and for hyphenation in text-to-Braille conversion.

It is well known that synthetic speech contains errors, and sometimes words are mispronounced beyond recognition. As mentioned above, we need to render the books without changing the content, which means that we need to correct detected errors immediately. Using black boxes is therefore often not an option. Instead, we try to cooperate with our suppliers of such services, to get access to at least the most necessary information (such as coverage of word pronunciations) in order to make it possible to control the system, for example by using SSML (Speech Synthesis Markup Language). We could conceivably use more third-party models, but the lack of information about what they can and can’t do sometimes leaves us with no option other than to use our own models.
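As a minimal sketch of the kind of control SSML offers, the following Python snippet wraps one word in an SSML phoneme element so that a synthesizer is given an explicit pronunciation. The helper name, the example sentence and the IPA transcription are illustrative assumptions, not MTM's production tooling, and whether a given TTS engine honours the phoneme element depends on the supplier.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"  # namespace defined by the W3C SSML standard


def ssml_with_pronunciation(sentence: str, word: str, ipa: str) -> str:
    """Return SSML in which the first occurrence of `word` carries an
    explicit IPA pronunciation. Hypothetical helper, for illustration only."""
    before, found, after = sentence.partition(word)
    if not found:
        raise ValueError(f"{word!r} does not occur in the sentence")
    speak = ET.Element(
        "speak", {"version": "1.1", "xmlns": SSML_NS, "xml:lang": "sv-SE"}
    )
    speak.text = before                      # text preceding the corrected word
    phoneme = ET.SubElement(speak, "phoneme", {"alphabet": "ipa", "ph": ipa})
    phoneme.text = found                     # the written form stays unchanged
    phoneme.tail = after                     # text following the corrected word
    return ET.tostring(speak, encoding="unicode")


# Illustrative usage: the transcription is an assumption, not a lexicon entry.
print(ssml_with_pronunciation("Tåget går till Göteborg i dag.", "Göteborg", "jœtɛˈbɔrj"))
```

The written text is left untouched and only the pronunciation is overridden, which matches the requirement to render the content without changing it.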

Providing language models

Finally, MTM possesses large quantities of data that could be used to train models, for example more than 100 000 talking books, the text belonging to these books, and large pronunciation lexicons. We are currently looking at what kind of data we can share without violating copyright law or GDPR (voices are personal information), but even when we can, MTM cannot be put in a situation where the agency is responsible for these, nor do we have the capacity to take on a supporting role in how to download or use the models. In these questions, we are cooperating with the national research infrastructure Språkbanken Tal.

Human-in-the-Loop: Where Does Text Complexity Lie?

Marina Santini (RISE Sweden), Evelina Rennes, Daniel Holmer, and Arne Jönsson (Linköping University)

In this position statement, we would like to contribute to the discussion about how to assess the quality and coverage of a model. In this context, we articulate the need for interpretable linguistic features and for profiling textual variation. These needs arise from the necessity to gain insights into intricate patterns of human communication. Arguably, the functional and linguistic interpretation of these communication patterns contributes to keeping humans’ needs in the loop, thus debunking the myth of a powerful but dehumanized Artificial Intelligence.

The need to open up the “black boxes” of AI-based machines has become compelling. Recent research has focussed on how to make sense of and popularize deep learning models and has explored how to “probe” these models to understand how they learn. The field of BERTology is actively and diligently digging into BERT’s complex clockwork. However, much remains to be unearthed: “BERTology has clearly come a long way, but it is fair to say we still have more questions than answers about how BERT works”. It is therefore not surprising that add-on tools are being created to inspect pre-trained language models with the aim of casting some light on the “interpretation of pre-trained models in the context of downstream tasks and domain-specific data”.

Here we do not propose any new tool, but we try to formulate and exemplify the problem by taking the case of text simplification/text complexity. When we compare a standard text and an easy-to-read text (e.g. lättsvenska or simple English) we wonder: where does text complexity lie? Can we pin it down? According to Simple English Wikipedia, “(s)imple English is similar to English, but it only uses basic words. We suggest that articles should use only the 1,000 most common and basic words in English. They should also use only simple grammar and shorter sentences.” This characterization of a simplified text does not provide much linguistic insight: what is meant by simple grammar? Linguistic insights are also missing from state-of-the-art NLP models for text simplification, since these models are basically monolingual neural machine translation systems that take a standard text and “translate” it into a simplified type of (sub)language. We do not gain any linguistic understanding of what is being simplified and why. We just get the task done (which is of course good). We know for sure that standard and easy-to-read texts differ in a number of ways, and we are able to use BERT to create classifiers that discriminate between the two varieties. But how are linguistic features re-shuffled to generate a simplified text from a standard one? With traditional statistical approaches, such as Biber’s multi-dimensional analysis (MDA, based on factor analysis), we get an idea of how linguistic features co-occur and interact in different text types and why. Since pre-trained language models are more powerful than traditional statistical models such as factor analysis, we would like to see more research on "disclosing the layers" so that we can understand how different co-occurrences of linguistic features contribute to the makeup of specific varieties of texts, like simplified vs standard texts. Would it be possible to update the iconic example

King − Man + Woman = Queen

with a formalization, such as:

α(appositions+nouns) + β(readability+verbs) − γ(pronouns+adjectives) = ScientificWriting
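For reference, the vector arithmetic behind the iconic example can be sketched in a few lines of Python. The three-dimensional vectors below are hand-made toy values chosen for illustration, not embeddings from any trained model, and the function names are hypothetical. The open question raised here is whether comparably interpretable arithmetic could be defined over linguistic features rather than over words.

```python
import math

# Hand-made toy vectors (NOT learned embeddings); the three dimensions can be
# read loosely as "royalty", "maleness" and "femaleness".
TOY_VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("king", "man", "woman", TOY_VECTORS))
```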

Taking BERT for a walk: on the necessity of grounding, multi-modality and embodiment for impactful NLP

Nikolai Ilinykh and Simon Dobnik (University of Gothenburg, CLASP)

'Understanding' language has recently been argued to depend on whether its representations are grounded in a non-linguistic domain that the language refers to, e.g., a database (Bender, Koller 2020; Merrill, Goldberg, Schwartz, Smith 2021). A significant amount of human knowledge is encoded in text, and NLP has proven very successful at operating on and, to some extent, reasoning with such information. However, the use of language in human behaviour and, therefore, the purpose of NLP go beyond text-only tasks. Through grounding, we can connect NLP research and applications where communication is needed, e.g., answering questions about maps. Ultimately, we want to construct embodied agents that co-exist with us, either in the real world or connected to particular applications such as car navigation systems, which can see, move, interact and hold a meaningful conversation with humans. We argue that impactful NLP research must move beyond the uni-modal textual domain to address real-world needs while preserving the connection with existing models designed for the textual domain. How, then, could our text-based models be realized in terms of embodiment? We believe that such models would greatly extend the applications of NLP to new domains (or improve the current ones) and therefore lead to a positive scientific impact on the environment and human society, e.g., building agents to physically and informationally assist the elderly when (if) another pandemic arises.

Our first point for discussion is how we are training grounded architectures and which environment we are using. Our agents can act in either a virtually modelled world or a natural environment. Most of the existing research has focused on modelling virtual agents due to the complexity of building a task-specific real-world scene. However, training a virtual agent means that we need to resolve several questions, which typically never arise in the real world: if we are using vision, do we want to build our agent in a 2D or a 3D world? In addition, the meaning of words sometimes has to be inferred and understood based on the environment and background knowledge (e.g., metaphors). Then, how do we supply our agent with common-sense knowledge about the state of things in the world? The agent cannot learn this type of knowledge from text alone; therefore, we must provide the agent with access to other sources of knowledge and connect them with textual knowledge. How do we transfer knowledge from the virtual domains to the real world?

Another issue is related to the interpretability and explainability of our task-specific models. To understand where and how such agents acquire knowledge, we need to decide on the granularity of interpretation of their behaviour. How restricted should the task for the agent be so that the system is still generally useful? One way is to go with a relatively simple multi-modal task (e.g., visual question answering). However, we lose on the aspect of building models of human linguistic and non-linguistic behaviour, which might potentially occur in more 'natural' and less limited environments (e.g., visual dialogue: smooth conversation wrapped in a single discourse). How do all tasks that we typically model hang together?

In sum, we hope that our position statement on a few of the issues related to applying NLP to grounded agents encourages more NLP researchers to design models with a thorough understanding of how explainable and applicable these models are when released "into the wild".

Data Transparency and Interpretability of Language Representations

Jenny Kunz (Linköping University)

Transparency is a fundamental criterion to establish trust. Open source software is often considered more trustworthy, as it is verifiable how exactly the software operates. Users can check for intended and unintended backdoors and for other harmful behavior.

While the developers of language representations often publish the source code and models, they hardly have full control over their model's behavior.

Even for the developers themselves, it can be difficult to detect attacks, especially at the data level. Not only can already created data be modified (or "poisoned"), but in the case of common-crawled data, web pages can be modified or social media platforms can be flooded with content that the attacker wishes the model to contain, e.g. with political propaganda or for marketing purposes.

Such attacks aren't visible via common metrics and NLP tasks. They aren't likely to affect the GLUE or SuperGLUE score that models are often evaluated on, nor can they easily be probed, as the developer would need to know what exactly to probe for. Even when unwanted behavior is discovered, it can be hard to attribute it to an attack rather than to generic properties of the training data: when the data was crawled from social media sites, it is to be expected that all kinds of views are included. A certain number of trolls advertising products or spreading propaganda for political groups or regimes is common, and even authentic users advocate such views. An artificially increased number of such posts can, however, bias the data and thereby the model in a harmful way.

For the user of the model it is usually infeasible to verify what exact data the model was trained on. The exact reproduction of the model parameters requires access to exactly the same data in the same order, and exactly the same preprocessing, training strategy, initializations and other randomized components. And with current models, the reproduction also requires significant financial and computational resources, not to mention time and skills.
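To illustrate why exact reproduction depends on the same data in the same order, here is a minimal Python sketch that fingerprints a training corpus with SHA-256. The function name and the tiny example corpus are hypothetical; a published fingerprint of this kind would only let a user confirm that they hold the same data in the same order, and says nothing about what the data contains.

```python
import hashlib


def corpus_fingerprint(documents):
    """Order-sensitive SHA-256 fingerprint of a sequence of text documents.
    Hypothetical helper, for illustration only."""
    digest = hashlib.sha256()
    for doc in documents:
        data = doc.encode("utf-8")
        # Prefix each document with its byte length so that document
        # boundaries matter: ["ab", "c"] and ["a", "bc"] hash differently.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


# Toy corpus standing in for crawled training data.
corpus = ["first post", "second post", "third post"]
same_order = corpus_fingerprint(corpus)
shuffled = corpus_fingerprint(list(reversed(corpus)))

print(same_order == corpus_fingerprint(list(corpus)))  # same data, same order
print(same_order == shuffled)                          # same data, different order
```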

Trust, when dealing with pre-trained language representations, has much more to do with data transparency than with releasing the source code. What data was used, and who takes responsibility for it? Has the data set been verified and documented before training a model with it? If the data was crawled, were measures taken to remove unwanted bias, and if so, which ones exactly?

The problem of data transparency relates to the fact that the inner workings of neural network models remain opaque, and will remain so for the foreseeable future. The interpretation of language representations is gaining huge popularity, but its methodology remains explorative. Only well-known undesired biases may be feasible to probe for post hoc by using specific prompts. I argue that as long as this remains true and the user does not have feasible options to check the model for harmful content, the process of data acquisition, selection and processing should be documented rigorously, and be as open as the code.

BERT is as Gentle as a Sledgehammer: Too Powerful or Too Blunt? It Depends on the Benchmark

Riley Capshaw, Eva Blomqvist (Linköping University), Marina Santini (RISE Sweden), and Marjan Alirezaie (Örebro University)

In this position statement, we wish to contribute to the discussion about how to assess quality and coverage of a model.

We believe that BERT's prominence as a single-step pipeline for contextualization and classification highlights the need for benchmarks to evolve concurrently with models. Much recent work has touted BERT's raw power for solving natural language tasks, so we used a 12-layer uncased BERT pipeline with a linear classifier as a quick-and-dirty model to score well on the SemEval 2010 Task 8 dataset for relation classification between nominals. We initially expected there to be significant enough bias from BERT's training to influence downstream tasks, since it is well-known that biased training corpora can lead to biased language models (LMs). Gender bias is the most common example, where gender roles are codified within language models. To handle such training data bias, we took inspiration from work in the field of computer vision. Tang et al. (2020) mitigate human reporting bias over the labels of a scene graph generation task using a form of causal reasoning based on counterfactual analysis. They extract the total direct effect of the context image on the prediction task by "blanking out" detected objects, intuitively asking "What if these objects were not here?" If the system still predicts the same label, then the original prediction is likely caused by bias in some form. Our goal was to remove any effects from biases learned during BERT's pre-training, so we analyzed total effect (TE) instead. However, across several experimental configurations we found no noticeable effects from using TE analysis. One disappointing possibility was that BERT might be resistant to causal analysis due to its complexity. Another was that BERT is so powerful (or blunt?) that it can find unanticipated trends in its input, rendering any human-generated causal analysis of its predictions useless. We nearly concluded that what we expected to be delicate experimentation was more akin to trying to carve a masterpiece sculpture with a self-driven sledgehammer. 
We then found related work where BERT fooled humans by exploiting unexpected characteristics of a benchmark. When we used BERT to predict a relation for random words in the benchmark sentences, it guessed the same label as it would have for the corresponding marked entities roughly half of the time. Since the task had nineteen roughly balanced labels, we expected much less consistency. This finding repeated across all pipeline configurations; BERT was treating the benchmark as a sequence classification task! Our final conclusion was that the benchmark is inadequate: all sentences appeared exactly once with exactly one pair of entities, so the task was equivalent to simply labeling each sentence. We passionately claim from our experience that the current trend of using larger and more complex LMs must include concurrent evolution of benchmarks. We as researchers need to be diligent in keeping our tools for measuring as sophisticated as the models being measured, as any scientific domain does.
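The consistency check described above can be sketched as follows. This is a minimal illustration, not the authors' experimental code: the function names and the toy label lists are hypothetical, and the chance baseline assumes perfectly balanced labels and independent predictions.

```python
def agreement_rate(entity_labels, random_word_labels):
    """Fraction of sentences for which the label predicted with the marked
    entities equals the label predicted with random words marked instead."""
    if len(entity_labels) != len(random_word_labels):
        raise ValueError("both label lists must cover the same sentences")
    if not entity_labels:
        raise ValueError("at least one sentence is required")
    matches = sum(a == b for a, b in zip(entity_labels, random_word_labels))
    return matches / len(entity_labels)


def chance_agreement(num_labels):
    """Expected agreement of two independent, uniformly distributed predictions."""
    return 1.0 / num_labels


# Toy predictions for four sentences (illustrative values, not real results).
with_entities = ["Cause-Effect", "Other", "Component-Whole", "Other"]
with_random_words = ["Cause-Effect", "Other", "Entity-Origin", "Message-Topic"]

print(agreement_rate(with_entities, with_random_words))
print(round(chance_agreement(19), 3))
```

Under these assumptions, nineteen balanced labels would give an agreement of only about 5% by chance, which is why agreement near 50% suggests that the model is labeling the sentence rather than the entity pair.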
