Approaches to corpus searches - a mini-workshop at Språkbanken Text

Submitted by Elena Volodina on 2024-05-21

1. Workshop in a nutshell

On April 23, 2024, Språkbanken Text and Huminfra organized a well-attended mini-workshop, Approaches to corpus searches. The workshop was devoted to showcasing new perspectives on corpus searches from different angles: technical, user-oriented, and visualization. Four invited experts gave presentations: Špela Arhar Holdt from Slovenia, Johannes Graën from Switzerland, and Yousuf Ali Mohammed and Peter Ljunglöf from Sweden.

Figure 1: The audience at the mini-workshop.

Špela’s talk introduced the audience to a tool designed for teachers of Slovene, with support for pedagogically relevant types of searches in one specific corpus. The tool boasts a number of very attractive functionalities and can display learner-written essays in parallel, the original version alongside its correction, with the possibility to define searches in either of them.

Figure 2: Špela Arhar Holdt is giving her presentation.

Johannes outlined the bigger picture of infrastructural components that have recently been implemented at the University of Zurich, with a focus on one particular corpus search platform, LCP, the LiRI Corpus Platform. The strength of the platform is that it allows searching multiple corpora in multiple modalities and supports queries in a new query language, rather than in CQL (Corpus Query Language), which is the current standard in the field. Johannes demonstrated several searches and outlined future scenarios.

Figure 3: Johannes Graën is giving a presentation.

Figure 4: Yousuf Ali Mohammed is talking about Strix, Språkbanken Text’s corpus search tool for searches on the document level.

Yousuf Ali Mohammed (Samir) presented the current version of Strix for searches on the document level, demonstrating various new visualization techniques, such as word rain for exploring word prominence and the topicality of a specific text, and visualization of statistics.

2. What can we take from this?

The multitude of perspectives on corpus searches outlined in the four presentations has demonstrated that there are various open questions in the field.

Markus Forsberg, the director of Språkbanken Text, says that the fundamental question in this context is how much (richly annotated) data we want to be able to search in simultaneously, on the one hand, and selectively, on the other hand. Should it be just millions of tokens from one domain or billions (or more) from various domains? Do we always want to search in the whole dataset, or do we want to be able to select various subsets of the data before the search? And how homogeneous and comparable should these cross-domain corpora be, in that case? This calls for a further discussion of the rather problematic concept of a corpus, as defined by our tools.

Some of the demonstrated search interfaces were really nice, for example the ones presented by Špela and Johannes, says Markus, but one potential downside is that they are tuned to particular data collections. Generalization over many corpora calls for a compromise in favor of, for example, common information categories that make the corpora interoperable, which might also mean abandoning some of the specificities of each particular corpus. The question is how we should prioritize. Should we continue allowing broad searches in many corpora in Korp, as we do today, or opt for more specialization? Or can we do both?

This hints at the need to take a user perspective – do we know our users and their needs? What is a hypothetical ideal tool from their perspective? Špela Arhar Holdt believes "an ideal corpus search tool is one that ’ideally’ supports the user in their specific task. For some tasks, access to diverse types of corpora, advanced search functionalities, and various data visualization options is essential. However, in many scenarios, the simplicity of the search process and clarity of the user interface are more important than an abundance of features. For example, the primary audience for the concordancer I presented are primary and secondary school teachers. In our case, the ideal tool would address their typical needs without overburdening them with excessive functionality. The other two tools that were presented offer a wider variety of options, aiming to accommodate as many different search scenarios as possible. This is particularly useful for researchers who need a tool that can adapt to various (also) unforeseen research requirements."

This perspective – one tool for one corpus – highlights another problem, though, namely that we keep producing more or less the "same" implementations on repeat, which is inefficient in many respects. In Špela’s opinion, "we are, to some extent, reinventing the wheel, as concordancers and other methods for corpus visualization have been common in our community for decades. The fundamental concepts behind these tools are largely similar." She also draws attention to another problem, namely that we rarely encounter preexisting solutions that are high-quality, user-friendly, and available under an open license; often, one of these aspects is lacking.

One of the desired features for corpus searches is the possibility to search on the document level. This area is less researched and best practices are yet to be established, which is why we should put more effort into trying out different approaches and refining them. This is where Samir’s (Yousuf’s) work with Strix comes into the picture. In Samir’s words, the most prominent questions in this respect are (1) semantic search, where a lot of research is still necessary, and (2) ways of visualization, where a lot of experimentation is still ahead. Creativity, thinking outside the box and the application of LLMs are the way forward, in Samir’s opinion.

If we continue searching in large corpus collections (as in Korp and Strix), faster methods become critical. In Peter Ljunglöf’s words, abstracting from the standards offered by the corpus workbench and experimenting with other approaches is necessary to achieve the desired results. And why not create a new standard and replace the corpus workbench altogether, if necessary? That is essentially what he and his colleagues are working on.

Finally, in the words of Špela, "maybe the absence of a perfect tool doesn’t mean we can’t create one, especially with wider international cooperation"?

3. Open questions

We still have a large number of questions to consider when it comes to corpus searches, some of which are summarized below, as a reminder to all of us interested in this area:

1. Ideal search tool (both user experience and technical implementation):

a) Should such an ideal tool be tailored to a particular corpus and its individual characteristics, or should it allow searching lots of different corpora simultaneously?

b) Do we tend to reinvent the wheel, producing more or less the "same" implementations on repeat? Is that efficient?

c) What should be the compromise?

2. Who are the users? What do we know about their problems/issues, and are there any issues we haven’t addressed from their point of view?

3. What are the common points of the work on corpus search interfaces? Are there any opportunities for integration?

4. The current standard in the field is corpus workbench.

a) How important is corpus workbench? (and why)

b) What are its advantages and disadvantages?

c) Is corpus workbench restrictive in some ways? (e.g. conversions from other formats, and loss of information; dependence on tokens, etc.)

d) What are the possible motives for sticking with it? (e.g. with regard to the size of the data and the efficiency of searches)

5. Is there a need for a new standard for corpus searches? Is there a place for AI-based approaches to searches? Can we? Should we?

6. Currently, most corpus search tools are focused on annotations on a token level. But many who search are interested in larger text chunks, whole documents, text semantics or context.

a) How important is it to develop approaches to document searches?

b) How important is semantic search?

c) Would that type of search be possible with the corpus workbench?

7. Alignment and visualization.

a) The results of searches on the token level are visualized through KWIC (key word in context). How should results of document searches be visualized?

b) How should results of semantic searches be visualized?

8. To summarize: Which approach(es) should we advocate in corpus search in the future?
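For readers less familiar with the token-level KWIC view mentioned in question 7, the following is a minimal sketch in Python. It is not taken from Korp, Strix or any other tool discussed at the workshop; the function names `kwic` and `format_kwic` and the `window` parameter are assumptions made purely for illustration.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, match, right context) for each hit of keyword."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


def format_kwic(hits):
    """Right-align the left contexts so that all matches line up in one column."""
    width = max((len(left) for left, _, _ in hits), default=0)
    return [f"{left:>{width}}  [{match}]  {right}" for left, match, right in hits]


if __name__ == "__main__":
    text = "the cat sat on the mat".split()
    for line in format_kwic(kwic(text, "the", window=2)):
        print(line)
```

The sketch works because every hit is a single token with a fixed-size neighbourhood. A document-level or semantic hit has no such natural "line", which is one way to see why questions 7a and 7b remain open.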

4. Speakers’ bios and abstracts

Špela Arhar Holdt is a research associate at the Centre for Language Resources and Technologies (CJVT), University of Ljubljana. Her expertise lies in corpus linguistics, encompassing the development of openly accessible language resources and the establishment of efficient machine- and user-assisted lexicographic processes for modern Slovene. She co-authored more than 50 language resources at CLARIN.SI and holds a co-editor role for various lexical resources, including the Thesaurus of Modern Slovene, the Collocation Dictionary of Modern Slovene, and the Sloleks Morphological Lexicon. She is particularly interested in resources and tools designed for language learning. More information is on Špela’s homepage.

Title: A specialised concordancer for corpora with annotated language corrections

Abstract. In this presentation, we introduce a new specialized concordancer for corpora with annotated language corrections (e.g. learner and developmental corpora). Through a variety of search scenarios, we'll show the tool’s capabilities, emphasizing how users can easily search and examine features of both the learner and the corrected texts. The concordancer serves as a complement to the Svala annotation system and the newly proposed XML TEI guidelines for these corpora, bridging the gap between data annotation and analysis. Warmly welcome!

__________________

Johannes Graën built up the Language Technology section at LiRI, the Linguistic Research Infrastructure at the University of Zurich. Together with his team, he designed and implemented LCP, the LiRI Corpus Platform, which is a modular system for searching different types of corpora, from regular text corpora to multimodal ones. The core of LCP is a PostgreSQL database with an optimized representation of the respective corpus’ data and indices thereon, as well as a common backend that translates corpus queries into SQL queries matching the respective corpus’ schema. More information is on Johannes’ old home page at LiRI.

Title: LCP -- the LiRI Corpus Platform

Abstract. Over the past three years, we have been developing a novel technology for querying corpora of different kinds. We found that, although plenty of specialized tools are freely available (CWB, ANNIS, NoSketchEngine, etc.), none of them can effectively handle large corpora (> 1b tokens) or multimodal data (audio, video, images). Furthermore, the query languages supported by those tools vary in terms of expressiveness.

Our corpus platform LCP is designed to cater to the diverse needs of linguists and researchers from related fields. Its modular structure enables the creation of customized user interfaces on top of a shared infrastructure, while allowing users to import corpora with tailored structures.

__________________

Yousuf Ali Mohammed (Samir) is a research engineer at Språkbanken Text, Sweden, where he primarily works on the Strix platform, experimenting with novel approaches to document searches and visualisation. He has previously worked in the project L2 profiles as a systems developer. More information is on Samir’s webpage.

Title: Strix -- for wiser text visualisation

Abstract. Strix is a text visualisation tool currently developed at Språkbanken Text. This tool gives researchers, teachers, students and others who work on text data the opportunity to visualise the whole document (or text) together with the annotations on text level, sentence level and word level. The current version of Strix has a simple search functionality (word or phrase) and a filtering option based on the metadata attributes. Statistics on the corpus level and on the level of each document make it easy to analyse and understand the content in the data. Users can import and analyse their own collections of data in Strix through Mink, and they can also get a collection of similar documents. The long-term goal is to have all the open access data in Strix that is currently available in Korp.

__________________

Peter Ljunglöf is a researcher and lecturer in computational linguistics. He divides his time between Språkbanken and the Department of Computer Science and Engineering, Sweden. His main interest is language technology, i.e., how to get computers to understand human language. At the moment he mainly pursues two paths – grammar formalisms and quantitative text analysis. Peter’s interest in grammar formalisms is very general and can be summarised in the following question: "How grammars can be used to specify and encode not just linguistic syntax, but also other information?". His interest in text analysis is how it can be used for understanding socio-linguistic phenomena. See more on Peter’s webpage.

Title: Towards corpus algebra with precise semantics (...in collaboration with Nick Smallbone, Språkbanken Text & Niklas Deworetzki, Chalmers University of Technology)

Abstract. Researchers in digital humanities routinely work with text corpora, annotated text collections of up to billions of words. To find patterns in these corpora, they need search tools that can handle complex queries and huge amounts of text. But existing tools fail to perform well on complex queries, because of poor query optimisation. Query optimisation is hard because existing query languages are ad hoc and have no clear semantics.

We will create a principled foundation for corpus query languages: a well-behaved language, a precise semantics and clear algebraic properties that can be used for query optimisation. We will use this to develop practical algorithms for efficiently searching very large text corpora.

Inspired by relational algebra, we propose a corpus algebra with a precise semantics. Queries are compiled into corpus algebra, then transformed using algebraic laws into a more efficient form and executed.

In the project we will address the following research problems:

1. What is a suitable query language for corpus algebra?
2. Which query operators have a well-behaved semantics?
3. What laws can we use to optimise corpus algebra expressions?
4. What search indexes are useful, and how can they be incorporated into corpus algebra?

The algorithms we produce will find optimal query plans that use available search indexes well. Hence we expect that complex queries will run in seconds, compared to several minutes today. This will open up new research fronts in digital humanities.
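To give a rough feel for the "compile, rewrite with algebraic laws, execute" idea in Peter’s abstract, here is a toy sketch in Python. It is not the project’s corpus algebra: the operators `Lookup` and `And`, the sample data in `CORPUS`, and the two laws used (idempotence and commutativity of intersection) are all assumptions chosen for illustration only.

```python
from dataclasses import dataclass

# Hypothetical toy corpus: one dict of annotations per token position.
CORPUS = [
    {"word": "the", "pos": "DET"},
    {"word": "run", "pos": "NOUN"},
    {"word": "they", "pos": "PRON"},
    {"word": "run", "pos": "VERB"},
]


def build_index(corpus):
    """Inverted index: (attribute, value) -> set of token positions."""
    index = {}
    for position, token in enumerate(corpus):
        for attr, value in token.items():
            index.setdefault((attr, value), set()).add(position)
    return index


@dataclass(frozen=True)
class Lookup:          # all positions where attr == value
    attr: str
    value: str


@dataclass(frozen=True)
class And:             # intersection of two position sets
    left: object
    right: object


def estimate(expr, index):
    """Upper bound on the number of positions an expression can return."""
    if isinstance(expr, Lookup):
        return len(index.get((expr.attr, expr.value), ()))
    return min(estimate(expr.left, index), estimate(expr.right, index))


def optimise(expr, index):
    """Rewrite an expression using two algebraic laws of intersection."""
    if isinstance(expr, Lookup):
        return expr
    left, right = optimise(expr.left, index), optimise(expr.right, index)
    if left == right:                                   # idempotence: A & A = A
        return left
    if estimate(right, index) < estimate(left, index):  # commutativity: A & B = B & A
        left, right = right, left                       # most selective operand first
    return And(left, right)


def evaluate(expr, index):
    """Execute an expression against the index."""
    if isinstance(expr, Lookup):
        return set(index.get((expr.attr, expr.value), ()))
    left = evaluate(expr.left, index)
    if not left:                                        # short-circuit on an empty side
        return set()
    return left & evaluate(expr.right, index)


if __name__ == "__main__":
    index = build_index(CORPUS)
    query = And(Lookup("word", "run"), Lookup("pos", "VERB"))
    plan = optimise(query, index)
    print(plan)
    print(evaluate(plan, index))
```

The point of the sketch is that a rewrite is only safe when the law behind it is known to hold, which is where a precise semantics comes in: the optimised plan must return exactly the same positions as the original query.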

Photos by Elena Volodina.
