Språkbanken Text is a department within Språkbanken.

Approaches to corpus searches - a mini-workshop at Språkbanken Text

Submitted by Elena Volodina on 2024-05-21

1. Workshop in a nutshell

On April 23, 2024, Språkbanken Text and Huminfra organized a well-attended mini-workshop, Approaches to corpus searches. The workshop was devoted to showcasing new perspectives on corpus searches from different angles: technical, user-oriented, and visualization. Four invited experts gave presentations: Špela Arhar Holdt from Slovenia, Johannes Graën from Switzerland, and Yousuf Ali Mohammed and Peter Ljunglöf from Sweden.

Figure 1: The audience at the mini-workshop.

Špela’s talk introduced the audience to a tool designed for teachers of Slovene, with support for pedagogically relevant types of searches in one specific corpus. The tool offers a number of very attractive functionalities and can display learner-written essays in parallel, the original version alongside its correction, with the possibility to define searches in either of them.

Figure 2: Špela Arhar Holdt is giving her presentation.

Johannes outlined the bigger picture of infrastructural components that have recently been implemented at the University of Zurich, with a focus on one particular corpus search platform, LCP, the LiRI Corpus Platform. The strength of the platform is that it allows searching multiple corpora in multiple modalities and supports queries in a new query language, i.e. not in CQL (Corpus Query Language), which is the current standard in the field. Johannes demonstrated several searches and outlined future scenarios.

Figure 3: Johannes Graën is giving a presentation.

Figure 4: Yousuf Ali Mohammed is talking about Strix, Språkbanken Text’s corpus search tool for searches on the document level.

Yousuf Ali Mohammed (Samir) presented the current version of Strix for searches on the document level, demonstrating various new visualization techniques, such as word rain for exploring word prominence and topicality of a specific text, and visualization of statistics.

 

2. What can we take from this?

The multitude of perspectives on corpus searches outlined in the four presentations has demonstrated that there are various open questions in the field.

Markus Forsberg, the director of Språkbanken Text, says that the fundamental question in this context is how much (richly annotated) data we want to be able to search in simultaneously, on the one hand, and selectively, on the other hand. Should it be just millions of tokens from one domain or billions (or more) from various domains? Do we always want to search in the whole dataset, or do we want to be able to select various subsets of the data before the search? And how homogeneous and comparable should these cross-domain corpora be, in that case? This calls for further discussion of the rather problematic concept of a corpus, as defined by our tools.

Some of the demonstrated search interfaces were really nice, for example, the ones presented by Špela and Johannes, says Markus, but one potential downside is that they are tuned to particular data collections. Generalization over many corpora calls for a compromise in favor of, for example, common information categories that make the corpora interoperable, which might also mean abandoning some of the specificities of each particular corpus. The question is how we should prioritize. Should we continue allowing broad searches in many corpora in Korp, as we do today, or opt for more specialization? Or can we do both?

This hints at the need to take a user perspective – do we know our users and their needs? What is a hypothetical ideal tool from their perspective? Špela Arhar Holdt believes "an ideal corpus search tool is one that ’ideally’ supports the user in their specific task. For some tasks, access to diverse types of corpora, advanced search functionalities, and various data visualization options is essential. However, in many scenarios, the simplicity of the search process and clarity of the user interface are more important than an abundance of features. For example, the primary audience for the concordancer I presented are primary and secondary school teachers. In our case, the ideal tool would address their typical needs without overburdening them with excessive functionality. The other two tools that were presented offer a wider variety of options, aiming to accommodate as many different search scenarios as possible. This is particularly useful for researchers who need a tool that can adapt to various (also) unforeseen research requirements."

This perspective – one tool for one corpus – highlights another problem, though, namely that we are more or less doing the "same" implementations on repeat, which is inefficient in many respects. In Špela’s opinion, "we are, to some extent, reinventing the wheel, as concordancers and other methods for corpus visualization have been common in our community for decades. The fundamental concepts behind these tools are largely similar." She also draws attention to another problem, namely that we rarely encounter preexisting solutions that are high-quality, user-friendly, and available under an open license; often, one of these aspects is lacking.

One of the desired features for corpus searches is the possibility to search on the document level. This area is less researched and best practices are yet to be established, which is why we should put more effort into trying out different approaches and refining them. This is where Samir’s (Yousuf’s) work with Strix comes into the picture. In Samir’s words, the most prominent questions in this respect are (1) semantic search, where a lot of research is still necessary, and (2) ways of visualization, where a lot of experimentation is still ahead. Creativity, thinking outside the box and the application of LLMs are the way forward, in Samir’s opinion.

If we continue searching in large corpus collections (like in Korp and Strix), faster methods become critical. In Peter Ljunglöf’s words, abstracting from the standards offered by the corpus workbench and experimenting with other approaches is necessary to achieve the desired results. And why not create a new standard and replace the corpus workbench altogether (if necessary)? That is essentially what he and his colleagues are working on.

Finally, in the words of Špela, "maybe the absence of a perfect tool doesn’t mean we can’t create one, especially with wider international cooperation"?

3. Open questions

We still have a large number of questions to consider when it comes to corpus searches, some of which are summarized below, as a reminder to all of us interested in this area:

1. Ideal search tool (both user experience and technical implementation):

a) Should such an ideal tool be tailored to a particular corpus and its individual characteristics, or should it allow searching lots of different corpora simultaneously?

b) Do we tend to reinvent the wheel, producing essentially the "same" implementations on repeat? Is that efficient?

c) What should be the compromise?

2. Who are the users? What do we know about their problems/issues and are there any issues you haven’t addressed from their point of view?

3. What are the common points of the work on corpus search interfaces? Are there any opportunities for integration?

4. The current standard in the field is corpus workbench.

a) How important is corpus workbench? (and why)

b) What are its advantages and disadvantages?

c) Is corpus workbench restrictive in some ways? (e.g. conversions from other formats, and loss of information; dependence on tokens, etc.)

d) What are the possible motives for sticking with it? (e.g. with regard to the size of the data and the efficiency of searches)

5. Is there a need for a new standard for corpus searches? Is there a place for AI-based approaches to searches? Can we? Should we?

6. Currently, most corpus search tools are focused on annotations on a token level. But many who search are interested in larger text chunks, whole documents, text semantics or context.

a) How important is it to develop approaches to document searches?

b) How important is semantic search?

c) Would that type of search be possible with the corpus workbench?

7. Alignment and visualization.

a) The results of searches on the token level are visualized through KWIC (key word in context). How should results of document searches be visualized?

b) How should results of semantic searches be visualized?

8. To summarize: Which approach(es) should we advocate in corpus search in the future? 
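
For readers less familiar with the KWIC view mentioned in question 7, the following is a minimal Python sketch of the idea: each hit is shown with a few tokens of left and right context. The function name `kwic`, the `width` parameter, the naive regex tokenization and the sample sentence are all assumptions made for illustration; this is not how Korp, Strix or the corpus workbench implement concordances.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit.

    Hypothetical helper for illustration only: tokenization is a naive
    regex split and matching is case-insensitive exact token equality.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    rows = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


# Invented sample text, not taken from any real corpus.
sample = "The corpus is large. Searching a corpus quickly requires an index."
rows = kwic(sample, "corpus", width=2)

# Print the hits aligned on the keyword, as a concordance would.
for left, kw, right in rows:
    print(f"{left:>20} | {kw} | {right}")
```

A token-level view like this has no obvious counterpart for whole documents or for semantic matches, which is what questions 7a and 7b ask about.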

4. Speakers’ bios

Špela Arhar Holdt is a research associate at the Centre for Language Resources and Technologies (CJVT), University of Ljubljana. Her expertise lies in corpus linguistics, encompassing the development of openly accessible language resources and the establishment of efficient machine- and user-assisted lexicographic processes for modern Slovene. She co-authored more than 50 language resources at CLARIN.SI and holds a co-editor role for various lexical resources, including the Thesaurus of Modern Slovene, the Collocation Dictionary of Modern Slovene, and the Sloleks Morphological Lexicon. She is particularly interested in resources and tools designed for language learning. More information is on Špela’s homepage.

Johannes Graën built up the Language Technology section at LiRI, the Linguistic Research Infrastructure at the University of Zurich. Together with his team, he designed and implemented LCP, the LiRI Corpus Platform, which is a modular system for searching different types of corpora, from regular text corpora to multimodal ones. The core of LCP is a PostgreSQL database with an optimized representation of the respective corpus’ data and indices thereon, as well as a common backend that translates corpus queries into SQL queries matching the respective corpus’ schema. More information is on Johannes’ old home page at LiRI.

Yousuf Ali Mohammed (Samir) is a research engineer at Språkbanken Text, Sweden, where he primarily works on the Strix platform, experimenting with novel approaches to document searches and visualization. He has previously worked in the project L2 profiles as a systems developer. More information is on Samir’s webpage.

Peter Ljunglöf is a researcher and lecturer in computational linguistics. He divides his time between Språkbanken and the Department of Computer Science and Engineering, Sweden. His main interest is language technology, i.e., how to get computers to understand human language. At the moment he mainly pursues two paths – grammar formalisms and quantitative text analysis. Peter’s interest in grammar formalisms is very general and can be summarised in the following question: "How grammars can be used to specify and encode not just linguistic syntax, but also other information?". His interest in text analysis is how it can be used for understanding socio-linguistic phenomena. See more on Peter’s webpage.

Photos by Elena Volodina.