April 23, 2024: Approaches to Corpus Searches
Two guest talks (30 min each), two in-house mini-talks (5-10 min each) and a panel discussion (approx. 30 min)
- Time: 10:15-12:00
- Place: University of Gothenburg, Humanisten, room J233
Talks:
Špela Arhar Holdt, University of Ljubljana, Slovenia
Johannes Graën, University of Zurich, Switzerland
Title: LCP -- the LiRI Corpus Platform
Abstract: Over the past three years, we have been developing a novel technology for querying corpora of different kinds. We found that, although plenty of specialized tools are freely available (CWB, ANNIS, NoSketchEngine, etc.), none of them handles large corpora (> 1 billion tokens) or multimodal data (audio, video, images) effectively. Furthermore, the query languages supported by those tools vary in expressiveness.
Our corpus platform LCP is designed to cater to the diverse needs of linguists and researchers from related fields. Its modular structure enables the creation of customized user interfaces on top of a shared infrastructure, while allowing users to import corpora with tailored structures.
Yousuf Ali Mohammed, Språkbanken Text, Sweden
Title: Strix -- for wiser text visualisation
Abstract: Strix is a text visualisation tool currently being developed at Språkbanken Text. It lets researchers, teachers, students and others who work with text data visualise a whole document (or text) together with its annotations at text, sentence and word level. The current version of Strix has simple search functionality (word or phrase) and filtering based on metadata attributes. Statistics at corpus level and at document level make it easy to analyse and understand the content of the data. Users can import and analyse their own data collections in Strix through Mink, and they can also retrieve a collection of similar documents. The long-term goal is to make all the open-access data currently available in Korp available in Strix as well.
Peter Ljunglöf, Språkbanken Text, Sweden
...in collaboration with Nick Smallbone, Språkbanken Text & Niklas Deworetzki, Chalmers University of Technology
Title: Towards corpus algebra with precise semantics
Abstract: Researchers in digital humanities routinely work with text corpora, annotated text collections of up to billions of words. To find patterns in these corpora, they need search tools that can handle complex queries and huge amounts of text. But existing tools fail to perform well on complex queries, because of poor query optimisation. Query optimisation is hard because existing query languages are ad hoc and have no clear semantics.
We will create a principled foundation for corpus query languages: a well-behaved language, a precise semantics and clear algebraic properties that can be used for query optimisation. We will use this to develop practical algorithms for efficiently searching very large text corpora.
Inspired by relational algebra, we propose a corpus algebra with a precise semantics. Queries are compiled into corpus algebra, then transformed using algebraic laws into a more efficient form and executed.
In the project we will address the following research problems:
1. What is a suitable query language for corpus algebra?
2. Which query operators have a well-behaved semantics?
3. What laws can we use to optimise corpus algebra expressions?
4. What search indexes are useful, and how can they be incorporated into corpus algebra?
The algorithms we produce will find optimal query plans that use available search indexes well. Hence we expect that complex queries will run in seconds, compared to several minutes today. This will open up new research fronts in digital humanities.
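To make the idea of optimising a query through algebraic laws concrete, here is a minimal Python sketch. It is not the project's corpus algebra: the toy corpus, the `Lookup` and `And` operators and the `optimise` function are all hypothetical names chosen for illustration. It only shows that, because intersection is associative and commutative, a conjunction of index lookups can be flattened and reordered so that the smallest position set is intersected first, without changing the result.

```python
from dataclasses import dataclass

# Toy corpus (hypothetical data): one dict of annotations per token.
CORPUS = [
    {"word": "the", "lemma": "the", "pos": "DET"},
    {"word": "old", "lemma": "old", "pos": "ADJ"},
    {"word": "man", "lemma": "man", "pos": "NOUN"},
    {"word": "the", "lemma": "the", "pos": "DET"},
    {"word": "boats", "lemma": "boat", "pos": "NOUN"},
    {"word": "man", "lemma": "man", "pos": "VERB"},
]


def build_index(corpus):
    """Inverted index: (attribute, value) -> set of token positions."""
    index = {}
    for position, token in enumerate(corpus):
        for attribute, value in token.items():
            index.setdefault((attribute, value), set()).add(position)
    return index


@dataclass(frozen=True)
class Lookup:
    """Positions where an attribute has a given value."""
    attribute: str
    value: str


@dataclass(frozen=True)
class And:
    """Intersection of one or more sub-expressions."""
    operands: tuple


def evaluate(expr, index):
    """Return the set of token positions matching the expression."""
    if isinstance(expr, Lookup):
        return index.get((expr.attribute, expr.value), set())
    result = None
    for operand in expr.operands:
        positions = evaluate(operand, index)
        result = positions if result is None else result & positions
        if not result:
            break  # an empty intersection cannot grow again
    return result


def optimise(expr, index):
    """Flatten nested And (associativity) and put the smallest
    lookups first (commutativity)."""
    if isinstance(expr, Lookup):
        return expr
    flat = []
    for operand in expr.operands:
        operand = optimise(operand, index)
        if isinstance(operand, And):
            flat.extend(operand.operands)
        else:
            flat.append(operand)
    flat.sort(key=lambda lookup: len(evaluate(lookup, index)))
    return And(tuple(flat))


index = build_index(CORPUS)
query = And((
    Lookup("word", "man"),
    And((Lookup("pos", "VERB"), Lookup("lemma", "man"))),
))
plan = optimise(query, index)
print(plan.operands[0])
print(evaluate(query, index) == evaluate(plan, index))
```

In this sketch the rewritten plan starts with the rarest lookup, so later intersections work on a smaller set. A real system would also need laws for sequences, structural constraints and repetition, which this example does not attempt to model.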
Panel with the four speakers on approaches to corpus searches
Moderator: Elena Volodina
2024
Autumn-2024:
Spring-2024:
- January 10-11: Huminfra Conference (HiC 2024)
- April 21-28: Visit from Slovenia and Switzerland to continue working on corpus search interfaces and related topics
2023
Autumn-2023:
- October 21-28: Visit from France for exchange on language profiling tools and approaches (Prof. Núria Gala)
- Visit from Slovenia for exchange of experiences on tools for annotation of learner essays and L2-specific search tools (Špela Arhar Holdt and others)
- Collaboration with Finland for training researchers from a new project to use annotation tools for L2 essay annotation, pseudonymization, annotation management and database storage (Therese Lindström Tiedemann and others)
Spring-2023:
- Språkbanken Text - HumInfra workshop on Profiling second language vocabulary and grammar
2022
Theme day on Digital Humanities at the Faculty of Humanities
- Date: November 7, 2022
- Organizers: CDH
- Participants: CDH, Språkbanken Text, LIR, HDK, UB, Department of Historical Studies, Department of Cultural Sciences
Talks:
- Jonas Ingvarsson – Welcome and The new order of criticism
- Mats Fridlund – Semi-automatic history
- Rachel Pierce – UB - KvinnSam at UB as a resource
- Rachel Pierce – UB - Research across the boundary between digital and physical archival collections
- Johan Karlsson – UB's cultural heritage collections
- Daniel Brodén – LIR and CDH - SweClarin
- Karin Wagner (via Zoom) – Aesthetics and CDH - Digitised herbaria
- Niklas Zechner – SBX - Tools for analysing word frequencies
- Elena Volodina – SBX - Digital tools for analysing Swedish as a second language
- Johan Ling – Historical Studies – Svensk HällristningsForskningsarkiv, SHFA
- Olle Essvik – HDK – The Computer as Seen at the End of Human age
- Mats Fridlund and Daniel Brodén – Interdisciplinary multimodal terrorism studies