
Events

 


April 23, 2024: Approaches to Corpus Searches

Two guest talks (30 min each), two in-house mini-talks (5-10 min each) and a panel discussion (≈30 min)

  • Time: 10:15-12:00
  • Place: University of Gothenburg, Humanisten. Room J233

Talks:

Špela Arhar Holdt, University of Ljubljana, Slovenia

Title: A specialised concordancer for corpora with annotated language corrections
 
Abstract: In this presentation, we introduce a new specialised concordancer for corpora with annotated language corrections (e.g. learner and developmental corpora). Through a variety of search scenarios, we show the tool's capabilities, emphasising how users can easily search and examine features of both the learner texts and the corrected texts. The concordancer complements the Svala annotation system and the newly proposed XML TEI guidelines for these corpora, bridging the gap between data annotation and analysis. A warm welcome!

Johannes Graën, University of Zurich, Switzerland

Title: LCP -- the LiRI Corpus Platform

Abstract: Over the past three years, we have been developing a novel technology for querying corpora of different kinds. We found that, although plenty of specialized tools are freely available (CWB, ANNIS, NoSketchEngine, etc.), none of them handles large corpora (> 1 billion tokens) or multimodal data (audio, video, images) effectively. Furthermore, the query languages supported by those tools vary in expressiveness.

Our corpus platform LCP is designed to cater to the diverse needs of linguists and researchers from related fields. Its modular structure enables the creation of customized user interfaces on top of a shared infrastructure, while allowing users to import corpora with tailored structures.

Yousuf Ali Mohammed, Språkbanken Text, Sweden

Title: Strix -- for wiser text visualisation

Abstract: Strix is a text visualisation tool currently being developed at Språkbanken Text. It lets researchers, teachers, students and others who work with text data visualise a whole document (or text) together with its annotations at text level, sentence level and word level. The current version of Strix has a simple search function (word or phrase) and filtering options based on metadata attributes. Statistics at corpus level and at document level make it easy to analyse and understand the content of the data. Users can import and analyse their own data collections in Strix through Mink, and they can also retrieve a collection of similar documents. The long-term goal is to make all the open-access data currently available in Korp available in Strix as well.

Peter Ljunglöf, Språkbanken Text, Sweden

...in collaboration with Nick Smallbone, Språkbanken Text & Niklas Deworetzki, Chalmers University of Technology

Title: Towards corpus algebra with precise semantics  

Abstract: Researchers in digital humanities routinely work with text corpora, annotated text collections of up to billions of words. To find patterns in these corpora, they need search tools that can handle complex queries and huge amounts of text. But existing tools fail to perform well on complex queries, because of poor query optimisation. Query optimisation is hard because existing query languages are ad hoc and have no clear semantics.

We will create a principled foundation for corpus query languages: a well-behaved language, a precise semantics and clear algebraic properties that can be used for query optimisation. We will use this to develop practical algorithms for efficiently searching very large text corpora.

Inspired by relational algebra, we propose a corpus algebra with a precise semantics. Queries are compiled into corpus algebra, then transformed using algebraic laws into a more efficient form and executed.

In the project we will address the following research problems:

1. What is a suitable query language for corpus algebra?
2. Which query operators have a well-behaved semantics?
3. What laws can we use to optimise corpus algebra expressions?
4. What search indexes are useful, and how can they be incorporated into corpus algebra?

The algorithms we produce will find optimal query plans that use available search indexes well. Hence we expect that complex queries will run in seconds, compared to several minutes today. This will open up new research fronts in digital humanities.

Panel with the four speakers on approaches to corpus searches 

Moderator: Elena Volodina

 


2024

Autumn-2024:


Spring-2024:

  • January 10-11: Huminfra Conference (HiC 2024)
  • April 21-28: Visit from Slovenia and Switzerland to continue working on corpus search interfaces and similar

2023

Autumn-2023:

  • October 21-28: Visit from France for exchange on language profiling tools and approaches (Prof. Nuaria Gala)
  • Visit from Slovenia for exchange of experiences on tools for annotation of learner essays and L2-specific search tools (Špela Arhar Holdt and others)
  • Collaboration with Finland for training researchers from a new project to use annotation tools for L2 essay annotation, pseudonymization, annotation management and database storage (Therese Lindström Tiedemann and others)

Spring-2023:


 

2022

Theme day on Digital Humanities at the Faculty of Humanities.

  • Date: November 7, 2022
  • Organizers: CDH
  • Participants: CDH, Språkbanken Text, LIR, HDK, UB, Department of Historical Studies, Department of Cultural Sciences

Talks:

  • Jonas Ingvarsson – Welcome, and The new order of criticism
  • Mats Fridlund – Semi-automatic history
  • Rachel Pierce – UB - KvinnSam at UB as a resource
  • Rachel Pierce – UB - Research across the boundary between digital and physical archive collections
  • Johan Karlsson – UB's cultural heritage collections
  • Daniel Brodén – LIR and CDH - SweClarin
  • Karin Wagner (zoom) – Aesthetics and CDH - Digitised herbaria
  • Niklas Zechner – SBX - Tools for analysing word frequencies
  • Elena Volodina – SBX - Digital tools for analysing Swedish as a second language
  • Johan Ling – Historical Studies – Svensk HällristningsForskningsarkiv (Swedish Rock Art Research Archives), SHFA
  • Olle Essvik – HDK – The Computer as Seen at the End of Human age
  • Mats Fridlund and Daniel Brodén – Interdisciplinary multimodal terrorism studies