When COVID-19 became a pandemic in March 2020, vast amounts of time-stamped written data became available: news and newspaper reports, scientific articles, social media posts (e.g. blogs and Twitter), surveys, and other information about the virus and its symptoms, prevention, management and transmission.
Artificial intelligence systems dealing with (human) natural language rely on language models, that is, predictions of which words occur together. To better understand how such models work, and where they fail, when applied to Swedish texts, we need Swedish test data.
On 25-27 November, the eighth edition of SLTC, the Swedish Language Technology Conference, took place at Humanisten here in Gothenburg. Or rather, it would have, had a certain virus not put a stop to it.
We at Språkbanken Text have just released a new corpus of native (L1) and non-native (L2) speech in four languages: English, Spanish, French and Italian. The corpus contains more than 170 million words produced by more than 97 thousand speakers (size varies a lot across the four languages, though).
This blog post is based on the author's (Elena Volodina's) joint research with Yousuf (Samir) Ali Mohammed, Arild Matsson, Beáta Megyesi and Sandra Derbring.
When we think of words, we usually think of units that are surrounded by spaces in text: 'huset' (the house), 'superstor' (super big), 'bloggade' (blogged). Most people would probably say that 'idag' (today) is one word, but what if we write it 'i dag', which is also a correct spelling? What about 'Mont Blanc-tunneln' (the Mont Blanc tunnel)? 'Röda blodkroppar' (red blood cells)?
(This blog post is based on joint research and a publication in collaboration with David Alfter, Therese Lindström Tiedemann, Maisa Lauriala and Daniela Piipponen.)
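To make the question concrete, here is a minimal Python sketch of the naive "a word is whatever stands between spaces" definition, applied to the examples above. The function name `whitespace_tokens` is made up for this illustration and is not part of any Språkbanken tool.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Split text into the units that are separated by whitespace."""
    return text.split()


# The examples from the paragraph above.
examples = ["idag", "i dag", "Mont Blanc-tunneln", "Röda blodkroppar"]

# Count how many whitespace-delimited units each expression contains.
token_counts = {phrase: len(whitespace_tokens(phrase)) for phrase in examples}

for phrase, count in token_counts.items():
    print(f"{phrase!r}: {count} whitespace-delimited unit(s)")
```

Under this definition 'idag' counts as one unit and 'i dag' as two, even though both spellings express the same thing, which is exactly why whitespace alone is a shaky basis for deciding what a word is.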
Lars Borin, Anju Saxena, Shafqat Mumtaz Virk, Bernard Comrie
South Asia – comprising the seven countries Pakistan, India, Nepal, Bhutan, Bangladesh, Sri Lanka, and the Maldives, as well as immediately adjacent areas of neighboring countries (parts of Afghan
Sweden has a relatively long tradition of creating a type of corpus usually called a treebank. A treebank is a collection of texts that have been annotated (marked up) with parts of speech and syntactic structure. The syntactic structure of a sentence can be drawn so that it resembles a tree.
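As a rough illustration of what such an annotation can look like as data, the Python sketch below stores one tiny sentence as a tree and prints it with indentation. The sentence, the `Node` class, and the labels (S, NP, VP, NN, VB) are assumptions made for this example. They do not follow the annotation scheme of any particular Swedish treebank.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    label: str                   # phrase label or part-of-speech tag
    word: Optional[str] = None   # set only on leaf nodes
    children: list["Node"] = field(default_factory=list)


def render(node: Node, depth: int = 0) -> list[str]:
    """Return the tree as indented lines, one node per line."""
    text = node.label if node.word is None else f"{node.label} {node.word}"
    lines = ["  " * depth + text]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


def tagged_words(node: Node) -> list[tuple[str, str]]:
    """Collect (word, part-of-speech) pairs from the leaves, left to right."""
    if node.word is not None:
        return [(node.word, node.label)]
    return [pair for child in node.children for pair in tagged_words(child)]


# Hypothetical annotation of "Katten sover" ("The cat sleeps").
tree = Node("S", children=[
    Node("NP", children=[Node("NN", word="Katten")]),
    Node("VP", children=[Node("VB", word="sover")]),
])

print("\n".join(render(tree)))
print(tagged_words(tree))
```

The leaves carry the part-of-speech annotation and the inner nodes carry the syntactic structure, so both layers mentioned above live in the same object.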