Collections
Die Große Transformation
Optional description not given for resource Die Große Transformation
Austrian Centre for Digital Humanities and Cultural Heritage - A Resource Centre for the HumanitiEs
German
Edierte Verlassenschaftsabhandlungen Geistlicher aus dem mittleren Pustertal
Optional description not given for resource Edierte Verlassenschaftsabhandlungen Geistlicher aus dem mittleren Pustertal
Austrian Centre for Digital Humanities and Cultural Heritage - A Resource Centre for the HumanitiEs
German
Karl Baedeker: Das Mittelmeer. Handbuch für Reisende: Digital Edition
Optional description not given for resource Karl Baedeker: Das Mittelmeer. Handbuch für Reisende: Digital Edition
Austrian Centre for Digital Humanities and Cultural Heritage - A Resource Centre for the HumanitiEs
German
Karl Baedeker: Indien. Handbuch für Reisende: Digital Edition
Optional description not given for resource Karl Baedeker: Indien. Handbuch für Reisende: Digital Edition
Austrian Centre for Digital Humanities and Cultural Heritage - A Resource Centre for the HumanitiEs
German
Karl Baedeker: Konstantinopel und Kleinasien. Handbuch für Reisende: Digital Edition
Optional description not given for resource Karl Baedeker: Konstantinopel und Kleinasien. Handbuch für Reisende: Digital Edition
Austrian Centre for Digital Humanities and Cultural Heritage - A Resource Centre for the HumanitiEs
German
Karl Baedeker: Nordamerika. Handbuch für Reisende: Digital Edition
Optional description not given for resource Karl Baedeker: Nordamerika. Handbuch für Reisende: Digital Edition
Austrian Centre for Digital Humanities and Cultural Heritage - A Resource Centre for the HumanitiEs
German
Karl Baedeker: Palästina und Syrien. Handbuch für Reisende: Digital Edition
Optional description not given for resource Karl Baedeker: Palästina und Syrien. Handbuch für Reisende: Digital Edition
Austrian Centre for Digital Humanities and Cultural Heritage - A Resource Centre for the HumanitiEs
German
Orts-, Personen- und Bücherregister
Optional description not given for resource Orts-, Personen- und Bücherregister
Austrian Centre for Digital Humanities and Cultural Heritage - A Resource Centre for the HumanitiEs
German
Statistische Auswertungen von Verfachbüchern mit Inventaren
Optional description not given for resource Statistische Auswertungen von Verfachbüchern mit Inventaren
Austrian Centre for Digital Humanities and Cultural Heritage - A Resource Centre for the HumanitiEs
German
Volltexte der Verfachbücher
Optional description not given for resource Volltexte der Verfachbücher
Austrian Centre for Digital Humanities and Cultural Heritage - A Resource Centre for the HumanitiEs
German
aGender – Homepage
The aGender speech corpus contains speech samples recorded over public telephone lines, with read and (semi-)spontaneous speech. Native German speakers called a voice portal from their private phones, read text, and answered some open questions. The purpose of the corpus is the automatic detection of gender and/or age (7 mixed classes ranging from 7 to 80 years). The corpus contains the voices of 945 German speakers (approx. minimum of 100 speakers per class), each delivering 18 speech items in up to six different sessions. Neither the time and date of the individual recording sessions nor the total number of sessions per speaker was controlled. The audio signal was recorded over standard cell phones (GSM standard) and landline connections in 8000 Hz, 8-bit A-law format. Data were then expanded to 8000 Hz, 16-bit PCM (13 bits are valid!). The selection of speakers is approximately evenly distributed over the seven target classes, with class 1 also being balanced for gender. The read material consists of an altered version of the SpeechDat text material, containing short fixed and free text typical for automated call centers. A typical utterance is about 2 seconds long, but some utterances are between 3 and 6 seconds long. In total, the corpus consists of 47 hours of speech. Two sets were defined on the data: a training set (81.5%) and a test set (175 speakers, 25 per class, 18.5%), with disjoint speaker sets. For the test set no class information is given in this corpus. Refer to Section Evaluation on how to receive an evaluation from Telekom Labs. Users of this speech corpus are required to report any scientific publications based on these data to Felix Burkhardt (Felix.Burkhardt@telekom.de).
Bayerisches Archiv für Sprachsignale
German
Audioatlas Siebenbuergisch-Saechsischer Dialekte – Homepage
Large set of 2274 recordings (approx. 360 h) of spoken dialectal German (Transylvanian Saxon) recorded in Transylvania (Romania) in approx. 250 different locations. This previously unpublished material was collected on analog tape in the 1960s and 70s by different linguists based at the universities of Bucharest, Hermannstadt and Klausenburg. The tapes were later digitized and, in 2009, with the kind support of Prof. Dr. Stefan Sienerth, director of the Institut für deutsche Kultur und Geschichte Südosteuropas (IKGS), transferred to the Institute for Romance Studies (Prof. Thomas Krefeld; http://www.romanistik.uni-muenchen.de/personen/professoren/krefeld/index.html) and the LMU Center for Digital Humanities (IT-Gruppe Geisteswissenschaften [ITG; http://www.itg.lmu.de]; Dr. Stephan Lücke, Emma Mages), both of the Ludwig-Maximilians-Universität München (LMU). The corpus comprises different recording strategies and discourse types: on the one hand the classic German Wenker sentences, on the other hand fairy tales, song texts and free story-telling. The corpus thus provides not only historical linguistic data but also input for ethnographic and historical disciplines. Since the age of the informants varies over a large range, this adds another dimension, which is reflected in the metadata of the corpus. Further corpus features: geo-referencing of all recording sites; phonetic transcription of Wenker sentence recordings; orthographic transcription of spontaneous speech recordings (approx. 450,000 words); (partly) phonetic transcription of spontaneous speech; semantic labelling (ontology); extension to Middle Bavarian recordings from the area Wassertal/Oberwischau. The ASD corpus can also be accessed at http://www.asd.gwi.uni-muenchen.de/, a dedicated website providing numerous kinds of tools, analytic approaches and visualisation. In 2016 the present corpus version was created at the BAS CLARIN center of the Ludwig-Maximilians-Universität München for indefinite archiving and distribution.
Extensive audio documentation of Transylvanian Saxon dialects comprising more than 360 hours of spoken language from approx. 250 different localities, stored in a total of 2274 audio files. This unique, previously unpublished material was collected mainly in the late 1960s and early 1970s by linguists from various Romanian universities (Bucharest, Hermannstadt, Klausenburg) and recorded on tape; digital versions of consistently good acoustic quality were created from these tapes. In 2009 these digital versions were handed over to LMU through the mediation and with the support of Prof. Dr. Stefan Sienerth, then director of the Institut für deutsche Kultur und Geschichte Südosteuropas (IKGS). The documentation covers different elicitation strategies and discourse types: on the one hand, the "classic" Wenker sentences of German dialectology were elicited in most localities; on the other hand, fairy tales, songs and, above all, numerous more or less free narratives are represented. It therefore serves ethnographic and contemporary-history interests as well as linguistic ones. The widely varying age of the informants adds a further dimension of variation, which can also be queried specifically.
Further corpus features are: georeferencing of the recording sites; phonetic transcription (IPA) of the Wenker sentence recordings; near-standard orthographic transcription of spontaneous speech texts (475,000 words, corresponding to more than 300 standard pages); phonetic transcription of spontaneous speech; in-depth content indexing by keywords ("ontology"); geographic and linguistic extension with Middle Bavarian material from Wassertal/Oberwischau. The present version of the ASD corpus was archived in 2016 at the BAS CLARIN center of the Ludwig-Maximilians-Universität München.
Bayerisches Archiv für Sprachsignale
Bavarian, German, Romanian, Undetermined
BAS AbsolventInnen – Homepage
The Absolventinnen corpus was recorded during the summer semester 2019 within the scope of the Master's thesis "Aussprachestrategien geschlechtsneutraler Nomina im Deutschen" by Korbinian Slavik at the Institute of Phonetics and Speech Processing (IPS), LMU Munich. It is the follow-up to the SprecherInnen corpus, which is also part of the BAS CLARIN Repository. Its purpose is to provide data for examining the pronunciation of gender-neutral forms in German in a main study. To date, there is a wide variety of written forms for expressing different biological genders in one word. These forms include the asterisk *, the underscore _, and the internal I before the suffix, e.g., Absolvent*innen, Absolvent_innen, and AbsolventInnen. To our knowledge, there is no standardized pronunciation norm for these innovative spellings, which is why the collected data is of special interest to phoneticians and phonologists. Pronunciation strategies include a variety of morphosyntactic expansions, lengthening of [ɪ] in the suffixes, shifts of lexical stress, pauses, and glottal stops before the suffix. Young speakers tend to use phonetic markers more often than older people, which could indicate a change in progress. Additionally, participants tend to make more mistakes in sentences with gender-neutral words. The recordings took place at the IPS in the Munich region. 56 texts were recorded from 40 speakers. The texts came from newspapers, websites, administration offices, social services, etc., and were modified to contain either one of the three gender-neutral forms or the extended form. Each speaker read the 56 sentences, whose target words appeared with an asterisk, an underscore, an uppercase I, or the feminine plural form (25% each) in a counterbalanced design. Filler sentences for this study are not part of the corpus but will be part of further investigations. This means that there are 56 recordings per session.
After the recording session participants filled in an online questionnaire, which will be part of the metadata. The participants were male and female, most of them from Bavaria or other parts of Germany: 20 students and 20 retired persons. All in all, there are 2240 recordings, all of which were transcribed orthographically, phonemically, and phonetically. In 2019, the present corpus version was created at the BAS CLARIN center of the University of Munich (LMU) for indefinite archiving and distribution.
Bayerisches Archiv für Sprachsignale
German
BAS Alcohol Language Corpus – Homepage
This corpus contains recordings of 162 speakers, each recorded both sober and intoxicated. Beginning with version 3, this corpus edition also contains an emuR compatible database version of the corpus (with a minor bugfix in the database in version 3.1).
Bayerisches Archiv für Sprachsignale
German
BAS CLIPS_MT_MANUAL – Homepage
CLIPS_MT_MANUAL is a sub-corpus of the original Italian CLIPS corpus (Corpora e Lessici dell'Italiano Parlato e Scritto) that covers only 15 maptask dialogues recorded in 15 locations by local speaker pairs. The BAS has decided to bring forward another edition of this data as we found a large number of errors (formal and content) in the annotation and signal files of the original corpus that prevented our colleagues from performing proper phonetic investigations. To make published results on these (corrected) data replicable for the scientific community, BAS decided, with the kind permission of the CLIPS copyright holders, to ingest this part of CLIPS in the BAS CLARIN repository under the name CLIPS_MT_MANUAL (MT = map task, MANUAL indicates manual annotation), thereby making it available to all European academic researchers. In a nutshell, this corpus contains 3228 inspected and partially repaired WAV signal files, each containing one dialogue turn (*.wav), 3228 corrected original CLIPS annotation files (*.acs, *.phn, *.std, *.wrd), 3228 BAS Partitur files containing the annotation tiers ORT, KAN and SAP (*.par), 3228 EMU database annotation files (*.vot, *.hlb) covering 30 maptask dialogues performed by 30 speakers (each speaker pair performing two different map tasks) recorded in 15 different locations in Italy in 2000-2004. Starting with version 1.2, the corpus is also provided in an emuR compatible json format (*_annot.json).
Bayerisches Archiv für Sprachsignale
Italian
BAS Database for Signer-Independent Continuous Sign Language Recognition – Homepage
The SIGNUM Database contains both isolated and continuous utterances of various signers. Since we use a vision-based approach for sign language recognition, the corpus was recorded on video. For quick random access to individual frames, each video clip is stored as a sequence of images. The vocabulary comprises 450 basic signs in German Sign Language (DGS) representing different word types. Based on this vocabulary, a total of 780 sentences were constructed. Each sentence ranges from two to eleven signs in length. No intentional pauses are placed between signs within a sentence, but the sentences themselves are separated. The entire corpus, i.e. all 450 basic signs and all 780 sentences, was performed once by 25 native signers of different sexes and ages. One of them was chosen to be the so-called reference signer; his performances were recorded three times rather than once. The SIGNUM Database was created within the framework of a research project at the Institute of Man-Machine Interaction, located at the RWTH Aachen University in Germany. The SIGNUM (Signer-Independent Continuous Sign Language Recognition for Large Vocabulary Using Subunit Models) project was funded by the Deutsche Forschungsgemeinschaft (German Research Foundation) and aimed to develop a video-based automatic sign language recognition system.
Bayerisches Archiv für Sprachsignale
German Sign Language
BAS Edition of German Distant Speech Data Corpus 2014/2015 – Homepage
General information: The corpus contains read German speech of 179 different speakers (50 female, 129 male). Each speaker read randomly selected sentences from four text collections: Wikipedia, the Europarl Corpus, a list of German command and control sentences, and a corpus of web-crawled sentences that represent direct speech. The recording took place at the Language Technology and Telecooperation labs, TU-Darmstadt, Germany in 2014-2015. The task for the speaker was to read fluently and precisely (no dialectal variation). Up to 5 microphones were recorded in parallel: Kinect 1 Beamformed Audio signal through Kinect SDK, Kinect 1 Direct Access as normal microphone, Internal Realtek Mic of Asus PC (near noisy fan), Samson C01U, Yamaha PSG-01S. Distance to mouth for all microphones was approx. 100 cm. Room: dry acoustics (quiet office), no noise. Sampling rate: 16 kHz, resolution: 16 bit. The speech data was collected in a controlled environment (same room, same microphone distances, etc.). Each recording has an XML transcription file that also includes speaker metadata. The data is curated (manually checked and corrected) to reduce errors and artefacts. The speech data is divided into three independent data sets: Training, Test, and Dev. Test and Dev contain new sentences and new speakers that are not part of the training set, in order to assess model quality in a speaker-independent open-vocabulary setting.
Information about the data collection procedure: (1) Train set (recordings in 2014): Sentences were randomly chosen from German Wikipedia and the Europarl Corpus to be read by the speakers. The Europarl corpus (Release v7) is a collection of the proceedings of the European Parliament between 1996 and 2011, generated by Philipp Koehn (Europarl: A Parallel Corpus for Statistical Machine Translation, Philipp Koehn, MT Summit 2005, http://www.statmt.org/europarl/). As a third data set, German command and control sentences were manually specified; they are typical of a command and control setting in living rooms. (2) Test/dev set (recordings in 2015): Additional sentences from the German Wikipedia and from the Europarl Corpus were selected for the recordings. Additionally, we collected German sentences from the web by crawling the German top-level domain and applying language filtering and deduplication. Only sentences starting with quotation marks were selected and randomly sampled. The three text sources are represented with approximately equal amounts of recordings in the test/dev set.
Bayerisches Archiv für Sprachsignale
German
BAS FORMTASK – Homepage
FORMTASK is a telephone speech database of prompted descriptions of typical forms found in everyday life. The forms are: a Berlin public transport ticket; an invoice; an Austrian parking ticket; a newsstand receipt; a money transfer form. To elicit a description of the forms, the following four questions were asked: 1) What type of form is this? 2) What date is on the form? 3) What amount is on the form? 4) Where is the amount written on the form? The speakers saw the form in black and white print on their personal paper prompt sheet. Starting from version 2.1, FORMTASK is distributed as an emuR compatible emuDB corpus.
Bayerisches Archiv für Sprachsignale
German
BAS HEMPEL – Homepage
"Hempels Sofa" is a collection of more than 3900 spontaneous speech items recorded as extra material during the German SpeechDat-II project. Speakers were asked to report what they had been doing during the last hour: "Was haben Sie in der letzten Stunde gemacht?". This item was recorded as the last item of the recording session. Speakers had become acquainted with the recording procedure and they were quite relaxed because they knew that this item was the last to be recorded. This resulted in quite natural, colloquial speech, sometimes with a marked regional accent. The corpus collection is described in more detail in the LREC2002 paper "Three New Corpora at the Bavarian Archive for Speech Signals - and a First Step Towards Distributed Web-Based Recording" by C. Draxler and F. Schiel. This paper is contained in this database in file DOC/BASCORPO.PDF; it also contains links to related SpeechDat documents. Note: the name of the corpus refers to the German proverbial phrase "wie bei Hempels unterm Sofa". This phrase is often used to indicate that something is not well cleaned up: not dirty, just in its everyday state when one is not expecting visitors. I thought the phrase to be appropriate for this data collection because quite often when listening to the recordings one gets the impression of sitting next to the speaker on the sofa in a common living room. Note: Starting from version 2.0 (CLARIN Repository Version 3), HEMPEL is distributed as an emuR compatible emuDB. Version 2.1 (CLARIN Repository 4) is distributed without the MAU (phonetic segmentation) tier, as it was found to lack accuracy.
Bayerisches Archiv für Sprachsignale
German
BAS Infrastructures for Technical Speech Processing – Homepage
Speech synthesis using concatenative techniques is maturing to a point where standard procedures are being implemented in a variety of products. However, because of the considerable costs most small and medium-sized companies as well as university labs cannot afford to produce the required speech resources on their own. Although there are some public domain German diphone voices available for research purposes (e.g. MBROLA) there is definitely a lack of publicly available synthesis resources. The BITS synthesis corpus (recorded and) produced by BAS fills the obvious gap.
Bayerisches Archiv für Sprachsignale
German
BAS Regional Variants of German - Juveniles – Homepage
The RVG-J Corpus (Regional Variants of German - Junior) was recorded in 2001 at the Institute of Phonetics and Speech Communication at the University of Munich, Germany. The corpus contains both read and non-scripted German utterances. It comprises the original RVG prompts (telephone numbers, sentences, commands, digits, etc.) plus spellings, date and time expressions, and free-form responses to questions, e.g. "What are you wearing?", "How did you get here?", etc. The speakers were adolescents between 13 and 20 years of age, recruited in public schools in Munich and the suburbs. More than 95% of the speakers have German as their native language, and almost all of them attended school in Bavaria; 89 of them were male and 93 female. Speakers younger than 18 years were required to provide a waiver signed by their parents stating that they were allowed to participate in the recordings. The corpus can be used for the training of speech recognisers or analyses of adolescent speech. Starting from version 1.5 (CLARIN Repository version 2), RVG-J is also distributed as an emuDB.
Bayerisches Archiv für Sprachsignale
German
BAS RVG1_CLARIN – Homepage
The corpus is a collection of recordings of more than 500 speakers from different dialect regions of Germany. The recordings were made using four different microphones (two in low and two in high quality) and consist of single digits, connected digits, phone numbers, phonetically balanced sentences, computer command phrases prompted on a screen, and 1 min of spontaneous speech (monologue). The speakers were recorded in normal office environments. The background noise was limited to the usual noise in an office environment, e.g. door slams, background crosstalk, phone ringing, paper rustling, PC noise, etc. Starting from version 4.2 (BAS CLARIN Repository version 3), this corpus is distributed as an emuR compatible emuDB.
Bayerisches Archiv für Sprachsignale
German
BAS SC1 – Homepage
The corpus contains speech of 88 different speakers reading the German story "Der Nordwind und die Sonne". Subcorpus T contains the recordings of 16 native Germans (L1). The other 72 speakers, who were born and educated in other countries (L2), are pooled in subcorpus C. Every speaker has a distinct accent. This corpus may be used for several tasks: automatic accent detection; testing robustness against different accents in automatic speech recognition; scientific investigation of accents in German. Subcorpus T may be used as a reference or training corpus for technical evaluations. These signals are marked with a T in the speaker information file. These recordings and the respective annotations can be found in the phondat corpus as well. Starting from Version 1.3 (CLARIN Repository version 2), this corpus is distributed as an emuDB. This emuDB contains an automatic phonetic segmentation for all speakers (level MAU) as well as a manual phonetic segmentation for subcorpus T (level PHO). Version 3 is identical to version 2 with the exception of a bugfix in the emuDB's MAU level.
Bayerisches Archiv für Sprachsignale
German
BAS SC10 – Homepage
The SC10 corpus contains read and non-prompted German and mother tongue speech of 70 different speakers from 17 mother tongues (L1) in a variety of speaking styles, e.g. reading, retelling, free talk, etc. Starting from version 1.5 (BAS CLARIN repository version 3), the corpus is distributed as an emuDB. BAS CLARIN repository version 4 is identical to version 3 with the exception of a bugfix in the emuDB's MAU level.
Bayerisches Archiv für Sprachsignale
German
BAS SI100 – Homepage
The corpus contains read speech of 101 different speakers (50 female, 50 male, 1 unknown). Each speaker has read approx. 100 sentences from either the SZ subcorpus or the CeBit subcorpus. The language is German. The subcorpus SZ contains 544 sentences from newspaper articles (Sueddeutsche Zeitung). The subcorpus CeBit contains 483 sentences from newspaper articles about the CeBit 1995. Each subcorpus is divided into 5 parts of approx. 100 utterances each. Every speaker read only one part of one subcorpus (with some exceptions), resulting in a total of 10,387 recorded utterances (31.5 h of speech). The recording took place at the Institut fuer Phonetik, University of Munich, Germany in 1995.
Bayerisches Archiv für Sprachsignale
German
BAS SI1000 – Homepage
The corpus contains read speech of 10 different speakers. Each speaker has read approx. 1000 sentences from a German newspaper corpus, resulting in a total of approx. 10,000 recorded utterances. The recording took place at the Institut fuer Phonetik, University of Munich, Germany in 1994.
Bayerisches Archiv für Sprachsignale
German
BAS Siemens Hoergeraete Corpus – Homepage
Corpus of spontaneous, relatively casual dialogues in German. Each pair of dialogue partners is recorded conversing under real-noise conditions (in a noisy cafeteria and in a car going at different velocities), as well as in a studio at various levels of Lombard noise played directly into the subjects' ears. Starting from version 2.1 (BAS CLARIN Repository version 2), this corpus is also distributed as an emuR compatible emuDB.
Bayerisches Archiv für Sprachsignale
German
BAS SmartWeb Video – Homepage
The SMARTWEB UMTS data collection was created within the publicly funded German SmartWeb project in the years 2004 - 2006. It comprises a collection of user queries to a naturally spoken Web interface with the main focus on the soccer World Cup in 2006. The recordings include 156 field recordings using a hand-held UMTS device (one person, SmartWeb Handheld Corpus SHC), 99 field recordings with video capture of the primary speaker and a secondary speaker (SmartWeb Video Corpus SVC) as well as 36 mobile recordings performed on a BMW motorbike (one speaker, SmartWeb Motorbike Corpus SMC). An addendum DVD-R (dvd-fau, vol 24) contains additional data derived from the basic SVC corpus data provided by FAU Erlangen. Starting from version 3.6 (CLARIN Repository version 4), the transcribed parts of the SVC audio recordings are distributed as an emuDB.
Bayerisches Archiv für Sprachsignale
German
BAS Strange Corpus 2 Noises – Homepage
The corpus SC2 contains read speech of 10 different speakers with screen-prompted automobile diagnosis phrases recorded under real conditions in two different car maintenance halls. The language is German. All speakers are male native Germans and have never participated in such a task before. They are all experts in the field of car diagnosis. Each speaker has spoken 800 utterances of 3 to 7 words derived from 100 different sentences (see sc2_ort.txt), resulting in a total of 8000 utterances. Starting from version 2.4 (BAS CLARIN repository version 2), the corpus is distributed as an emuDB. In BAS CLARIN repository version 4, the emuDB's MAU tier was deleted due to issues with segmentation quality stemming from the background noise in this corpus.
Bayerisches Archiv für Sprachsignale
German
BAS TAXI – Homepage
The TAXI dialog database was created in June 2001 in collaboration with the DFKI, Saarbruecken. TAXI contains 86 recorded dialogues between a cab dispatcher and a client recorded over public phone lines (network and GSM). The dispatcher always speaks German, while the clients always speak English. Starting from version 2.5 (BAS CLARIN Repository version 3), TAXI is distributed as an emuR compatible emuDB.
Bayerisches Archiv für Sprachsignale
English, German
BAS Thesis data Veronika Neumeyer: CI Articulation – Homepage
This corpus contains speech recordings of normal hearing speakers and speakers equipped with Cochlear Implants (CI), as used for analysis in the Master thesis of Veronika Neumeyer (2009, LMU München). Speech data were collected with the software SpeechRecorder, for each recording a BPF file was generated (*.par), on which the MAUS segmentation was based (*.TextGrid). Starting with version 1.2, this corpus is distributed as an emuR compatible EMU database (files ending in *_annot.json).
Bayerisches Archiv für Sprachsignale
German
BAS Verbmobil 1 – Homepage
The Verbmobil (VM) dialog database is a collection of German, American and Japanese dialog recordings in the appointment scheduling task. The data were collected during the first phase (1993 - 1996) of the German VM project funded by the German Ministry of Science and Technology (BMBF). Starting with version 3, the corpus is also provided as an emuR compatible database.
Bayerisches Archiv für Sprachsignale
English, German, Japanese
BAS Verbmobil 2 – Homepage
Verbmobil 2 contains the speech of 401 speakers participating in 810 recordings. The emotionally tagged recordings are not part of this edition but are collected in the corpus BAS VMEmo. The total VM2 corpus amounts to 17.6 GB of data containing 58961 conversational turns distributed on 39 CD-R. VM2 contains dialogs in German, English, Japanese and mixed language pairs (partly with interpreter). The domain is appointment scheduling, travel planning, and leisure time planning. Starting from version 3, the corpus is also available in emuR compatible emuDB format (see annotation files ending in *_annot.json). In Version 4 the accompanying CLARIN Documentation has been extended by the .rpr files for each session.
Bayerisches Archiv für Sprachsignale
English, German, Japanese
BAS Verbmobil Emotion – Homepage
This database contains speech signals of dialogues in which a subject was recorded during a conversation via a spontaneous speech translation system. The response of the system was designed to invoke emotions (e.g. anger) in the subjects. It is part of the larger Verbmobil 2 speech data collection. Starting from BAS CLARIN Repository version 2, the database is also distributed as an emuR compatible emu database.
Bayerisches Archiv für Sprachsignale
German
BAS VERIF1DE – Homepage
The VERIF1DE database is a subset of the VERIDAT speaker verification database collected by T-Nova. VERIDAT contains additional items and re-recordings of missing, corrupted, or otherwise unusable files in VERIF1DE. Please refer to the file DESIGN.PDF in the documentation package of this corpus for a detailed description of VERIF1DE. Users of this speech corpus are required to report any scientific publications based on these data to Felix Burkhardt (Felix.Burkhardt@telekom.de).
Bayerisches Archiv für Sprachsignale
German
BAS ZIPTEL – Homepage
The ZipTel telephone speech database contains recordings of people applying for a SpeechDat prompt sheet via telephone. For the SpeechDat data collection, calls for participation were published in "phone", the customer magazine of the mobile telephone provider e-plus, and in numerous newspapers all over Germany. In these calls, a telephone number was given where callers could order a SpeechDat prompt sheet. The calls were recorded by an automatic telephone server; callers were asked to provide name, address and telephone number. The ZipTel telephone speech database consists of 1957 recording sessions with a total of 7746 signal files. A recording session corresponds to one phone call; each signal file contains a single recorded utterance from the recording session. For privacy reasons, only a subset of the recorded signal files is contained in the database: street names (z2), ZIP codes (z3), city names (z4) and telephone numbers (z5). Starting from version 1.3 (BAS CLARIN Repository version 3), ZIPTEL is distributed as an emuDB.
Bayerisches Archiv für Sprachsignale
German
Bielefeld Speech and Gesture Alignment Corpus – Homepage
The primary data of the SaGA corpus consist of 25 dialogues between 50 interlocutors, who engage in a spatial communication task combining direction-giving and sight description. Six of those dialogues, with data from the direction giver only, are available, including audio (*.wav) and video (*.mp4) data. The secondary data consist of annotations (*.eaf) of gestures and speech-gesture referents, which have been completely and systematically annotated based on an annotation grid (cf. the SaGA documentation). The corpus comprises 9881 isolated words and 1764 isolated gestures. The stimulus is a model of a town presented in a Virtual Reality (VR) environment. Upon finishing a "bus ride" through the VR town along five landmarks, a router explained the route as well as the wayside landmarks to an unknown and naive follower. The SaGA Corpus was curated for CLARIN as part of the Curation Project "Editing and Integration of Multimodal Resources in CLARIN-D" by the CLARIN-D Working Group 6 "Speech and Other Modalities".
Bayerisches Archiv für Sprachsignale
German
Dissertation Data Dr. Veronika Neumeyer: Cluster Production in Cochlear Implant Patients (diachronic data) – Homepage
The CI_3 corpora contain diachronic speech recordings from three cochlear implant (CI) users, which were analysed in the long-term study part of Veronika Neumeyer's PhD thesis Akustische Analysen der Sprachproduktion von CI-Trägern (2015). Please note that the corpus is distributed as four separate subcorpora (CI_3_Cluster, CI_3_Sibilants, CI_3_VOT, CI_3_Vowels). For data used in the corresponding synchronic study, please refer to the CI_2 corpora. CI_3_Cluster contains recordings used for the analysis of the temporal dynamics of the consonant cluster /ʃtr/. The data was recorded using SpeechRecorder and automatically segmented using MAUS, followed by a manual correction of the target phoneme(s). The database is distributed as an emuR compatible database (emuDB format). Version 2: Fixed a bug in the emuR DB config.
Bayerisches Archiv für Sprachsignale
German
Dissertation Data Dr. Veronika Neumeyer: Consonant Cluster Production in Cochlear Implant Patients – Homepage
The CI_2 corpora contain German speech recordings of 48 cochlear implant users (CI) and 48 speakers without hearing impairment (control group, KG). The data were analyzed in Veronika Neumeyer's dissertation Akustische Analysen der Sprachproduktion von CI-Trägern (2015). CI_2_Cluster contains recordings used for the analysis of the temporal dynamics of the consonant cluster /ʃtr/. The data was recorded using SpeechRecorder and automatically segmented using MAUS, followed by a manual correction of the target phoneme(s). The database is distributed as an emuR compatible database (emuDB format). Version 2: removed derived spectra files *.dft from corpus.
Bayerisches Archiv für Sprachsignale
German
Dissertation Data Dr. Veronika Neumeyer: Sibilant Production in Cochlear Implant Patients – Homepage
The CI_2 corpora contain synchronic speech recordings of 48 cochlear implant users (CI) and 48 speakers without hearing impairment (control group, KG). The data were analyzed in Veronika Neumeyer's dissertation Akustische Analysen der Sprachproduktion von CI-Trägern (2015). CI_2_Sibilants contains recordings used for the analysis of /s/ and /ʃ/ in the following words: Tasse, Tasche. The data was recorded using SpeechRecorder and automatically segmented using MAUS, followed by a manual correction of the target phoneme(s). The database is distributed as an emuR compatible database (emuDB format). Version 3: removed all derived spectra files *.dft from corpus.
Bayerisches Archiv für Sprachsignale
German
Dissertation Data Dr. Veronika Neumeyer: Sibilant Production in Cochlear Implant Patients (diachronic data) – Homepage
The CI_3 corpora contain diachronic speech recordings from three cochlear implant (CI) users, which were analysed in the long-term study part of Veronika Neumeyer's PhD thesis Akustische Analysen der Sprachproduktion von CI-Trägern (2015). Please note that the corpus is distributed as four separate subcorpora (CI_3_Cluster, CI_3_Sibilants, CI_3_VOT, CI_3_Vowels). For data used in the corresponding synchronic study, please refer to the CI_2 corpora. CI_3_Sibilants contains recordings used for the analysis of /s/ and /ʃ/ in the following words: Tasse, Tasche. The data was recorded using SpeechRecorder and automatically segmented using MAUS, followed by a manual correction of the target phoneme(s). The database is distributed as an emuR compatible database (emuDB format).
Bayerisches Archiv für Sprachsignale
German
Dissertation Data Dr. Veronika Neumeyer: Voice Onset Time in Cochlear Implant Patients – Homepage
The CI_2 corpora contain synchronic speech recordings of 48 cochlear implant users (CI) and 48 speakers without hearing impairment (control group, KG). The data were analyzed in Veronika Neumeyer's dissertation Akustische Analysen der Sprachproduktion von CI-Trägern (2015). CI_2_VOT contains recordings used for the analysis of voice onset time in /t/ in the word teilen. The data was recorded using SpeechRecorder and automatically segmented using MAUS, followed by a manual correction of the target phoneme(s). The database is distributed as an emuR compatible database (emuDB format). Version 3: removed derived spectra files *.dft from speech corpus.
Bayerisches Archiv für Sprachsignale
German
Dissertation Data Dr. Veronika Neumeyer: Voice Onset Time in Cochlear Implant Patients (diachronic data) – Homepage
The CI_3 corpora contain diachronic speech recordings from three cochlear implant (CI) users, which were analysed in the long-term study part of Veronika Neumeyer's PhD thesis Akustische Analysen der Sprachproduktion von CI-Trägern (2015). Please note that the corpus is distributed as four separate subcorpora (CI_3_Cluster, CI_3_Sibilants, CI_3_VOT, CI_3_Vowels). For data used in the corresponding synchronic study, please refer to the CI_2 corpora. CI_3_VOT contains recordings used for the analysis of voice onset time in /t/ in the word teilen. The data was recorded using SpeechRecorder and automatically segmented using MAUS, followed by a manual correction of the target phoneme(s). The database is distributed as an emuR compatible database (emuDB format).
Bayerisches Archiv für Sprachsignale
German
Dissertation Data Dr. Veronika Neumeyer: Vowel Production in Cochlear Implant Patients – Homepage
The CI_2 corpora contain German speech recordings of 48 cochlear implant users (CI) and 48 speakers without hearing impairment (control group, KG). The data were analyzed in Veronika Neumeyer's dissertation Akustische Analysen der Sprachproduktion von CI-Trägern (2015). CI_2_Vowels contains recordings used for the analysis of seven long, lexically stressed vowels in the words Taten, stetig, Toter, Stute, töten, Tüte and kriegen. The data was recorded using SpeechRecorder and automatically segmented using MAUS, followed by a manual correction of the target phoneme(s). The database is distributed as an emuR compatible database (emuDB format). Version 3: removed derived f0 analysis files *.sf0 from speech corpus.
Bayerisches Archiv für Sprachsignale
German
Dissertation Data Dr. Veronika Neumeyer: Vowel Production in Cochlear Implant Patients (diachronic data) – Homepage
The CI_3 corpora contain diachronic speech recordings from three cochlear implant (CI) users, which were analysed in the long-term study part of Veronika Neumeyer's PhD thesis Akustische Analysen der Sprachproduktion von CI-Trägern (2015). Please note that the corpus is distributed as four separate subcorpora (CI_3_Cluster, CI_3_Sibilants, CI_3_VOT, CI_3_Vowels). For data used in the corresponding synchronic study, please refer to the CI_2 corpora. CI_3_Vowels contains recordings used for the analysis of seven long, lexically stressed vowels in the words Taten, stetig, Toter, Stute, töten, Tüte and kriegen. The data was recorded using SpeechRecorder and automatically segmented using MAUS, followed by a manual correction of the target phoneme(s). The database is distributed as an emuR compatible database (emuDB format). Version 2: Fixed a bug in the emuR DB config.
Bayerisches Archiv für Sprachsignale
German
Gesprochenes Wortkorpus für Untersuchungen zur auditiven Verarbeitung von Sprache und emotionaler Prosodie – Homepage
WaSeP contains recordings of one female and one male speaker, both professional actors, uttering single German nouns and pseudowords in multiple emotional prosodies. This edition improves the segmentation of the phonetic annotation, adds Praat TextGrid files and removes a few irregular items.
Bayerisches Archiv für Sprachsignale
German
LMU AsiCa – Homepage
The AsiCa corpus documents the southern Italian dialect Calabrese. The main aims in building this corpus were the analysis of syntactic structures and their geolinguistic mapping in the form of interactive, web-based cartography. The corpus consists of audio recordings of some sixty speakers of Calabrese, half of whom have migration experience in Germany, while the other half have almost always stayed in Calabria. The informants were also balanced with regard to gender, age and geographical origin. For most informants there is at least one recording of spontaneous speech and one recording based on stimuli. The results of the syntactic analysis (maps and text) can be seen on the project's website at http://www.asica.gwi.uni-muenchen.de.
Bayerisches Archiv für Sprachsignale
Italian
MultiCHannel Articulatory database: English – Homepage
The MOCHA database was compiled as part of the Engineering and Physical Sciences Research Council grant number GR/L78680: Speech recognition using articulatory data. It features a set of 460 short sentences designed to include the main connected speech processes in English (e.g. assimilations, weak forms ...). All recordings were made in the same sound-damped studio at the Edinburgh Speech Production Facility based in the department of Speech and Language Sciences, Queen Margaret University College, UK. The database contains audio files, laryngograph waveforms, electromagnetic articulograph (EMA) tracks and electropalatograph (EPG) tracks. It is distributed as an emuR compatible EMU database. Conversion into this format was done at the Bavarian Archive for Speech Signals. The original database is available here: http://www.cstr.ed.ac.uk/research/projects/artic/mocha.html
Bayerisches Archiv für Sprachsignale
German
Natural Media Motion-Capture Corpus – Homepage
The Natural Media Motion Capture Corpus (NM-MoCap-Corpus) originates from a case study recorded in Aachen (Germany) in 2011 for a theory of Gesture Form Analysis (Hassemer & McCleary, in press, The multidimensionality of pointing. Gesture; Hassemer, 2016, Towards a theory of Gesture Form Analysis), which aimed at eliciting object descriptions containing gestural information about stimulus objects. Gesture Form Analysis is based on differentiating between the physical configuration of the articulating body part (articulator form) and the spatial information that an observer abstracts from articulator form (gesture form), for example by profiling specific parts of an articulator. Of particular interest were differences in the participants' depiction of size information versus size and shape information. The corpus consists of data from 18 participants, whose task was to describe nine objects each to an experimenter, without using everyday vocabulary about forms, sizes or objects. The participants were recorded on audio and several video cameras, and their hand movements were recorded using an optical VICON motion capture system. ELAN annotations for gestural holds displaying size or shape information were generated semi-automatically from the motion capture data. Each participant's session contains ten combined motion capture and video recordings (nine object descriptions and one calibration task). Each motion capture recording consists of three video files and two data files. For each object description, one ELAN annotation file was produced. Furthermore, each participant's data contain one HD video of the entire session.
In total, the corpus consists of 557 video files (*.mp4) (one file missing), 720 annotation files (*.eaf) and 162 motion capture data files (*.csv). The NM-MoCap-Corpus was curated for CLARIN as part of Curation Project 1 "Editing and Integration of multimodal resources in CLARIN-D" by CLARIN-D Working Group 6 "Speech and Other Modalities". BAS CLARIN Repository version 2 contains manual annotations of articulator profile, gesture type, meaningful motion, and "difficult to code or not" for both hands, according to Gesture Form Analysis v.06 (Hassemer & McCleary, in press, The multidimensionality of pointing. Gesture; Hassemer, 2016, Towards a theory of Gesture Form Analysis).
Bayerisches Archiv für Sprachsignale
German
Nautilus Speaker Characterization – Homepage
NSC contains scripted, semi-spontaneous, and spontaneous human-human dialogs. In total, 300 speakers of German without noticeable accent participated and were recorded in an acoustically isolated room. Interactions between speakers and their interlocutor are provided in separate mono files, accompanied by timestamps and tags that define the speakers' turns. The speech corresponding to one of the semi-spontaneous dialogs was labeled with respect to perceived interpersonal speaker characteristics and naive voice descriptions. These labels are found alongside the documentation. Resource ISLRN: 157-037-166-491-1.
Bayerisches Archiv für Sprachsignale
German
OH.D Colonia Dignidad. Ein chilenisch-deutsches Oral History-Archiv – Homepage
The online archive "Colonia Dignidad. Ein chilenisch-deutsches Oral History-Archiv" (a Chilean-German oral history archive) contains interviews with contemporary witnesses of a German sect settlement in southern Chile. Between 1961 and 2005, the sect members and their own as well as Chilean children were isolated, indoctrinated, exploited, tormented and sexually abused. In the 1990s, sect leader Paul Schäfer committed sexual violence against Chilean boys. During the Chilean dictatorship from 1973 to 1990, members of the opposition were tortured and murdered there. To gain access to the complete interviews, photos and explanatory notes, you must register. The terms of use must be observed, in particular the personal rights of the interviewees. The accompanying website offers further information on the archive, on Colonia Dignidad and on the interview project. The archive is still under construction.
Bayerisches Archiv für Sprachsignale
German, Spanish
OH.D Interview-Archiv Eiserner Vorhang – Homepage
This interview archive contains 16 video interviews with relatives, friends and fellow escapees of GDR citizens who lost their lives at the inner-German border, in the Baltic Sea and at the "extended Wall" of the Iron Curtain. These interviews focus on the memory of the people who fell victim to the border regime of the GDR and the other Eastern Bloc states. The archive also contains life-history interviews with former GDR citizens who were exposed to the arbitrary justice of the GDR after applying for an exit permit or after a failed escape attempt, as well as with some who managed to escape. To gain access to the complete interviews, photos and explanatory notes, you must register. The terms of use must be observed, in particular the personal rights of the interviewees. Further information on the project is available on the accompanying website.
Bayerisches Archiv für Sprachsignale
German
OHD Deutsches Gedächtnis – Homepage
The archive "Deutsches Gedächtnis" (German Memory) holds subjective testimonies of memory, such as life-history interviews, autobiographies, diaries and collections of letters, from a wide variety of people with a connection to socio-political events in Germany or to German history. They originate both from Germany and from abroad. Accordingly, most of the documents are in German, although every fifth interview is in another language. The interviews have been conducted since the early 1980s as part of contemporary-history research projects of the institute and its predecessor projects. In addition, there are biographical interviews from third-party research in various disciplines, whose authors have handed their collections over to the archive for further scholarly use. Besides interviews, written testimonies such as autobiographies, family chronicles, diaries and collections of letters are also archived.
Bayerisches Archiv für Sprachsignale
German
Ph@ttSessionz Adolescents Speech Corpus – Homepage
The Ph@ttSessionz speech database contains recordings of 1019 adolescent speakers of German (age range 12-20). The recordings were performed via the WWW in public schools (Gymnasium) in 45 locations in Germany. The speech material recorded is a superset of the German SpeechDat-II and RVG-I corpora. It is now also available for download in emuR compatible format (starting from version 2.1.0).
Bayerisches Archiv für Sprachsignale
German
PhonDat 1 – Homepage
The corpus contains read speech of 201 different speakers. Each speaker read a subcorpus of 450 different sentence equivalents (including alphanumericals and two shorter passages of prose text); 8 speakers read the whole sentence corpus; 40 speakers read the subcorpora BR and MR; 112 speakers read 70 utterances of the rest corpus, including alphabet, numbers 0 to 12 and stories. The speakers were recorded at four different sites in Germany (University of Kiel, University of Bonn, University of Bochum, University of Munich). The language is German. The corpus contains a total of 21587 recorded utterances. Starting from version 4.1 (BAS CLARIN repository version 3), this corpus is available as an emuDB.
Bayerisches Archiv für Sprachsignale
German
PhonDat 2 – Homepage
The corpus contains read speech of 16 different speakers, 6 women and 10 men. Each speaker reads a corpus of 200 different sentences from a train query task. They were recorded at three different sites in Germany (University of Kiel, University of Bonn, University of Munich). The language is German. The corpus contains a total of 3200 recorded utterances. Starting with version 3.0 (BAS CLARIN Repository version 4), the corpus is also distributed as an emuDB.
Bayerisches Archiv für Sprachsignale
German
Schweizer Jugendsprache – Homepage
Recordings of adolescent pupils in Switzerland.
Bayerisches Archiv für Sprachsignale
Swiss German
SmartKom Audio – Homepage
This corpus contains the audio recordings of all actors who use the SmartKom system; it covers the audio recordings (no video) and annotations of all three original SmartKom corpora: Public, Mobile and Home. Naive users were asked to test a prototype for a market study, not knowing that the system was in fact controlled by two human operators. They were asked to solve two tasks within 4.5 min while they were left alone with the system. Instructions were kept to a minimum; the user only knew that the system was able to understand speech, gestures and even facial expressions and should communicate more or less like a human.
Bayerisches Archiv für Sprachsignale
German
SmartKom Home – Homepage
This corpus contains multimodal recordings of 65 actors who use the SmartKom system. SmartKom Home is intended as an intelligent communication assistant for the private environment. Naive users were asked to test a prototype for a market study, not knowing that the system was in fact controlled by two human operators. They were asked to solve two tasks within 4.5 min while they were left alone with the system. Instructions were kept to a minimum; the user only knew that the system was able to understand speech, gestures and even facial expressions and should communicate more or less like a human. Version 1.3 (BAS CLARIN Repository Version 3): fixed duration column in BPF files in tiers USH, USM and OCC; duration in samples was too high by exactly 1. (BAS CLARIN Repository Version 4): edited documentation.
Bayerisches Archiv für Sprachsignale
German
SmartKom Mobil – Homepage
This corpus contains multimodal recordings of 73 actors who use the SmartKom system. SmartKom Mobil is a portable PDA equipped with a net link and additional intelligent communication devices. Naive users were asked to test a prototype for a market study, not knowing that the system was in fact controlled by two human operators. They were asked to solve two tasks within 4.5 min while they were left alone with the system. Instructions were kept to a minimum; the user only knew that the system was able to understand speech and gestures and should communicate more or less like a human. Experiments were not performed in the field but rather in a studio-like environment. Background noise was played back artificially, and the users did not carry the PDA in their hand but rather used a much smaller version of the SIVIT projection plane (to simulate a PDA display) and a pen as a pointing device. Speakers spoke into a headset microphone. Version 1.4 (BAS CLARIN Repository Version 3: updated documentation) (BAS CLARIN Repository Version 4: re-coded mimic camera DV videos into a modern DV codec)
Bayerisches Archiv für Sprachsignale
German
SmartKom Public – Homepage
This corpus contains multimodal recordings of 86 actors who use the SmartKom system. SmartKom Public is comparable to a traditional public phone booth but equipped with additional intelligent communication devices. Naive users were asked to test a prototype for a market study, not knowing that the system was in fact controlled by two human operators. They were asked to solve two tasks within 4.5 min while they were left alone with the system. Instructions were kept to a minimum; the user only knew that the system was able to understand speech, gestures and even facial expressions and should communicate more or less like a human.
Bayerisches Archiv für Sprachsignale
German
SmartWeb Motorbike Corpus SMC – Homepage
The SMARTWEB UMTS data collection, of which the SMC corpus is a part, was created within the publicly funded German SmartWeb project between 2004 and 2006. It comprises a collection of user queries to a naturally spoken Web interface with the main focus on the 2006 soccer World Cup. The SMC corpus itself contains 36 mobile recordings performed on a BMW motorbike. Starting from version 2.6 (CLARIN Repository Version 2), SMC is also distributed as an emuR compatible emuDB.
Bayerisches Archiv für Sprachsignale
German
Spoken production of gender-neutral nouns in German – Homepage
This corpus examines the pronunciation of different gender-neutral forms in German. Various source texts were used, such as newspaper articles and websites.
Bayerisches Archiv für Sprachsignale
German
The Karl-Eberhard-Corpus of spontaneously spoken conversations in Southern German – Homepage
The KEC contains recordings of 79 speakers of Southern German. Two speakers, usually acquainted with each other, had a one-hour conversation in separate booths. Manual annotation at the word level is provided; automatic annotation at the segment level and automatic morphological tagging have been added.
Bayerisches Archiv für Sprachsignale
German
The Zurich Tangram Corpus - BAS Edition – Homepage
This corpus contains tasks in which one subject (the instructor) describes different Tangram figures to another subject (the receiver) so that the receiver can recreate the same order of figures that the instructor has in front of them. The subjects initially don't know each other and work together to solve these tasks in three consecutive sessions. This edition only features the transcribed segments, not those in between, and uses separate files for each subject. If you would like the complete recordings with both subjects combined in one file (but missing the word and phone segmentation), see corpus ZTC_UZH.
Bayerisches Archiv für Sprachsignale
German
The Zurich Tangram Corpus - UZH Edition – Homepage
This corpus contains tasks in which one subject (the instructor) describes different Tangram figures to another subject (the receiver) so that the receiver can recreate the same order of figures that the instructor has in front of them. The subjects initially don't know each other and work together to solve these tasks in three consecutive sessions. This edition features the complete recordings but lacks phone and word segmentation. The subjects' audio tracks are combined into stereo files. If you would like just the transcribed segments with separate files for the subjects, or want the word and phone segmentation, see corpus ZTC_BAS.
Bayerisches Archiv für Sprachsignale
German
Untersuchung auditiver und akustischer Merkmale zur Evaluation der Stimmaehnlichkeit von Bruederpaaren unter forensischen Aspekten – Homepage
BROTHERS contains recordings of pairs of brothers between the ages of 19 and 31. The native and recorded language is German. The recordings were analyzed in Hanna Feiser's dissertation Untersuchung auditiver und akustischer Merkmale zur Evaluation der Stimmähnlichkeit von Brüderpaaren unter forensischen Aspekten to evaluate the pair-wise similarity of sibling voices and the degree to which they are confused by listeners. Recordings consist of minimal pairs in carrier sentences, a different set of sentences aimed at eliciting the full range of German vowels (Berliner Sätze), and a spontaneous dialogue about a TV series. Recordings were made via a table microphone (studio quality) and via telephone (telephone quality). Transcriptions and an automatically derived phonetic segmentation are provided along with the formant and fundamental frequency SSFF tracks used in the original dissertation. The corpus is provided as a ready-to-use emuR compatible emu database. This version fixes an error in the structure of the emu database in version 1.0 (incorrect mappings from bundles to annotation files).
Bayerisches Archiv für Sprachsignale
German
Berliner Zeitung (1994–2005) – Homepage
Berliner Zeitung (1994–2005) – Newspaper archive at the Berlin-Brandenburg Academy of Sciences and Humanities.
Berlin-Brandenburg Academy of Sciences and Humanities
German
Die Grenzboten (“Messengers from the Borders”) – Homepage
Die Grenzboten (“Messengers from the Borders”) (1841–1922) – digitized periodical from the archives of the State and University Library Bremen.
Berlin-Brandenburg Academy of Sciences and Humanities
German
Dortmund Chat Corpus – Homepage
Dortmund Chat Corpus – Basis for and aid to linguistic investigations of synchronic internet-based communication.
Berlin-Brandenburg Academy of Sciences and Humanities
German
DWDS-Kernkorpus – Homepage
DWDS-Kernkorpus – balanced corpus of 20th-century German at the Berlin-Brandenburg Academy of Sciences and Humanities.
Berlin-Brandenburg Academy of Sciences and Humanities
German
German Text Archive (DTA) – Homepage
German Text Archive (DTA) – Basis for a reference corpus of the New High German language.
Berlin-Brandenburg Academy of Sciences and Humanities
German
Polytechnical Journal (Dingler Online) – Homepage
Text corpus of the Polytechnical Journal (Dingler Online) at the BBAW CLARIN-D Service Centre.
Berlin-Brandenburg Academy of Sciences and Humanities
German
Reference Corpus of Middle High German (ReM) – Homepage
Reference Corpus of Middle High German (ReM) – a corpus of diplomatically transcribed and annotated texts from Middle High German (1050–1350).
Berlin-Brandenburg Academy of Sciences and Humanities
German
Tagesspiegel – Homepage
Tagesspiegel – Newspaper archive at the Berlin-Brandenburg Academy of Sciences and Humanities.
Berlin-Brandenburg Academy of Sciences and Humanities
German
Corpus LVK2018 (12M tokens)
CLARIN Centre of Latvian language resources and tools
Latvian
BNC
CLARIN-PL Language Technology Centre
English
ClarinPL Corpus 1.0
CLARIN-PL Language Technology Centre
English
COCA
CLARIN-PL Language Technology Centre
English
English Wikipedia
CLARIN-PL Language Technology Centre
English
CFPP2000 – Homepage
Corpus de Français Parlé Parisien des années 2000
Collections de corpus oraux numeriques
French
CorpAfroAs – Homepage
Corpus Oral en langues Afroasiatiques
Collections de corpus oraux numeriques
Beja, Hausa, Libyan Arabic, Saya, Sudanese Creole Arabic
CRFP – Homepage
Corpus du Français Parlé de nos Régions
Collections de corpus oraux numeriques
French
Diakorp - written historical Czech
Diachronic corpus, version 6, 18 December 2015
Czech National Corpus
Czech
KSP v2 - contemporary poetry
Corpus of Contemporary Poetry, version 2, 13 September 2024
Czech National Corpus
Czech
Net v2 - semi-official Internet communication
Corpus of semi-official Internet communication, version 2, 3 February 2021
Czech National Corpus
Czech
Online 2 Now - monitoring corpus of internet journalism
monitoring corpus of internet journalism for the past 6 months
Czech National Corpus
Czech
oral v1 - informal spoken Czech
ORAL corpora merge, version 1, 2 June 2017
Czech National Corpus
Czech
Orator v2 - monologue corpus
Monologue Corpus, version 2, 18 December 2020
Czech National Corpus
Czech
Ortofon v3 - informal spoken Czech
Corpus of informal spoken Czech with multilevel transcription, version 3, 15 July 2024
Czech National Corpus
Czech
SYN 2020 - contemporary written Czech
Synchronic representative and reference corpus of contemporary written Czech, containing 100 million text words.
Czech National Corpus
Czech
Ancient Greek Prose (Vanessa Gorman) – Homepage
A collection of hand-annotated dependency syntax trees of ancient Greek prose, using the Arethusa program available at the Perseids Project. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Eberhard Karls Universität Tübingen
Ancient Greek (to 1453)
BulTreebank – Homepage
The main objective of BulTreeBank Project was to create a high quality set of syntactic structures of Bulgarian sentences within the framework of HPSG. Ideally, the tree bank should contain samples of all the syntactic structures of the language.
Eberhard Karls Universität Tübingen
Bulgarian
D/DC:Silbersee – Homepage
Automatically parsed edition of "Der Schatz im Silbersee" (1894) by Karl May, from the TüBa-D/DC.
Eberhard Karls Universität Tübingen
German
DDC:Prozess – Homepage
Automatically parsed edition of "Der Prozess" (1925) by Franz Kafka, from the TüBa-D/DC.
Eberhard Karls Universität Tübingen
German
Early-New-High-German – Homepage
The Early New High German treebank contains texts from a timespan covering the period from 1350 to 1650. The entire corpus consists of 21432 sentences and 600569 tokens.
Eberhard Karls Universität Tübingen
German
Europarl (Tüba D/DP r5) – Homepage
This treebank contains German parliamentary proceedings (2020). It comprises ~2.2 million sentences and ~55 million tokens, with part-of-speech, dependency, morphology, lemma, topological field and named entity annotations. These annotations were obtained with a neural parser (https://github.com/stickeritis/sticker2). More information about the treebank can be found here (https://sfb833-a3.github.io/tueba-ddp/)
Eberhard Karls Universität Tübingen
German
Europarl14 – Homepage
This treebank contains the German sentences of the EuroParl subcorpus (Koehn, 2005) of the OPUS corpus (Tiedemann, 2012). It consists of proceedings of the European Parliament from 1996 to 2011. The part-of-speech and topological field (De Kok & Hinrichs, 2016) annotations were made using the neural tagger sticker. The corpus was parsed using the dpar neural dependency parser. Morphological annotations and lemmas were added using Marmot and Lemming (Müller et al., 2015). Each layer is machine-annotated following the TüBa-D/Z annotation guidelines. The tagsets used for the annotations are Universal Dependencies and universal part-of-speech tags.
Eberhard Karls Universität Tübingen
German
German-Wikipedia-2018-Hamburger-Dependencies (formerly: German Wikipedia Updated) – Homepage
This treebank contains the Wikipedia part of the TüBa-D/DP, based on a Wikipedia dump from 2018. It contains part-of-speech, morphology, lemmas, topological fields and dependency annotations. The dependency tagset is Hamburger Dependencies. For a newer version of this treebank with Universal Dependencies and Wikipedia data from 2019, see the Wikipedia-2019-Universal-Dependencies treebank in TüNDra. For more information on licensing and availability, see https://sfb833-a3.github.io/tueba-ddp/
Eberhard Karls Universität Tübingen
German
IT-TB-Dec-2015 – Homepage
Extracts from three works of Thomas Aquinas, containing 15,295 tagged and syntactically parsed sentences.
Eberhard Karls Universität Tübingen
Latin
IT-TB-Jun-2020 – Homepage
Extracts from three works of Thomas Aquinas, containing 26,831 tagged and syntactically parsed sentences.
Eberhard Karls Universität Tübingen
Latin
Perseus Greek – Homepage
The Ancient Greek Dependency Treebank from the Perseus Project
Eberhard Karls Universität Tübingen
Ancient Greek (to 1453)
Perseus Latin – Homepage
The Latin Dependency Treebank from the Perseus Project
Eberhard Karls Universität Tübingen
Latin
political speeches (Tüba D/DP r5) – Homepage
This treebank contains the political speeches corpus (Adrien Barbaresi), comprising ~619,000 sentences and ~12.8 million tokens. It is annotated with part-of-speech tags, dependencies, morphology, lemmas, topological fields and named entities. These annotations were obtained with a neural parser (https://github.com/stickeritis/sticker2). More information about the treebank can be found here (https://sfb833-a3.github.io/tueba-ddp/).
Eberhard Karls Universität Tübingen
German
TüBa-D/S – Homepage
The Tübingen Treebank of Spoken German, TüBa-D/S, is a syntactically annotated corpus based on spontaneous dialogues, which were manually transliterated.
Eberhard Karls Universität Tübingen
German
TüBa-D/Z 10 – Homepage
TüBa-D/Z treebank of German, consisting of over 95,000 manually annotated newspaper texts from the German daily tageszeitung.
Eberhard Karls Universität Tübingen
German
TüBa-D/Z 10 Dep – Homepage
TüBa-D/Z treebank of German, consisting of over 95,000 manually annotated newspaper texts from the German daily Tageszeitung. Dependencies automatically constructed from the main TüBa-D/Z distribution. Not human corrected.
Eberhard Karls Universität Tübingen
German
TüBa-D/Z v11 – Homepage
TüBa-D/Z treebank is a manually syntactically annotated German newspaper corpus based on data taken from the daily issues of 'die tageszeitung' (taz). The treebank comprises 3,816 newspaper articles (104,787 sentences; 1,959,474 tokens).
Eberhard Karls Universität Tübingen
German
TüBa-D/Z v11 UD – Homepage
TüBa-D/Z is a manually syntactically annotated German newspaper corpus based on data taken from the daily issues of 'die tageszeitung' (taz). The treebank comprises 3,816 newspaper articles (104,787 sentences; 1,959,474 tokens). Dependencies automatically constructed from the main TüBa-D/Z distribution. Not human corrected. This version of the treebank has also been updated to contain a 'multiword' column, so that searches on the original forms of split contractions can be performed, e.g. 'im', which is split into 'in dem' in UD.
Eberhard Karls Universität Tübingen
German
TüBa-E/S – Homepage
The Tübingen Treebank of Spoken English, TüBa-E/S, is a syntactically annotated corpus based on spontaneous dialogues, which were manually transliterated.
Eberhard Karls Universität Tübingen
English
TüBa-J/S – Homepage
The Tübingen Treebank of Spoken Japanese, TüBa-J/S, is a syntactically annotated corpus based on spontaneous dialogues, which were manually transliterated.
Eberhard Karls Universität Tübingen
Japanese
TuebaDDC – Homepage
Tübingen Treebank of Written German - Diachronic Corpus
Eberhard Karls Universität Tübingen
German
UD Abaza-ATB – Homepage
UD_Abaza-ATB is a treebank based on [Spoken corpus of Abaza](http://lingconlab.ru/spoken_abaza/).
Eberhard Karls Universität Tübingen
Abaza
UD Abkhaz-AbNC – Homepage
UD_Abkhaz-AbNC is a treebank based on texts from the Abkhaz National Corpus, [AbNC](https://clarino.uib.no/abnc).
Eberhard Karls Universität Tübingen
Abkhazian
UD Afrikaans-AfriBooms – Homepage
UD Afrikaans-AfriBooms is a conversion of the AfriBooms Dependency Treebank, originally annotated with a simplified PoS set and dependency relations according to a subset of the Stanford tag set. The corpus consists of public government documents.
Eberhard Karls Universität Tübingen
Afrikaans
UD Akkadian-PISANDUB – Homepage
A small set of sentences from Babylonian royal inscriptions.
Eberhard Karls Universität Tübingen
Akkadian
UD Akuntsu-TuDeT – Homepage
UD_Akuntsu-TuDeT is a collection of annotated sentences in [Akuntsú](https://glottolog.org/resource/languoid/id/akun1241). The sentences stem from the grammatical description by Aragon (2014) and Aragon's field work. Sentence annotation and documentation by Carolina Aragon, Fabrício Ferraz Gerardi, Luana dos Santos.
Eberhard Karls Universität Tübingen
Akuntsu
UD Albanian-TSA – Homepage
The UD Albanian Treebank is a small treebank for Standard Albanian, developed within a project framework at Uppsala University. The data was extracted from Wikipedia.
Eberhard Karls Universität Tübingen
Albanian
UD Amharic-ATT – Homepage
UD_Amharic-ATT is a manually developed treebank for Amharic. Sentences were collected from grammar books, fiction, biographies, religious texts and news.
Eberhard Karls Universität Tübingen
Amharic
UD Ancient Greek-Perseus – Homepage
This Universal Dependencies Ancient Greek Treebank consists of an automatic conversion of a selection of passages from the Ancient Greek and Latin Dependency Treebank 2.1
Eberhard Karls Universität Tübingen
Ancient Greek (to 1453)
UD Ancient Greek-PROIEL – Homepage
UD_Ancient_Greek-PROIEL is converted from the Ancient Greek data in the PROIEL treebank, and consists of the New Testament plus selections from Herodotus.
Eberhard Karls Universität Tübingen
Ancient Greek (to 1453)
UD Ancient Greek-PTNK – Homepage
UD Ancient Greek PTNK contains portions of the Septuagint according to the Codex Alexandrinus.
Eberhard Karls Universität Tübingen
Ancient Greek (to 1453)
UD Ancient Hebrew-PTNK – Homepage
UD Ancient Hebrew PTNK contains portions of the Biblia Hebraica Stuttgartensia with morphological annotations from [ETCBC](https://github.com/etcbc/bhsa).
Eberhard Karls Universität Tübingen
Ancient Hebrew
UD Apurina-UFPA – Homepage
This is an Apurinã treebank consisting of sentences from a grammatical description of the language by Maília Fernanda.
Eberhard Karls Universität Tübingen
Apurinã
UD Arabic-NYUAD – Homepage
The NYUAD Arabic UD treebank is based on the Penn Arabic Treebank (PATB), parts 1, 2, and 3, through conversion to CATiB dependency trees.
Eberhard Karls Universität Tübingen
Arabic
UD Arabic-PADT – Homepage
The Arabic-PADT UD treebank is based on the [Prague Arabic Dependency Treebank](http://ufal.mff.cuni.cz/padt/) (PADT), created at the Charles University in Prague.
Eberhard Karls Universität Tübingen
Arabic
UD Arabic-PUD – Homepage
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
Eberhard Karls Universität Tübingen
Arabic
UD Armenian-ArmTDP – Homepage
A Universal Dependencies treebank for Eastern Armenian developed for UD originally by the ArmTDP team led by Marat M. Yavrumyan at the Yerevan State University.
Eberhard Karls Universität Tübingen
Armenian
UD Armenian-BSUT – Homepage
A Universal Dependencies treebank for Eastern Armenian developed for UD originally by the ArmTDP team led by Marat M. Yavrumyan at the V. Brusov State University in Yerevan.
Eberhard Karls Universität Tübingen
Armenian
UD Assyrian-AS – Homepage
The Uppsala Assyrian Treebank is a small treebank for Modern Standard Assyrian. The corpus is collected and annotated manually. The data was randomly collected from different textbooks and a short translation of The Merchant of Venice.
Eberhard Karls Universität Tübingen
Assyrian Neo-Aramaic
UD Azerbaijani-TueCL – Homepage
This is a small treebank of grammatical examples for Azerbaijani. The treebank tries to be neutral about the particular variety (North or South Azerbaijani) and hence uses the ISO code for the macrolanguage (`az`).
Eberhard Karls Universität Tübingen
Azerbaijani
UD Bambara-CRB – Homepage
The UD Bambara treebank is a section of the Corpus Référence du Bambara annotated natively with Universal Dependencies.
Eberhard Karls Universität Tübingen
Bambara
UD Basque-BDT – Homepage
The Basque UD treebank is based on an automatic conversion from part of the Basque Dependency Treebank (BDT), created at the University of the Basque Country by the IXA NLP research group. The treebank consists of 8,993 sentences (121,443 tokens) and covers mainly literary and journalistic texts.
Eberhard Karls Universität Tübingen
Basque
UD Bavarian-MaiBaam – Homepage
MaiBaam is manually annotated with part-of-speech tags and syntactic dependencies. The treebank encompasses diverse text genres (wiki articles and discussions, grammar examples, fiction, and commands for virtual assistants) and dialects from the North, Central and South Bavarian areas as well as the dialectal transition areas in between.
Eberhard Karls Universität Tübingen
Bavarian
UD Beja-NSC – Homepage
A Universal Dependencies corpus for Beja, a language of the North Cushitic branch of the Afro-Asiatic phylum, mainly spoken in Sudan, Egypt and Eritrea.
Eberhard Karls Universität Tübingen
Beja
UD Belarusian-HSE – Homepage
The Belarusian UD treebank is based on a sample of the news texts included in the Belarusian-Russian parallel subcorpus of the Russian National Corpus, online search available at: http://ruscorpora.ru/search-para-be.html.
Eberhard Karls Universität Tübingen
Belarusian
UD Bengali-BRU – Homepage
The BRU Bengali treebank has been created at Begum Rokeya University, Rangpur, by the members of Semantics Lab.
Eberhard Karls Universität Tübingen
Bengali
UD Bhojpuri-BHTB – Homepage
The [Bhojpuri](https://en.wikipedia.org/wiki/Bhojpuri_language) UD Treebank (BHTB) is a part of the [Universal Dependency treebank](http://universaldependencies.org/) project.
Eberhard Karls Universität Tübingen
Bhojpuri
UD Bororo-BDT – Homepage
UD_Bororo-BDT is a compilation of annotated sentences in [Bororo](https://glottolog.org/resource/languoid/id/boro1282). The corpus encompasses sentences derived from diverse sources: grammar examples, mythological narratives, fieldwork material, and other sources. Sentence annotation and documentation by [Fabrício Ferraz Gerardi](https://languagestructure.github.io).
Eberhard Karls Universität Tübingen
Borôro
UD Breton-KEB – Homepage
UD Breton-KEB is a treebank of Breton that has been manually annotated according to the Universal Dependencies guidelines. The tokenisation guidelines and morphological annotation come from a finite-state morphological analyser of Breton released as part of the [Apertium project](http://www.apertium.org).
Eberhard Karls Universität Tübingen
Breton
UD Bulgarian-BTB – Homepage
UD_Bulgarian-BTB is based on the HPSG-based BulTreeBank, created at the Institute of Information and Communication Technologies, Bulgarian Academy of Sciences. The original consists of 215,000 tokens (over 15,000 sentences). All the texts were processed automatically at the tokenization, morphological and chunk levels. Then, the full syntactic analyses were performed manually by trained annotators.
Eberhard Karls Universität Tübingen
Bulgarian
UD Buryat-BDT – Homepage
The UD Buryat treebank was annotated manually natively in UD and contains grammar book sentences, along with news and some fiction.
Eberhard Karls Universität Tübingen
Buriat
UD Cantonese-HK – Homepage
A Cantonese treebank (in Traditional Chinese characters) of film subtitles and of legislative proceedings of Hong Kong, parallel with the Chinese-HK treebank.
Eberhard Karls Universität Tübingen
Yue Chinese
UD Cappadocian-TueCL – Homepage
This is a treebank of Pharasiot, a critically endangered Greek dialect originally spoken near Cappadocia. The source material is fairy tales collected during field study.
Eberhard Karls Universität Tübingen
Cappadocian Greek
UD Catalan-AnCora – Homepage
Catalan data from the [AnCora](http://clic.ub.edu/corpus/) corpus.
Eberhard Karls Universität Tübingen
Catalan
UD Cebuano-GJA – Homepage
UD_Cebuano_GJA is a collection of annotated Cebuano sample sentences randomly taken from three different sources: community-contributed samples from the website Tatoeba, a Cebuano grammar book by Bunye & Yap (1971) and Tanangkinsing's reference grammar on Cebuano (2011). This project is currently a work in progress.
Eberhard Karls Universität Tübingen
Cebuano
UD Chinese-Beginner – Homepage
A treebank of Chinese sentences adapted for learners at levels A1 to C1 (HSK1 to 5), collected on the [Chinese Grammar Wiki](https://resources.allsetlearning.com/chinese/grammar/) (CC BY-NC-SA 3.0 License) website. The treebank was manually annotated by researchers of Paris Nanterre University (Modyco) in the mSUD annotation schema (morpheme level Surface Universal Dependencies).
Eberhard Karls Universität Tübingen
Chinese
UD Chinese-CFL – Homepage
The Chinese-CFL UD treebank is manually annotated by Keying Li with minor manual revisions by Herman Leung and John Lee at City University of Hong Kong, based on essays written by learners of Mandarin Chinese as a foreign language. The data is in Simplified Chinese.
Eberhard Karls Universität Tübingen
Chinese
UD Chinese-GSD – Homepage
Traditional Chinese Universal Dependencies Treebank annotated and converted by Google.
Eberhard Karls Universität Tübingen
Chinese
UD Chinese-GSDSimp – Homepage
Simplified Chinese Universal Dependencies dataset converted from the GSD (traditional) dataset with manual corrections.
Eberhard Karls Universität Tübingen
Chinese
UD Chinese-HK – Homepage
A Traditional Chinese treebank of film subtitles and of legislative proceedings of Hong Kong, parallel with the Cantonese-HK treebank.
Eberhard Karls Universität Tübingen
Chinese
UD Chinese-PatentChar – Homepage
A treebank of Chinese patent application texts collected from the Chinese patent office's website CNIPA. The sentences are randomly selected from the patent claims of the IPC section "G" from November 2017 to September 2018.
Eberhard Karls Universität Tübingen
Chinese
UD Chinese-PUD – Homepage
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
Eberhard Karls Universität Tübingen
Chinese
UD Chukchi-HSE – Homepage
This data is a manual annotation of material from the multimedia annotated corpus of the [Chuklang](http://chuklang.ru/) project, a dialectal corpus of the Amguema variant of Chukchi.
Eberhard Karls Universität Tübingen
Chukot
UD Classical Armenian-CAVaL – Homepage
The present release includes a treebank of the Classical Armenian translation of the four Gospels (95370 tokens in 4146 sentences) as part of the UD Classical Armenian-CAVaL treebank project. It results from a conversion of the PROIEL annotation of the Classical Armenian Gospels, which has been manually corrected and extended with additional information.
Eberhard Karls Universität Tübingen
Classical Armenian
UD Classical Chinese-Kyoto – Homepage
Classical Chinese Universal Dependencies Treebank annotated and converted by the Institute for Research in Humanities, Kyoto University.
Eberhard Karls Universität Tübingen
Literary Chinese
UD Classical Chinese-TueCL – Homepage
A dependency treebank of "逍遥游" ("Enjoyment in Untroubled Ease") written by Zhuangzi.
Eberhard Karls Universität Tübingen
Literary Chinese
UD Coptic-Scriptorium – Homepage
UD Coptic contains manually annotated Sahidic Coptic texts, including Biblical texts, sermons, letters, and hagiography.
Eberhard Karls Universität Tübingen
Coptic
UD Croatian-SET – Homepage
The Croatian UD treebank is based on the extension of the SETimes-HR corpus, the [hr500k](http://hdl.handle.net/11356/1183) corpus.
Eberhard Karls Universität Tübingen
Croatian
UD Czech-CAC – Homepage
The UD_Czech-CAC treebank is based on the Czech Academic Corpus 2.0 (CAC; Český akademický korpus; ČAK), created at Charles University in Prague.
Eberhard Karls Universität Tübingen
Czech
UD Czech-CLTT – Homepage
The UD_Czech-CLTT treebank is based on the Czech Legal Text Treebank 2.0, created at the Charles University in Prague.
Eberhard Karls Universität Tübingen
Czech
UD Czech-FicTree – Homepage
FicTree is a treebank of Czech fiction, automatically converted into the UD format. The treebank was built at Charles University in Prague.
Eberhard Karls Universität Tübingen
Czech
UD Czech-PDT – Homepage
The Czech-PDT UD treebank is based on the Prague Dependency Treebank – Consolidated 1.0 (PDT-C), created at the Charles University in Prague.
Eberhard Karls Universität Tübingen
Czech
UD Czech-Poetry – Homepage
UD_Czech-Poetry contains random samples of Czech 19th-century poetry from the Corpus of Czech Verse parsed with UDPipe2 (trained on UD Czech-PDT 2.11) and manually corrected.
Eberhard Karls Universität Tübingen
Czech
UD Czech-PUD – Homepage
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
Eberhard Karls Universität Tübingen
Czech
UD Danish-DDT – Homepage
The Danish UD treebank is a conversion of the Danish Dependency Treebank.
Eberhard Karls Universität Tübingen
Danish
UD Dutch-Alpino – Homepage
This corpus consists of samples from various treebanks annotated at the University of Groningen using the Alpino annotation tools and guidelines.
Eberhard Karls Universität Tübingen
Dutch
UD Dutch-LassySmall – Homepage
This corpus contains sentences from the Wikipedia section of the Lassy Small Treebank. Universal Dependency annotation was generated automatically from the original annotation in Lassy.
Eberhard Karls Universität Tübingen
Dutch
UD Egyptian-UJaen – Homepage
Egyptian-UJaen is the first morphosyntactic treebank created for Pre-Coptic Egyptian in Universal Dependencies. It contains sentences manually annotated at the University of Jaén (Spain) that were selected from texts written in Old Egyptian, Middle Egyptian, Late Egyptian and Demotic.
Eberhard Karls Universität Tübingen
Egyptian (Ancient)
UD English-Atis – Homepage
UD Atis Treebank is a manually annotated treebank consisting of the sentences in the Atis (Airline Travel Information Systems) dataset, which includes transcriptions of human speech from people asking automated inquiry systems for flight information.
Eberhard Karls Universität Tübingen
English
UD English-CTeTex – Homepage
UD_English-CTeTex is a technical text corpus annotated in Universal Dependency syntax containing 196 software requirements.
Eberhard Karls Universität Tübingen
English
UD English-ESLSpok – Homepage
This repository includes the Dependency Treebank of Spoken L2 English (SL2E), which consists of Universal Dependency annotations for a random sample of sentences from the [NICT JLE](https://alaginrc.nict.go.jp/nict_jle/index_E.html), a corpus of spoken second language English. [The homepage of the project is here.](https://github.com/LCR-ADS-Lab/SL2E-Dependency-Treebank)
Eberhard Karls Universität Tübingen
English
UD English-EWT – Homepage
A Gold Standard Universal Dependencies Corpus for English, built over the source material of the English Web Treebank LDC2012T13 (https://catalog.ldc.upenn.edu/LDC2012T13).
Eberhard Karls Universität Tübingen
English
UD English-GENTLE – Homepage
Repository for the Genre Tests for Linguistic Evaluation (GENTLE) Corpus
Eberhard Karls Universität Tübingen
English
UD English-GUM – Homepage
Universal Dependencies syntax annotations from the GUM corpus (https://gucorpling.org/gum/)
Eberhard Karls Universität Tübingen
English
UD English-GUMReddit – Homepage
Universal Dependencies syntax annotations from the Reddit portion of the GUM corpus (https://gucorpling.org/gum/)
Eberhard Karls Universität Tübingen
English
UD English-LinES – Homepage
UD English_LinES is the English half of the LinES Parallel Treebank with the original dependency annotation first automatically converted into Universal Dependencies and then partially reviewed. Its contents cover literature, an online manual and Europarl data.
Eberhard Karls Universität Tübingen
English
UD English-ParTUT – Homepage
UD_English-ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin, and consisting of a variety of text genres, including talks, legal texts and Wikipedia articles, among others.
Eberhard Karls Universität Tübingen
English
UD English-Pronouns – Homepage
UD English-Pronouns is a dataset created to make pronoun identification more accurate and to give a more balanced distribution across genders. The dataset initially targets the independent genitive pronouns "hers", (independent) "his", (singular) "theirs", "mine", and (singular) "yours".
Eberhard Karls Universität Tübingen
English
UD English-PUD – Homepage
This is the English portion of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies (http://universaldependencies.org/conll17/).
Eberhard Karls Universität Tübingen
English
UD Erzya-JR – Homepage
UD Erzya is the original annotation (CoNLL-U) for texts in the Erzya language. It consists of a sample from a number of fiction authors writing original works in Erzya.
Eberhard Karls Universität Tübingen
Erzya
UD Estonian-EDT – Homepage
UD Estonian is a converted version of the Estonian Dependency Treebank (EDT), originally annotated in the Constraint Grammar (CG) annotation scheme, and consisting of genres of fiction, newspaper texts and scientific texts. The treebank contains 30,972 trees, 437,769 tokens.
Eberhard Karls Universität Tübingen
Estonian
UD Estonian-EWT – Homepage
The UD EWT treebank consists of different genres of new media. The treebank contains 7,190 trees, 90,585 tokens.
Eberhard Karls Universität Tübingen
Estonian
UD Faroese-FarPaHC – Homepage
UD_Faroese-FarPaHC is a conversion of the [Faroese Parsed Historical Corpus (FarPaHC)](https://github.com/einarfs/farpahc) to the Universal Dependencies scheme. The conversion was done using [UDConverter](https://github.com/thorunna/UDConverter).
Eberhard Karls Universität Tübingen
Faroese
UD Faroese-OFT – Homepage
This is a treebank of Faroese based on the Faroese Wikipedia.
Eberhard Karls Universität Tübingen
Faroese
UD Finnish-FTB – Homepage
FinnTreeBank 1 consists of manually annotated grammatical examples from VISK. The UD version of FinnTreeBank 1 was converted from a native annotation model with a script and later manually revised.
Eberhard Karls Universität Tübingen
Finnish
UD Finnish-OOD – Homepage
Finnish-OOD is an external out-of-domain test set for Finnish-TDT, annotated natively in the UD scheme.
Eberhard Karls Universität Tübingen
Finnish
UD Finnish-PUD – Homepage
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
Eberhard Karls Universität Tübingen
Finnish
UD Finnish-TDT – Homepage
UD_Finnish-TDT is based on the Turku Dependency Treebank (TDT), a broad-coverage dependency treebank of general Finnish covering numerous genres. The conversion to UD was followed by extensive manual checks and corrections, and the treebank closely adheres to the UD guidelines.
Eberhard Karls Universität Tübingen
Finnish
UD French-FQB – Homepage
The corpus **UD_French-FQB** is an automatic conversion of the [French QuestionBank v1](http://alpage.inria.fr/Treebanks/FQB/), a corpus entirely made of questions.
Eberhard Karls Universität Tübingen
French
UD French-GSD – Homepage
The **UD_French-GSD** was converted in 2015 from the content head version of the universal dependency treebank v2.0 (https://github.com/ryanmcd/uni-dep-tb). It has been updated since 2015 independently of the previous source.
Eberhard Karls Universität Tübingen
French
UD French-ParisStories – Homepage
Paris Stories is a corpus of oral French collected and transcribed by Linguistics students from Sorbonne Nouvelle and corrected by students from the Plurital Master's Degree of Computational Linguistics (Inalco, Paris Nanterre, Sorbonne Nouvelle) between 2017 and 2021. It contains monologues and dialogues from speakers living in the Parisian region.
Eberhard Karls Universität Tübingen
French
UD French-ParTUT – Homepage
UD_French-ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin, and consisting of a variety of text genres, including talks, legal texts and Wikipedia articles, among others.
Eberhard Karls Universität Tübingen
French
UD French-PUD – Homepage
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
Eberhard Karls Universität Tübingen
French
UD French-Rhapsodie – Homepage
A Universal Dependencies corpus for spoken French.
Eberhard Karls Universität Tübingen
French
UD French-Sequoia – Homepage
**UD_French-Sequoia** is an automatic conversion of the [SUD_French-Sequoia](https://github.com/surfacesyntacticud/SUD_French-Sequoia) treebank, which derives from the earlier [French Sequoia corpus](http://deep-sequoia.inria.fr).
Eberhard Karls Universität Tübingen
French
UD Frisian Dutch-Fame – Homepage
UD_Frisian_Dutch-Fame is a selection of 400 sentences from the FAME! speech corpus by Yilmaz et al. (2016a, 2016b). The treebank is manually annotated using the UD scheme.
Eberhard Karls Universität Tübingen
Saterfriesisch
UD Galician-CTG – Homepage
The Galician UD treebank is based on the automatic parsing of the Galician Technical Corpus (http://sli.uvigo.gal/CTG) created at the University of Vigo by the TALG NLP research group.
Eberhard Karls Universität Tübingen
Galician
UD Galician-PUD – Homepage
The Galician PUD is a treebank for Galician developed at CiTIUS (Universidade de Santiago de Compostela).
Eberhard Karls Universität Tübingen
Galician
UD Galician-TreeGal – Homepage
The Galician-TreeGal is a treebank for Galician developed at LyS Group (Universidade da Coruña) and at CiTIUS (Universidade de Santiago de Compostela).
Eberhard Karls Universität Tübingen
Galician
UD Georgian-GLC – Homepage
The Georgian UD Treebank (UD_Georgian-GLC) is the first syntactically annotated corpus of Georgian, based on a collection of annotated sentences selected from the Georgian Language Corpus (GLC) available at http://corpora.iliauni.edu.ge/ and sentences selected from Wiki in accordance with the 132 scientific fields.
Eberhard Karls Universität Tübingen
Georgian
UD German-GSD – Homepage
The German UD is converted from the content head version of the [universal dependency treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb).
Eberhard Karls Universität Tübingen
German
UD German-HDT – Homepage
UD German-HDT is a conversion of the Hamburg Dependency Treebank, created at the University of Hamburg through manual annotation in conjunction with a standard for morphologically and syntactically annotating sentences as well as a constraint-based parser.
Eberhard Karls Universität Tübingen
German
UD German-LIT – Homepage
This treebank aims to gather texts from German literary history. Currently, it hosts fragments of early Romanticism, i.e. aphorism-like texts mainly dealing with philosophical issues concerning art, beauty and related topics.
Eberhard Karls Universität Tübingen
German
UD German-PUD – Homepage
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
Eberhard Karls Universität Tübingen
German
UD Gheg-GPS – Homepage
UD Gheg Pear Stories (GPS) contains renarrations of Wallace Chafe's Pear Stories video (pearstories.org) by heritage speakers of Gheg Albanian living in Switzerland and speakers from Prishtina.
Eberhard Karls Universität Tübingen
Gheg Albanian
UD Gothic-PROIEL – Homepage
The UD Gothic treebank is based on the Gothic data from the PROIEL treebank, and consists of Wulfila's Bible translation.
Eberhard Karls Universität Tübingen
Gothic
UD Greek-GDT – Homepage
The Greek UD treebank (UD_Greek-GDT) is derived from the Greek Dependency Treebank (http://gdt.ilsp.gr), a resource developed and maintained by researchers at the Institute for Language and Speech Processing/Athena R.C. (http://www.ilsp.gr).
Eberhard Karls Universität Tübingen
Modern Greek (1453-)
UD Greek-GUD – Homepage
GUD is a resource for Greek (EL), manually annotated for morphology and syntax. It is an ongoing project led by Stella Markantonatou and Vivian Stamou (hereinafter: the GUD team), both researchers at the [Institute for Language and Speech Processing](http://www.ilsp.gr/) (ILSP/Athena Research Centre).
Eberhard Karls Universität Tübingen
Modern Greek (1453-)
UD Guajajara-TuDeT – Homepage
UD_Guajajara-TuDeT is a collection of annotated sentences in [Guajajara](https://glottolog.org/resource/languoid/id/guaj1255). Sentences stem from multiple sources such as descriptions of the language, short stories, dictionaries and translations from the New Testament. Sentence annotation and documentation by Lorena Martín Rodríguez and Fabrício Ferraz Gerardi.
Eberhard Karls Universität Tübingen
Guajajára
UD Guarani-OldTuDeT – Homepage
UD_Guarani-OldTuDeT is a collection of annotated texts in [Old Guaraní](https://glottolog.org/resource/languoid/id/oldp1258). All known sources in this language are being annotated: catechisms, grammars (seventeenth and eighteenth century), sentences from dictionaries, and other texts. Sentence annotation and documentation by Fabrício Ferraz Gerardi and Lorena Martín Rodríguez.
Eberhard Karls Universität Tübingen
Guarani
UD Gujarati-GujTB – Homepage
GujTB is an in-progress treebank of Gujarati (an Indo-Aryan language) in Gujarati script.
Eberhard Karls Universität Tübingen
Gujarati
UD Haitian Creole-Autogramm – Homepage
This is a treebank of Haitian Creole. It contains 144 sentences selected from 3 major genres: Bible, literary texts, newspapers. Kreyòl (Kreyòl Ayisyen, Haitian Creole, iso-639-1: ht) is the main language of Haïti. The dialect described here is the Cap Haïtien dialect, which differs slightly in its lexicon from the Center and South varieties.
Eberhard Karls Universität Tübingen
Haitian
UD Hausa-NorthernAutogramm – Homepage
This treebank contains data of Northern Autogramm, for the Ader dialect of Niger Republic (Northern Hausa).
Eberhard Karls Universität Tübingen
Hausa
UD Hausa-SouthernAutogramm – Homepage
This treebank contains data of Southern Autogramm, for the Zaria dialect of Nigeria (Southern Hausa).
Eberhard Karls Universität Tübingen
Hausa
UD Hebrew-HTB – Homepage
A Universal Dependencies Corpus for Hebrew.
Eberhard Karls Universität Tübingen
Hebrew
UD Hebrew-IAHLTwiki – Homepage
Publicly available subset of the IAHLT UD Hebrew Treebank's Wikipedia section (https://www.iahlt.org/)
Eberhard Karls Universität Tübingen
Hebrew
UD Highland Puebla Nahuatl-ITML – Homepage
UD_Highland_Puebla_Nahuatl-ITML is a collection of texts in the Highland Puebla variety of Nahuatl (ISO-639: `azz`) spoken in 24 municipalities in the state of Puebla, Mexico. The treebank contains spoken monologue and dialogue, scientific texts translated from Spanish and some miscellaneous grammatical examples from a language course.
Eberhard Karls Universität Tübingen
Highland Puebla Nahuatl
UD Hindi-HDTB – Homepage
The Hindi UD treebank is based on the Hindi Dependency Treebank (HDTB), created at IIIT Hyderabad, India.
Eberhard Karls Universität Tübingen
Hindi
UD Hindi-PUD – Homepage
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
Eberhard Karls Universität Tübingen
Hindi
UD Hittite-HitTB – Homepage
UD_Hittite-HitTB is a small Universal Dependencies treebank for Hittite, containing original sentences from Hoffner and Melchert's tutorial to A Grammar of the Hittite Language.
Eberhard Karls Universität Tübingen
Hittite
UD Hungarian-Szeged – Homepage
The Hungarian UD treebank is derived from the Szeged Dependency Treebank (Vincze et al. 2010).
Eberhard Karls Universität Tübingen
Hungarian
UD Icelandic-GC – Homepage
UD_Icelandic-GC is a conversion of the gold part of [GreynirCorpus](https://github.com/mideind/GreynirCorpus), which has been manually corrected and verified. The corpus is parsed into full constituency trees, and converted using [UDConverter-GreynirCorpus](https://github.com/thorunna/UDConverter-GreynirCorpus).
Eberhard Karls Universität Tübingen
Icelandic
UD Icelandic-IcePaHC – Homepage
UD_Icelandic-IcePaHC is a conversion of the [Icelandic Parsed Historical Corpus (IcePaHC)](https://linguist.is/icelandic_treebank/Icelandic_Parsed_Historical_Corpus_(IcePaHC)) to the Universal Dependencies scheme. The conversion was done using [UDConverter](https://github.com/thorunna/UDConverter).
Eberhard Karls Universität Tübingen
Icelandic
UD Icelandic-Modern – Homepage
UD_Icelandic-Modern is a conversion of the [modern additions](https://github.com/antonkarl/icecorpus/tree/master/additions2019) to the Icelandic Parsed Historical Corpus (IcePaHC) to the Universal Dependencies scheme.
Eberhard Karls Universität Tübingen
Icelandic
UD Icelandic-PUD – Homepage
Icelandic-PUD is the Icelandic part of the Parallel Universal Dependencies (PUD) treebanks.
Eberhard Karls Universität Tübingen
Icelandic
UD Indonesian-CSUI – Homepage
UD Indonesian-CSUI is a conversion from an Indonesian constituency treebank in the Penn Treebank format named [**Kethu**](https://github.com/ialfina/kethu) that was also a conversion from a constituency treebank built by [**Dinakaramani et al. (2015)**](https://github.com/famrashel/idn-treebank). We named this treebank **Indonesian-CSUI**, since all the three versions of the treebanks were built at Faculty of Computer Science, Universitas Indonesia.
Eberhard Karls Universität Tübingen
Indonesian
UD Indonesian-GSD – Homepage
The Indonesian-GSD treebank was originally converted from the content head version of the [universal dependency treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb) in 2015. In order to comply with the latest Indonesian annotation guidelines, the treebank has undergone a major revision between UD releases v2.8 and v2.9 (2021).
Eberhard Karls Universität Tübingen
Indonesian
UD Indonesian-PUD – Homepage
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
Eberhard Karls Universität Tübingen
Indonesian
UD Irish-Cadhan – Homepage
This is the Cadhan Aonair UD treebank, consisting of 150 sentences randomly sampled from six pre-standard Irish texts. It was subsequently augmented with a late Early Modern Irish syllabic poem representing 43 sentences, described in a separate section below.
Eberhard Karls Universität Tübingen
Irish
UD Irish-IDT – Homepage
A Universal Dependencies 4910-sentence treebank for modern Irish.
Eberhard Karls Universität Tübingen
Irish
UD Irish-TwittIrish – Homepage
A Universal Dependencies treebank of 2596 tweets in modern Irish.
Eberhard Karls Universität Tübingen
Irish
UD Italian-ISDT – Homepage
The Italian corpus annotated according to the UD annotation scheme was obtained by conversion from ISDT (Italian Stanford Dependency Treebank), released for the dependency parsing shared task of Evalita-2014 (Bosco et al. 2014).
Eberhard Karls Universität Tübingen
Italian
UD Italian-Old – Homepage
Italian-Old is a treebank containing **Dante Alighieri's Comedy**, based on the 1994 Petrocchi edition and taken from the [**DanteSearch corpus**](https://dantesearch.dantenetwork.it), originally created at the University of Pisa, Italy. The syntactic annotation has been done from scratch, following the UD annotation scheme. It is a treebank of Old Italian, specifically Florentine. The Comedy was composed between approximately 1306 and 1321.
Eberhard Karls Universität Tübingen
Italian
UD Italian-ParlaMint – Homepage
ParlaMint-It is a collection of transcriptions of parliamentary sessions of the Italian Senate annotated in Universal Dependencies. The corpus is part of a larger multilingual collection of parliamentary transcripts built during the ParlaMint project (https://www.clarin.eu/parlamint).
Eberhard Karls Universität Tübingen
Italian
UD Italian-ParTUT – Homepage
UD_Italian-ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin, and consisting of a variety of text genres, including talks, legal texts and Wikipedia articles, among others.
Eberhard Karls Universität Tübingen
Italian
UD Italian-PoSTWITA – Homepage
PoSTWITA-UD is a collection of Italian tweets annotated in Universal Dependencies that can be exploited for the training of NLP systems to enhance their performance on social media texts.
Eberhard Karls Universität Tübingen
Italian
UD Italian-PUD – Homepage
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
Eberhard Karls Universität Tübingen
Italian
UD Italian-TWITTIRO – Homepage
TWITTIRÒ-UD is a collection of ironic Italian tweets annotated in Universal Dependencies. The treebank can be exploited for the training of NLP systems to enhance their performance on social media texts, and in particular, for irony detection purposes.
Eberhard Karls Universität Tübingen
Italian
UD Italian-Valico – Homepage
Manually corrected treebank of Learner Italian drawn from the Valico corpus, together with the corresponding corrected sentences.
Eberhard Karls Universität Tübingen
Italian
UD Italian-VIT – Homepage
The UD_Italian-VIT corpus was obtained by conversion from VIT (Venice Italian Treebank), developed at the Laboratory of Computational Linguistics of the Università Ca' Foscari in Venice (Delmonte et al. 2007; Delmonte 2009; http://rondelmo.it/resource/VIT/Browser-VIT/index.htm).
Eberhard Karls Universität Tübingen
Italian
UD Japanese-BCCWJ – Homepage
This Universal Dependencies (UD) Japanese treebank is based on the definition of UD Japanese convention described in the UD documentation. The original sentences are from the 'Balanced Corpus of Contemporary Written Japanese' (BCCWJ).
Eberhard Karls Universität Tübingen
Japanese
UD Japanese-BCCWJLUW – Homepage
This Universal Dependencies (UD) Japanese treebank is based on the definition of UD Japanese convention described in the UD documentation. The original sentences are from the 'Balanced Corpus of Contemporary Written Japanese' (BCCWJ). UD-Japanese-BCCWJLUW is an alternative word-segmentation version of UD-Japanese-BCCWJ. It uses the **Long Unit Word (LUW)** as the syntactic word in the UD definition.
Eberhard Karls Universität Tübingen
Japanese
UD Japanese-GSD – Homepage
This Universal Dependencies (UD) Japanese treebank is based on the definition of UD Japanese convention described in the UD documentation. The original sentences are from Google UDT 2.0.
Eberhard Karls Universität Tübingen
Japanese
UD Japanese-GSDLUW – Homepage
This Universal Dependencies (UD) Japanese treebank is based on the definition of UD Japanese convention described in the UD documentation. The original sentences are from Google UDT 2.0.
Eberhard Karls Universität Tübingen
Japanese
UD Japanese-PUD – Homepage
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
Eberhard Karls Universität Tübingen
Japanese
UD Japanese-PUDLUW – Homepage
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
Eberhard Karls Universität Tübingen
Japanese
UD Javanese-CSUI – Homepage
UD Javanese-CSUI is a dependency treebank in Javanese, a regional language in Indonesia with more than 68 million users. It was developed by Alfina et al. from the Faculty of Computer Science, Universitas Indonesia. The newest version has 1000 sentences and 14K words with manual annotation.
Eberhard Karls Universität Tübingen
Javanese
UD Kaapor-TuDeT – Homepage
**UD_Kaapor-TuDeT** is a collection of annotated sentences in [Ka'apor](https://glottolog.org/resource/languoid/id/urub1250). The project is a work in progress and the treebank is being updated on a regular basis.
Eberhard Karls Universität Tübingen
Urubú-Kaapor
UD Kangri-KDTB – Homepage
The Kangri UD Treebank (KDTB) is a part of the Universal Dependency treebank project.
Eberhard Karls Universität Tübingen
Kangri
UD Karelian-KKPP – Homepage
UD Karelian-KKPP is a new, manually annotated corpus of Karelian made in the Universal Dependencies annotation scheme. The data is collected from [VepKar corpora](http://dictorpus.krc.karelia.ru/en/corpus/text) and consists of mostly modern news texts but also some stories and educational texts.
Eberhard Karls Universität Tübingen
Karelian
UD Karo-TuDeT – Homepage
UD_Karo-TuDeT is a collection of annotated sentences in [Karo](https://glottolog.org/resource/languoid/id/karo1306). The sentences stem from the only grammatical description of the language (Gabas, 1999) and from the sentences in the dictionary by the same author (Gabas, 2007). Sentence annotation and documentation by Fabrício Ferraz Gerardi.
Eberhard Karls Universität Tübingen
Karo (Brazil)
UD Kazakh-KTB – Homepage
The UD Kazakh treebank is a combination of text from various sources including Wikipedia, some folk tales, sentences from the UDHR, news and phrasebook sentences. Sentence IDs include partial document identifiers.
Eberhard Karls Universität Tübingen
Kazakh
UD Khunsari-AHA – Homepage
The AHA Khunsari Treebank is a small treebank for contemporary Khunsari. Its corpus is collected and annotated manually. We have prepared this treebank based on interviews with Khunsari speakers.
Eberhard Karls Universität Tübingen
Khunsari
UD Kiche-IU – Homepage
UD Kʼicheʼ-IU is a treebank consisting of sentences from a variety of text domains but principally dictionary example sentences and linguistic examples.
Eberhard Karls Universität Tübingen
K'iche'
UD Komi Permyak-UH – Homepage
This is a Komi-Permyak literary language treebank consisting of original and translated texts.
Eberhard Karls Universität Tübingen
Komi-Permyak
UD Komi Zyrian-IKDP – Homepage
This treebank consists of dialectal transcriptions of spoken Komi-Zyrian. The current texts are short recorded segments from different areas where the Iźva dialect of Komi language is spoken.
Eberhard Karls Universität Tübingen
Komi
UD Komi Zyrian-Lattice – Homepage
UD Komi-Zyrian Lattice is a treebank of written standard Komi-Zyrian.
Eberhard Karls Universität Tübingen
Komi
UD Korean-GSD – Homepage
The Google Korean Universal Dependency Treebank is first converted from the [Universal Dependency Treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb), and then enhanced by Chun et al., 2018.
Eberhard Karls Universität Tübingen
Korean
UD Korean-Kaist – Homepage
The KAIST Korean Universal Dependency Treebank is generated by Chun et al., 2018 from the constituency trees in the [KAIST Tree-Tagging Corpus](http://semanticweb.kaist.ac.kr/home/index.php/Corpus4).
Eberhard Karls Universität Tübingen
Korean
UD Korean-PUD – Homepage
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
Eberhard Karls Universität Tübingen
Korean
UD Kurmanji-MG – Homepage
The UD Kurmanji corpus is a corpus of Kurmanji Kurdish. It contains fiction and encyclopaedic texts in roughly equal measure. It has been annotated natively in accordance with the UD annotation scheme.
Eberhard Karls Universität Tübingen
Northern Kurdish
UD Kyrgyz-KTMU – Homepage
UD_Kyrgyz-KTMU is a dependency treebank of the Kyrgyz language. Sentences were selected partly from Kyrgyz story and novel books, partly from Kyrgyz news websites.
Eberhard Karls Universität Tübingen
Kirghiz
UD Kyrgyz-TueCL – Homepage
This is a small treebank of grammatical examples for Kyrgyz.
Eberhard Karls Universität Tübingen
Kirghiz
UD Latgalian-Cairo – Homepage
UD_Latgalian-Cairo is an example treebank that provides a minimal dataset for Latgalian based on the Cairo sample sentences. Created by [AI Lab](http://ailab.lv) at Institute of Mathematics and Computer Science, University of Latvia.
Eberhard Karls Universität Tübingen
Latgalian
UD Latin-CIRCSE – Homepage
UD_Latin-CIRCSE is a repository of treebanks featuring Latin texts natively annotated at the CIRCSE Research Centre in Milan (https://centridiricerca.unicatt.it/circse/en.html) following the Universal Dependencies (UD) (https://universaldependencies.org) annotation scheme. The repository includes prose and poetry texts from different periods.
Eberhard Karls Universität Tübingen
Latin
UD Latin-ITTB – Homepage
Latin data from the _Index Thomisticus_ Treebank. Data are taken from the _Index Thomisticus_ corpus by Roberto Busa SJ, which contains the complete work by Thomas Aquinas (1225–1274; Medieval Latin) and by 61 other authors related to Thomas.
Eberhard Karls Universität Tübingen
Latin
UD Latin-LLCT – Homepage
This Universal Dependencies version of the **LLCT** (Late Latin Charter Treebank) consists of an automated conversion of the **LLCT2** treebank from the Latin Dependency Treebank (LDT) format into the Universal Dependencies standard.
Eberhard Karls Universität Tübingen
Latin
UD Latin-Perseus – Homepage
This Universal Dependencies Latin Treebank consists of an automatic conversion of a selection of passages from the Ancient Greek and Latin Dependency Treebank 2.1.
Eberhard Karls Universität Tübingen
Latin
UD Latin-PROIEL – Homepage
The Latin PROIEL treebank is based on the Latin data from the PROIEL treebank, and contains most of the Vulgate New Testament translations plus selections from Caesar's Gallic War, Cicero's Letters to Atticus, Palladius' Opus Agriculturae and the first book of Cicero's De officiis.
Eberhard Karls Universität Tübingen
Latin
UD Latin-UDante – Homepage
The **UDante** treebank is based on the Latin texts of Dante Alighieri, taken from the [**DanteSearch corpus**](https://dantesearch.dantenetwork.it), originally created at the University of Pisa, Italy. It is a treebank of Latin language, more precisely of **literary Medieval Latin** (XIVth century).
Eberhard Karls Universität Tübingen
Latin
UD Latvian-Cairo – Homepage
This is an example treebank made to illustrate UD annotation choices made for Latvian based on the Cairo sample sentences. Created by [AI Lab](http://ailab.lv) at Institute of Mathematics and Computer Science, University of Latvia.
Eberhard Karls Universität Tübingen
Latvian
UD Latvian-LVTB – Homepage
Latvian UD Treebank is based on Latvian Treebank ([LVTB](http://sintakse.korpuss.lv)), being created at University of Latvia, Institute of Mathematics and Computer Science, [Artificial Intelligence Laboratory](http://ailab.lv).
Eberhard Karls Universität Tübingen
Latvian
UD Ligurian-GLT – Homepage
The Genoese Ligurian Treebank is a small, manually annotated collection of contemporary Ligurian prose. The focus of the treebank is written Genoese, the koiné variety of Ligurian which is associated with today's literary, journalistic and academic ligurophone sphere.
Eberhard Karls Universität Tübingen
Ligurian
UD Lithuanian-ALKSNIS – Homepage
The Lithuanian dependency treebank ALKSNIS v3.0 (Vytautas Magnus University).
Eberhard Karls Universität Tübingen
Lithuanian
UD Lithuanian-HSE – Homepage
Lithuanian treebank annotated manually (dependencies) using the Morphological Annotator by CCL, Vytautas Magnus University (http://tekstynas.vdu.lt/) and manual disambiguation. A pilot version which includes news and an essay by Tomas Venclova is available here.
Eberhard Karls Universität Tübingen
Lithuanian
UD Livvi-KKPP – Homepage
UD Livvi-KKPP is a manually annotated new corpus of Livvi-Karelian made directly in the Universal dependencies annotation scheme. The data is collected from [VepKar corpora](http://dictorpus.krc.karelia.ru/en/corpus/text) and consists of mostly modern news texts but also some stories and educational texts.
Eberhard Karls Universität Tübingen
Livvi
UD Low Saxon-LSDC – Homepage
The UD Low Saxon LSDC dataset consists of sentences in 8 major Low Saxon dialect groups from both Germany and the Netherlands. These sentences are (or are to become) part of the LSDC dataset and represent the language from mostly the 19th and early 20th century in genres such as short stories, novels, speeches, letters and fairytales.
Eberhard Karls Universität Tübingen
Low German
UD Luxembourgish-LuxBank – Homepage
The LuxBank corpus currently consists of the translated Cairo CICLing examples, and will be extended to include examples from a national dataset. It is the first comprehensive treebank dataset for Luxembourgish.
Eberhard Karls Universität Tübingen
Luxembourgish
UD Macedonian-MTB – Homepage
The Macedonian-MTB treebank is a collection of annotated sentences taken from the Macedonian version of the Cairo CICLing Corpus and from the university textbook in syntax "Contemporary Macedonian Language 4" by Simov Sazdov.
Eberhard Karls Universität Tübingen
Macedonian
UD Madi-Jarawara – Homepage
UD_Madi-Jarawara is a collection of annotated sentences in Madí (Jarawara dialect) from a variety of sources, including grammar examples, oral stories, didactic material, and dictionary examples.
Eberhard Karls Universität Tübingen
Ma'di
UD Maghrebi Arabic French-Arabizi – Homepage
A Universal Dependencies corpus for a romanized user-generated content variety of Algerian, a North-African Arabic dialect known for its frequent usage of code-switching. In addition to the UD annotations, we added NER annotations extending the French Treebank NER scheme (Sagot et al., 2012) and offensive language classification, and corrected many of the translations (still ongoing).
Eberhard Karls Universität Tübingen
Bolgarian
UD Makurap-TuDeT – Homepage
UD_Makuráp-TuDeT is a collection of annotated texts in Makuráp. The project is a work in progress and the treebank is being updated on a regular basis. The sentences are being annotated by Carolina Aragon, Fabrício Ferraz Gerardi, Luana dos Santos, and Luan Cabral.
Eberhard Karls Universität Tübingen
Makuráp
UD Malayalam-UFAL – Homepage
Currently just a small sample of Malayalam grammatical examples.
Eberhard Karls Universität Tübingen
Malayalam
UD Maltese-MUDT – Homepage
MUDT (Maltese Universal Dependencies Treebank) is a manually annotated treebank of Maltese, a Semitic language of Malta descended from North African Arabic with a significant amount of Italo-Romance influence. MUDT was designed as a balanced corpus with four major genres (see Splitting below) represented roughly equally.
Eberhard Karls Universität Tübingen
Maltese
UD Manx-Cadhan – Homepage
This is the Cadhan Aonair UD treebank for Manx Gaelic, created by Kevin Scannell.
Eberhard Karls Universität Tübingen
Manx
UD Marathi-UFAL – Homepage
UD Marathi is a manually annotated treebank consisting primarily of stories from Wikisource, and parts of an article on Wikipedia.
Eberhard Karls Universität Tübingen
Marathi
UD Mbya Guarani-Dooley – Homepage
UD Mbya_Guarani-Dooley is a corpus of narratives written in Mbyá Guaraní (Tupian) in Brazil, and collected by Robert Dooley. Due to copyright restrictions, the corpus that is distributed as part of UD only contains the annotation (tags, features, relations) while the FORM and LEMMA columns are empty.
Eberhard Karls Universität Tübingen
Guarani
UD Mbya Guarani-Thomas – Homepage
UD Mbya_Guarani-Thomas is a corpus of Mbyá Guaraní (Tupian) texts collected by Guillaume Thomas. The current version of the corpus consists of three speeches by Paulina Kerechu Núñez Romero, a Mbyá Guaraní speaker from Ytu, Caazapá Department, Paraguay.
Eberhard Karls Universität Tübingen
Guarani
UD Middle French-PROFITEROLE – Homepage
UD_Middle_French-PROFITEROLE is the Middle French section of the PROFITEROLE corpus; the Old French section is UD_Old_French-PROFITEROLE.
Eberhard Karls Universität Tübingen
Middle French (ca. 1400-1600)
UD Moksha-JR – Homepage
UD_Moksha-JR originates from the Erme Universal Dependencies annotated texts for Moksha, with CoNLL-U annotation of texts in the Moksha language. It originally consists of a sample from a number of fiction authors writing originals in Moksha.
Eberhard Karls Universität Tübingen
Moksha
UD Munduruku-TuDeT – Homepage
UD_Munduruku-TuDeT is a collection of annotated sentences in [Mundurukú](http://www.endangeredlanguages.com/lang/2981). The project is a work in progress and the treebank is being updated on a regular basis.
Eberhard Karls Universität Tübingen
Mundurukú
UD Nayini-AHA – Homepage
The AHA Nayini Treebank is a small treebank for contemporary Nayini. Its corpus is collected and annotated manually. We have prepared this treebank based on interviews with Nayini speakers.
Eberhard Karls Universität Tübingen
Nayini
UD Neapolitan-RB – Homepage
This treebank contains example sentences in Neapolitan, translated by a native speaker.
Eberhard Karls Universität Tübingen
Neapolitan
UD Nheengatu-CompLin – Homepage
The [UD_Nheengatu-CompLin](https://doi.org/10.5753/stil.2023.234131) is a treebank of [Nheengatu](https://glottolog.org/resource/languoid/id/nhen1239) (ISO-639: `yrl`), also known, inter alia, as Modern Tupi and *Língua Geral Amazônica*. It comprises sentences from diverse published sources, e.g., spontaneous speech, grammatical descriptions, fables, myths, coursebooks, and dictionaries.
Eberhard Karls Universität Tübingen
Nhengatu
UD North Sami-Giella – Homepage
This is a North Sámi treebank based on a manually disambiguated and function-labelled gold-standard corpus of North Sámi produced by the Giellatekno team at UiT Norgga árktalaš universitehta.
Eberhard Karls Universität Tübingen
Northern Sami
UD Norwegian-Bokmaal – Homepage
The Norwegian UD treebank is based on the Bokmål section of the Norwegian Dependency Treebank (NDT), which is a syntactic treebank of Norwegian. The current version of NDT has been automatically converted to the UD scheme by Ingerid Løyning Dale, Per Erik Solberg and Andre Kåsen at the Norwegian Language Bank at the National Library of Norway. This conversion builds to a large extent on previous conversions by Lilja Øvrelid at the University of Oslo.
Eberhard Karls Universität Tübingen
Norwegian
UD Norwegian-Nynorsk – Homepage
The Norwegian UD treebank is based on the Nynorsk section of the Norwegian Dependency Treebank (NDT), which is a syntactic treebank of Norwegian. NDT has been automatically converted to the UD scheme by Lilja Øvrelid at the University of Oslo.
Eberhard Karls Universität Tübingen
Norwegian
UD Old Church Slavonic-PROIEL – Homepage
The Old Church Slavonic (OCS) UD treebank is based on canonical Old Church Slavonic data from the PROIEL and TOROT treebanks.
Eberhard Karls Universität Tübingen
Church Slavic
UD Old East Slavic-Birchbark – Homepage
UD Old_East_Slavic-Birchbark is based on the RNC Corpus of Birchbark Letters and includes documents written in 1025-1500 in an East Slavic vernacular (letters, household and business records, records for church services, spells against diseases, and other short inscriptions). The treebank is manually syntactically annotated in the UD 2.0 scheme; the morphological and lexical annotation is a conversion of the original RNC annotation.
Eberhard Karls Universität Tübingen
Old Russian
UD Old East Slavic-RNC – Homepage
`UD_Old_East_Slavic-RNC` is a sample of the Middle Russian corpus (1300-1700), a part of the Russian National Corpus. The data were originally annotated according to the RNC and extended UD-Russian morphological schemas and UD 2.4 dependency schema.
Eberhard Karls Universität Tübingen
Old Russian
UD Old East Slavic-Ruthenian – Homepage
The Ruthenian UD treebank includes texts written in the territories of modern Belarus, Lithuania, Ukraine, and Poland in ca. 1300-1700. A sample of legal and nonfiction texts is drawn from the Ruthenian Corpus.
Eberhard Karls Universität Tübingen
Old Russian
UD Old East Slavic-TOROT – Homepage
UD_Old_East_Slavic-TOROT is a conversion of a selection of Old East Slavonic and Middle Russian data from the Tromsø Old Russian and OCS Treebank (TOROT), which was originally annotated in PROIEL dependency format.
Eberhard Karls Universität Tübingen
Old Russian
UD Old French-PROFITEROLE – Homepage
UD_Old_French-PROFITEROLE is an expansion of the previous UD_Old_French-SRCMF, which was a conversion of (part of) the SRCMF corpus (Syntactic Reference Corpus of Medieval French, [srcmf.org](http://srcmf.org/)).
Eberhard Karls Universität Tübingen
Old French (842-ca. 1400)
UD Old Irish-DipSGG – Homepage
A Universal Dependencies treebank for the Old Irish glosses of St. Gall.
Eberhard Karls Universität Tübingen
Old Irish (to 900)
UD Old Irish-DipWBG – Homepage
A Universal Dependencies treebank for the Old Irish Würzburg glosses.
Eberhard Karls Universität Tübingen
Old Irish (to 900)
UD Old Turkish-Clausal – Homepage
This repository contains an [Old Turkish](https://iso639-3.sil.org/code/otk) treebank built upon Old Turkic script texts.
Eberhard Karls Universität Tübingen
Old Turkish
UD Ottoman Turkish-BOUN – Homepage
An Ottoman Turkish dependency treebank annotated in UD style. Created by [Şaziye Betül Özateş](https://sb-b.github.io/), Tarık Emre Tıraş, Efe Eren Genç from Boğaziçi University, and Esma Fatıma Bilgin Taşdemir from Medeniyet University.
Eberhard Karls Universität Tübingen
Ottoman Turkish (1500-1928)
UD Ottoman Turkish-DUDU – Homepage
An Ottoman Turkish dependency treebank annotated in UD style. Created by Enes Yılandiloğlu.
Eberhard Karls Universität Tübingen
Ottoman Turkish (1500-1928)
UD Paumari-TueCL – Homepage
This is a small treebank of Paumari, a low-resource Amazonian language.
Eberhard Karls Universität Tübingen
Paumarí
UD Persian-Seraji – Homepage
The Persian Universal Dependency Treebank (Seraji) is based on Uppsala Persian Dependency Treebank (UPDT). The conversion of the UPDT to the Universal Dependencies was performed semi-automatically with extensive manual checks and corrections.
Eberhard Karls Universität Tübingen
Persian
UD Polish-LFG – Homepage
The LFG Enhanced UD treebank of Polish is based on a corpus of LFG (Lexical Functional Grammar) syntactic structures generated by an LFG grammar of Polish, POLFIE, and manually disambiguated by human annotators.
Eberhard Karls Universität Tübingen
Polish
UD Polish-PDB – Homepage
The Polish PDB-UD treebank is automatically converted from the Polish Dependency Bank 2.0 (PDB 2.0). Both treebanks were created at the [Institute of Computer Science, Polish Academy of Sciences](https://ipipan.waw.pl/en/) in Warsaw (Poland).
Eberhard Karls Universität Tübingen
Polish
UD Polish-PUD – Homepage
This is the Polish portion of the Parallel Universal Dependencies (PUD) treebanks, created at the [Institute of Computer Science, Polish Academy of Sciences](https://ipipan.waw.pl/en/) in Warsaw (Poland).
Eberhard Karls Universität Tübingen
Polish
UD Portuguese-Bosque – Homepage
This Universal Dependencies (UD) Portuguese treebank is based on the Constraint Grammar converted version of the Bosque, which is part of the Floresta Sintá(c)tica treebank. It contains both European (CETEMPúblico) and Brazilian (CETENFolha) variants.
Eberhard Karls Universität Tübingen
Portuguese
UD Portuguese-CINTIL – Homepage
CINTIL-UDep is a dependency bank of Portuguese that is treebanked with Universal Dependencies. It contains over 38K annotated sentences (and 476K tokens), of mostly newspaper text.
Eberhard Karls Universität Tübingen
Portuguese
UD Portuguese-GSD – Homepage
The Brazilian Portuguese UD is converted from the [Google Universal Dependency Treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb).
Eberhard Karls Universität Tübingen
Portuguese
UD Portuguese-PetroGold – Homepage
UD_Portuguese-PetroGold is a fully revised treebank which consists of academic texts from the oil & gas domain in Brazilian Portuguese.
Eberhard Karls Universität Tübingen
Portuguese
UD Portuguese-Porttinari – Homepage
Porttinari-base [(Duran et al., 2023)](https://sol.sbc.org.br/index.php/stil/article/view/25443/25264) is the journalistic portion of Porttinari (which stands for “PORTuguese Treebank”), which shall be a large multigenre treebank for Portuguese [(Pardo et al., 2021)](https://sol.sbc.org.br/index.php/stil/article/view/17778/17612), following the "Universal Dependencies" international grammar framework [(de Marneffe et al., 2021)](https://aclanthology.org/2021.cl-2.11/).
Eberhard Karls Universität Tübingen
Portuguese
UD Portuguese-PUD – Homepage
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
Eberhard Karls Universität Tübingen
Portuguese
UD Romanian-ArT – Homepage
The UD treebank ArT is a treebank of the Aromanian dialect of the Romanian language in UD format.
Eberhard Karls Universität Tübingen
Romanian
UD Romanian-Nonstandard – Homepage
The Romanian Non-standard UD treebank (called UAIC-RoDia) is based on UAIC-RoDia Treebank. UAIC-RoDia = ISLRN 156-635-615-024-0
Eberhard Karls Universität Tübingen
Romanian
UD Romanian-RRT – Homepage
The Romanian UD treebank (called RoRefTrees) (Barbu Mititelu et al., 2016) is the reference treebank in UD format for standard Romanian.
Eberhard Karls Universität Tübingen
Romanian
UD Romanian-SiMoNERo – Homepage
SiMoNERo is a medical corpus of contemporary Romanian.
Eberhard Karls Universität Tübingen
Romanian
UD Romanian-TueCL – Homepage
This is a (currently small) Twitter treebank containing a subset of tweets from [CoRoSeOf](https://github.com/DianaHoefels/CoRoSeOf).
Eberhard Karls Universität Tübingen
Romanian
UD Russian-GSD – Homepage
Russian Universal Dependencies Treebank annotated and converted by Google.
Eberhard Karls Universität Tübingen
Russian
UD Russian-Poetry – Homepage
UD_Russian-Poetry contains samples of Russian poetry written from the 19th to the early 21st century. The treebank is based on the Poetry Corpus of the Russian National Corpus.
Eberhard Karls Universität Tübingen
Russian
UD Russian-PUD – Homepage
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
Eberhard Karls Universität Tübingen
Russian
UD Russian-SynTagRus – Homepage
Russian data from the SynTagRus corpus.
Eberhard Karls Universität Tübingen
Russian
UD Russian-Taiga – Homepage
This Universal Dependencies treebank is based on data samples extracted from the Taiga Corpus and from the MorphoRuEval-2017 and GramEval-2020 shared task collections.
Eberhard Karls Universität Tübingen
Russian
UD Sanskrit-UFAL – Homepage
A small Sanskrit treebank of sentences from Pañcatantra, an ancient Indian collection of interrelated fables by Vishnu Sharma.
Eberhard Karls Universität Tübingen
Sanskrit
UD Sanskrit-Vedic – Homepage
The Treebank of Vedic Sanskrit contains 4,000 sentences with 27,000 words chosen from metrical and prose passages of the Ṛgveda (RV), the Śaunaka recension of the Atharvaveda (ŚS), the Maitrāyaṇīsaṃhitā (MS), and the Aitareya- (AB) and Śatapatha-Brāhmaṇas (ŚB). Lexical and morpho-syntactic information has been generated using tagging software and manually validated. POS tags have been induced automatically from the morpho-syntactic information of each word.
Eberhard Karls Universität Tübingen
Sanskrit
UD Scottish Gaelic-ARCOSG – Homepage
A treebank of Scottish Gaelic based on the [Annotated Reference Corpus Of Scottish Gaelic (ARCOSG)](https://github.com/Gaelic-Algorithmic-Research-Group/ARCOSG).
Eberhard Karls Universität Tübingen
Scottish Gaelic
UD Serbian-SET – Homepage
The Serbian UD treebank is based on the [SETimes-SR](http://hdl.handle.net/11356/1200) corpus and additional news documents from the Serbian web.
Eberhard Karls Universität Tübingen
Serbian
UD Sinhala-STB – Homepage
This treebank consists of contemporary written Sinhala text taken from a 10M corpus maintained by UCSC, Sri Lanka. The corpus contains novels, short stories, Sinhala translations, critiques and Sinhala newspapers.
Eberhard Karls Universität Tübingen
Sinhala
UD Skolt Sami-Giellagas – Homepage
The UD Skolt Sami Giellagas treebank is based almost entirely on spoken Skolt Sami corpora.
Eberhard Karls Universität Tübingen
Skolt Sami
UD Slovak-SNK – Homepage
The Slovak UD treebank is based on data originally annotated as part of the Slovak National Corpus, following the annotation style of the Prague Dependency Treebank.
Eberhard Karls Universität Tübingen
Slovak
UD Slovenian-SSJ – Homepage
The SSJ treebank is the reference UD treebank for Slovenian, consisting of approximately 13,000 sentences and 267,097 tokens from fiction, non-fiction, periodical and Wikipedia texts in standard modern Slovenian. As of UD release 2.10 in May 2022, the original version of the SSJ UD treebank has been partially manually revised and extended with new manually annotated data.
Eberhard Karls Universität Tübingen
Slovenian
UD Slovenian-SST – Homepage
The Spoken Slovenian Treebank (SST) is a manually annotated collection of transcribed audio recordings featuring spontaneous speech in various everyday situations. It includes 344 unique speech events (documents) amounting to approximately 10 hours of speech, encompassing a total of 6,104 utterances and 76,341 tokens.
Eberhard Karls Universität Tübingen
Slovenian
UD Soi-AHA – Homepage
The AHA Soi Treebank is a small treebank for contemporary Soi. Its corpus is collected and annotated manually. We have prepared this treebank based on interviews with Soi speakers.
Eberhard Karls Universität Tübingen
Soi
UD South Levantine Arabic-MADAR – Homepage
The South_Levantine_Arabic-MADAR treebank consists of 100 manually-annotated sentences taken from the [MADAR](https://camel.abudhabi.nyu.edu/madar/) (Multi-Arabic Dialect Applications and Resources) project. TO-DO: Add 20 annotated sentences from CCC as a train set.
Eberhard Karls Universität Tübingen
South Levantine Arabic
UD Spanish-AnCora – Homepage
Spanish data from the [AnCora](http://clic.ub.edu/corpus/) corpus.
Eberhard Karls Universität Tübingen
Spanish
UD Spanish-COSER – Homepage
The COSER UD Treebank (COSER-UD) is the first syntactically annotated corpus of spoken Spanish, based on a sample of the "Corpus Oral y Sonoro del Español Rural" (COSER; Fernández-Ordóñez 2005-present), meaning the "Audible Corpus of Spoken Rural Spanish".
Eberhard Karls Universität Tübingen
Spanish
UD Spanish-GSD – Homepage
The Spanish UD is converted from the content head version of the [universal dependency treebank v2.0 (legacy)](https://github.com/ryanmcd/uni-dep-tb).
Eberhard Karls Universität Tübingen
Spanish
UD Spanish-PUD – Homepage
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
Eberhard Karls Universität Tübingen
Spanish
UD Swedish Sign Language-SSLC – Homepage
The Universal Dependencies treebank for Swedish Sign Language (ISO 639-3: swl) is derived from the Swedish Sign Language Corpus (SSLC) from the department of linguistics, Stockholm University.
Eberhard Karls Universität Tübingen
Swedish Sign Language
UD Swedish-LinES – Homepage
UD Swedish_LinES is the Swedish half of the LinES Parallel Treebank with UD annotations. All segments are translations from English and the sources cover literary genres, online manuals and Europarl data.
Eberhard Karls Universität Tübingen
Swedish
UD Swedish-PUD – Homepage
Swedish-PUD is the Swedish part of the Parallel Universal Dependencies (PUD) treebanks.
Eberhard Karls Universität Tübingen
Swedish
UD Swedish-Talbanken – Homepage
The Swedish-Talbanken treebank is based on Talbanken, a treebank developed at Lund University in the 1970s.
Eberhard Karls Universität Tübingen
Swedish
UD Swiss German-UZH – Homepage
_UD\_Swiss\_German-UZH_ is a tiny manually annotated treebank of 100 sentences in different Swiss German dialects and a variety of text genres.
Eberhard Karls Universität Tübingen
Swiss German
UD Tagalog-TRG – Homepage
UD_Tagalog-TRG is a UD treebank manually annotated using sentences from a grammar book.
Eberhard Karls Universität Tübingen
Tagalog
UD Tagalog-Ugnayan – Homepage
Ugnayan is a manually annotated Tagalog treebank currently composed of educational fiction and nonfiction text. The treebank is under development at the University of the Philippines.
Eberhard Karls Universität Tübingen
Tagalog
UD Tamil-MWTT – Homepage
MWTT - Modern Written Tamil Treebank has sentences taken primarily from a text called "A Grammar of Modern Tamil" by Thomas Lehmann (1993). This initial release has 536 sentences of various lengths, all of which are included in the test set.
Eberhard Karls Universität Tübingen
Tamil
UD Tamil-TTB – Homepage
The UD Tamil treebank is based on the Tamil Dependency Treebank created at the Charles University in Prague by Loganathan Ramasamy.
Eberhard Karls Universität Tübingen
Tamil
UD Tatar-NMCTT – Homepage
UD Tatar-NMCTT is a manually annotated corpus of the Tatar language based on the text from Tatar-Inform (tatar-inform.tatar), an online news website.
Eberhard Karls Universität Tübingen
Tatar
UD Teko-TuDeT – Homepage
UD_Teko-TuDeT is a collection of annotated sentences in [Tekó (Emérillon)](https://glottolog.org/resource/languoid/id/emer1243). The sentences stem from the only grammatical description of the language (Rose, 2011). Sentence annotation and documentation by Uliana Vedenina and Fabrício Ferraz Gerardi.
Eberhard Karls Universität Tübingen
tkk
UD Telugu-MTG – Homepage
The Telugu UD treebank is created in UD based on manual annotations of sentences from a grammar book.
Eberhard Karls Universität Tübingen
Telugu
UD Thai-PUD – Homepage
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
Eberhard Karls Universität Tübingen
Thai
UD Tswana-Popapolelo – Homepage
UD Tswana-Popapolelo is a translation of the 20 Cairo CICLing sentences (https://github.com/UniversalDependencies/cairo) annotated with XPOS, UPOS and dependency relations.
Eberhard Karls Universität Tübingen
Tswana
UD Tupinamba-TuDeT – Homepage
UD_Tupinamba-TuDeT is a collection of annotated sentences in [Tupinambá](https://glottolog.org/resource/languoid/id/tupi1273). All known sources in this language are being annotated: catechisms, letters, poems, theater plays, and grammars (sixteenth and seventeenth century). Sentence annotation and documentation by [Fabrício Ferraz Gerardi](https://languagestructure.github.io).
Eberhard Karls Universität Tübingen
Tupinambá
UD Turkish-Atis – Homepage
This treebank is a translation of the English ATIS (Airline Travel Information System) corpus (see References). It consists of 5432 sentences.
Eberhard Karls Universität Tübingen
Turkish
UD Turkish-BOUN – Homepage
A Turkish dependency treebank annotated in UD style. Created by the members of [TABILAB](https://tabilab.cmpe.boun.edu.tr/) from Boğaziçi University.
Eberhard Karls Universität Tübingen
Turkish
UD Turkish-FrameNet – Homepage
Turkish FrameNet consists of 2,700 manually annotated example sentences and 19,221 tokens. Its data consists of the sentences taken from the Turkish FrameNet Project. The annotated sentences can be filtered according to the semantic frame category of the root of the sentence.
Eberhard Karls Universität Tübingen
Turkish
UD Turkish-GB – Homepage
This is a treebank annotating example sentences from a comprehensive grammar book of Turkish.
Eberhard Karls Universität Tübingen
Turkish
UD Turkish-IMST – Homepage
The UD Turkish Treebank, also called the IMST-UD Treebank, is a semi-automatic conversion of the IMST Treebank (Sulubacak & Eryiğit, 2018; Sulubacak et al., 2016).
Eberhard Karls Universität Tübingen
Turkish
UD Turkish-Kenet – Homepage
Turkish-Kenet UD Treebank is the biggest treebank of Turkish. It consists of 18,700 manually annotated sentences and 178,700 tokens. Its corpus consists of dictionary examples.
Eberhard Karls Universität Tübingen
Turkish
UD Turkish-Penn – Homepage
Turkish version of the Penn Treebank. It consists of a total of 9,560 manually annotated sentences and 87,367 tokens. (It only includes sentences up to 15 words long.)
Eberhard Karls Universität Tübingen
Turkish
UD Turkish-PUD – Homepage
This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
Eberhard Karls Universität Tübingen
Turkish
UD Turkish-Tourism – Homepage
Turkish Tourism is a domain-specific treebank consisting of 19,750 manually annotated sentences and 92,200 tokens. These sentences were taken from the original customer reviews of a tourism company.
Eberhard Karls Universität Tübingen
Turkish
UD Ukrainian-IU – Homepage
Gold standard Universal Dependencies corpus for Ukrainian, developed for UD originally, by [Institute for Ukrainian](https://mova.institute), NGO. [[українською](https://mova.institute/золотий_стандарт)]
Eberhard Karls Universität Tübingen
Ukrainian
UD Umbrian-IKUVINA – Homepage
UD_Umbrian-IKUVINA is a dependency treebank rendering of the Iguvine tablets ([Wikipedia](https://en.wikipedia.org/wiki/Iguvine_Tablets)). The seven bronze tablets describe religious ceremonies performed by the Umbrian people in Italy before the rise of the Roman empire. The corpus will eventually contain all the tablets, but as of May 2022 only tablet I has been released, with partial morphological analysis and partial lemmatisation. (POS tagging and dependency trees are complete.)
Eberhard Karls Universität Tübingen
Umbrian
UD Upper Sorbian-UFAL – Homepage
A small treebank of Upper Sorbian based mostly on Wikipedia.
Eberhard Karls Universität Tübingen
Upper Sorbian
UD Urdu-UDTB – Homepage
The Urdu Universal Dependency Treebank was automatically converted from Urdu Dependency Treebank (UDTB) which is part of an ongoing effort of creating multi-layered treebanks for Hindi and Urdu.
Eberhard Karls Universität Tübingen
Urdu
UD Uyghur-UDT – Homepage
The Uyghur UD treebank is based on the Uyghur Dependency Treebank (UDT), created at the Xinjiang University in Ürümqi, China.
Eberhard Karls Universität Tübingen
Uighur
UD Veps-VWT – Homepage
UD Veps-VWT is a manually annotated corpus of Veps that follows the Universal Dependencies annotation scheme. The data is collected from the [VepKar corpora](http://dictorpus.krc.karelia.ru/en/corpus/text) and consists mostly of modern news texts written in the Central Veps dialect.
Eberhard Karls Universität Tübingen
Veps
UD Vietnamese-TueCL – Homepage
This treebank includes a set of sentences from [OPUS](https://opus.nlpl.eu/), sourced from subtitles, talks, and educational videos.
Eberhard Karls Universität Tübingen
Vietnamese
UD Vietnamese-VTB – Homepage
The Vietnamese UD treebank is a conversion of the constituent treebank created in the VLSP project (https://vlsp.hpda.vn/).
Eberhard Karls Universität Tübingen
Vietnamese
UD Warlpiri-UFAL – Homepage
A small treebank of grammatical examples in Warlpiri, taken from linguistic literature.
Eberhard Karls Universität Tübingen
Warlpiri
UD Welsh-CCG – Homepage
UD Welsh-CCG (Corpws Cystrawennol y Gymraeg) is a treebank of Welsh, annotated according to the Universal Dependencies guidelines.
Eberhard Karls Universität Tübingen
Welsh
UD Western Armenian-ArmTDP – Homepage
A Universal Dependencies treebank for Western Armenian developed for UD originally by the ArmTDP team led by Marat M. Yavrumyan at the Yerevan State University.
Eberhard Karls Universität Tübingen
hyw
UD Western Sierra Puebla Nahuatl-ITML – Homepage
UD Western Sierra Puebla Nahuatl-IU is a treebank consisting of sentences from written fiction and non-fiction, spontaneous speech, and grammar examples.
Eberhard Karls Universität Tübingen
Western Huasteca Nahuatl
UD Wolof-WTB – Homepage
UD_Wolof-WTB is a natively, manually developed treebank for Wolof. Sentences were collected from encyclopedic, fictional, biographical and religious texts and from news.
Eberhard Karls Universität Tübingen
Wolof
UD Xavante-XDT – Homepage
UD_Xavante-XDT is a collection of annotated sentences in [Xavante](https://glottolog.org/resource/languoid/id/xava1240). Sentence annotation and documentation by [Fabrício Ferraz Gerardi](http://languagestructure.github.io/), Ivan Roksandic.
Eberhard Karls Universität Tübingen
Xavánte
UD Xibe-XDT – Homepage
The UD Xibe Treebank is a corpus of the Xibe language (ISO 639-3: *sjo*) containing syntactic trees manually annotated under the Universal Dependencies scheme. Sentences come from three sources: grammar book examples, a newspaper (Cabcal News) and Xibe textbooks.
Eberhard Karls Universität Tübingen
Xibe
UD Yakut-YKTDT – Homepage
UD_Yakut-YKTDT is a collection of Yakut ([Sakha](https://glottolog.org/resource/languoid/id/yaku1245)) sentences. The project is a work in progress and the treebank is being updated on a regular basis.
Eberhard Karls Universität Tübingen
Yakut
UD Yoruba-YTB – Homepage
Parts of the Yoruba Bible and of the Yoruba edition of Wikipedia, hand-annotated natively in Universal Dependencies.
Eberhard Karls Universität Tübingen
Yoruba
UD Yupik-SLI – Homepage
UD_Yupik-SLI is a treebank of St. Lawrence Island Yupik (ISO 639-3: ess) that has been manually annotated at the morpheme level, based on a finite-state morphological analyzer by [Chen et al., 2020](https://www.aclweb.org/anthology/2020.lrec-1.326). The word-level annotation, merging multiword expressions, is provided in not-to-release/ess_sli-ud-test.merged.conllu. More information about the treebank can be found in our publication (AmericasNLP, 2021).
Eberhard Karls Universität Tübingen
Central Yupik
Wikipedia-2019-Universal-Dependencies (formally:wiki2019) – Homepage
This treebank contains the Wikipedia part of the TüBa-D/DP. It contains part-of-speech, morphology, lemma, topological field and dependency annotations. Note that these dependency annotations are Universal Dependencies obtained with a neural-based parser (https://github.com/stickeritis/sticker) and that the treebank will be released as part of the TüBa-D/DP release 5. For more information, licensing and availability, please check the following link: https://sfb833-a3.github.io/tueba-ddp/
Eberhard Karls Universität Tübingen
German
Corpus Gysseling – Homepage
The Corpus Gysseling consists of the collection of all thirteenth-century texts that have served as source material for the Early Middle Dutch Dictionary. It is the digital edition, enriched with part of speech and lemma, of the thirteenth-century material from the Corpus of Middle Dutch texts (until the year 1300), issued in the period from 1977 to 1987 by the Ghent linguist Maurits Gysseling.
Instituut voor de Nederlandse Taal
Dutch
Corpus of Contemporary Dutch – Homepage
A corpus of modern Dutch, containing more than 800,000 texts taken from newspapers, books, magazines, news broadcasts and legal writings. The corpus is a combination of the 5, 27 and 38 Million Words Corpora and the PAROLE Corpus, supplemented with more recent material from the Netherlands, Flanders, Surinam and the Netherlands Antilles.
Instituut voor de Nederlandse Taal
Dutch
Letters as Loot – Homepage
This corpus contains letters from the 17th and 18th centuries, written by sailors, their relatives and acquaintances. They were taken as loot by privateers during one of the four wars between Great Britain and the Dutch Republic in the 17th and 18th centuries and ended up in the National Archives in Kew. This corpus contains a selection of these letters and was built in the Letters as Loot project of Prof. dr. M. van der Wal.
Instituut voor de Nederlandse Taal
Dutch
Nederlab (Meertens Institute) – Homepage
From the earliest Middle Dutch to Dutch from the twenty-first century. In the Nederlab research portal, millions of old and new texts have been made searchable for linguists, literary experts, historians and cultural scientists for the first time in one place. 41 million text documents, 18 billion words: newspapers, novels, biblical texts, diary fragments, correspondence, charters, prayers and more, from the collections of scientific and heritage institutions. Advanced search and analysis options, statistics and visualisations. The texts are linguistically annotated and can be searched, among other things, by lemma and part of speech.
Instituut voor de Nederlandse Taal
Dutch
OpenSoNaR – Homepage
The over 500 million word Dutch reference corpus SoNaR developed within the STEVIN programme under the aegis of the Dutch Language Union.
Instituut voor de Nederlandse Taal
Dutch
FOLK: Public Interaction – Homepage
Excerpts of public conversations, debates and discussions from the “Forschungs- und Lernkorpus für gesprochenes Deutsch” (FOLK) in the Database of Spoken German (Datenbank Gesprochenes Deutsch) of the Institute for the German Language (Institut für Deutsche Sprache).
Leibniz-Institut für Deutsche Sprache
German
German Wikipedia Articles 2017 – Homepage
A collection of articles of German Wikipedia from July 1st, 2017.
Leibniz-Institut für Deutsche Sprache
German
German Wikipedia talk corpus 2017 – Homepage
A collection of talk pages of German Wikipedia from July 1st, 2017.
Leibniz-Institut für Deutsche Sprache
German
German Wikipedia user talk corpus 2017 – Homepage
A collection of user talk pages of German Wikipedia from July 1st, 2017.
Leibniz-Institut für Deutsche Sprache
German
Goethe Korpus – Homepage
The Goethe corpus of IDS Mannheim.
Leibniz-Institut für Deutsche Sprache
German
TextGrid Digital Library (Literature) – Homepage
The literature folder from TextGrid's digital library. (CLARIN-FCS enabled search engine hosted at IDS Mannheim)
Leibniz-Institut für Deutsche Sprache
German
CINTIL - International Corpus of Portuguese – Homepage
CINTIL - Corpus Internacional do Português is a linguistically interpreted corpus of Portuguese.
PORTULAN CLARIN Research Infrastructure for the Science and Technology of Language
Portuguese
DK-CLARIN LSP Corpus - Agriculture domain – Homepage
Texts in the Agriculture domain come from Danmarks JordbrugsForskning.
The CLARIN Centre at the University of Copenhagen
Danish
DK-CLARIN LSP Corpus - Construction domain – Homepage
Construction Domain Corpus has been collected from Statens Byggeforskningsinstitut, Erhvervs- og byggestyrelsen and Murerfagets Oplysningsråd as part of the DK-CLARIN project.
The CLARIN Centre at the University of Copenhagen
Danish
DK-CLARIN LSP Corpus - Economics domain – Homepage
Texts in the Economics domain come from SKAT, Finanstilsynet and Erhvervs- og Selskabsstyrelsen and have been collected in the DK-CLARIN project.
The CLARIN Centre at the University of Copenhagen
Danish
DK-CLARIN LSP Corpus - Environment domain – Homepage
Texts in the Environment Domain come from Hovedland, Danske Miljøundersøgelser, Det Økologiske Råd and Aktuel Naturvidenskab (via DMI).
The CLARIN Centre at the University of Copenhagen
Danish
DK-CLARIN LSP Corpus - Health domain – Homepage
Texts in the Health and Medicine Domain come from netpatient.dk, Søfartsstyrelsen, Sundhedsstyrelsen, regionH, Libris, Aktuel Naturvidenskab and have been collected in the DK-CLARIN project.
The CLARIN Centre at the University of Copenhagen
Danish
DK-CLARIN LSP Corpus - IT domain – Homepage
Texts in the IT Domain come from Libris, Open Office, Aktuel Naturvidenskab and have been collected in the DK-CLARIN project.
The CLARIN Centre at the University of Copenhagen
Danish
DK-CLARIN LSP Corpus - Nanotechnology domain – Homepage
Texts in the Nanotechnology domain come from iNano (Interdisciplinary Nanoscience Center, AU), Nano (DTU), Niels Bohr Institutet, Forskningscenter Risø, Ministeriet for Sundhed og Forebyggelse (via DTU), Miljøstyrelsen, Aktuel Naturvidenskab and have been collected in the DK-CLARIN project.
The CLARIN Centre at the University of Copenhagen
Danish
The ILC4CLARIN corpora – Homepage
Optional description not given for resource The ILC4CLARIN corpora
The ILC4CLARIN Centre at the Institute for Computational Linguistics
Italian
CLMET 3.1 – Homepage
CLMET3.1 is a principled collection of public domain texts drawn from various online archiving projects.
Universität des Saarlandes
English
GRUG Treebank German – Homepage
GRUG - Georgian, Russian, Ukrainian, German parallel treebank
Universität des Saarlandes
German
PolDiLemma – Homepage
PolDiLemma - Middle Polish Diachrone Lemmatised Corpus
Universität des Saarlandes
Polish
Royal Society Corpus Version 2.0 – Homepage
The Royal Society Corpus (RSC) is based on the first two centuries of the Philosophical Transactions of the Royal Society of London from its beginning in 1665 to 1869. It includes all publications of the journal written mainly in English and containing running text.
Universität des Saarlandes
English
The Old Bailey Corpus – Homepage
The Old Bailey Corpus is based on the Proceedings of the Old Bailey and documents spoken English in the courtroom from 1720 to 1913.
Universität des Saarlandes
English
Unknown Institution, FCS v2.0
Swedish