                              MAÞiR Ord
                             MAÞiR Words

                         v1.0, December 2020

                    Gerlof Bouma and Yvonne Adesam
                           Språkbanken Text
                           Dept of Swedish
                       University of Gothenburg


CONTENTS

* License
* Archive contents
* Acknowledgements


LICENSE

The contents of this archive – MAÞiR Ord / MAÞiR Words – are released
under a Creative Commons Attribution 4.0 International license (CC BY
4.0), by Bouma/Adesam, Språkbanken Text, University of Gothenburg.

The two titles are intended for use in Swedish- / English-language
contexts, respectively. An acceptable alternative rendering of MAÞiR is
MATHiR.


ARCHIVE CONTENTS

The files contain information extracted from digitized versions of
Söderwall's dictionary

  Knut Fredrik Söderwall, 1891–1918, Ordbok öfver svenska
  medeltids-språket, 3 vols. Svenska fornskriftsällskapet, Series 1,
  Svenska skrifter 27,

henceforth Sdw.

All files in this archive are headerless tab-separated-values (.tsv)
files. Note that the values (fields) may contain spaces. The files
fall into two groups, one with 4 columns of information and one
with 5.

* lemma-psdlemma-sdwposc-sdwposf_Adj.tsv
* lemma-psdlemma-sdwposc-sdwposf_Adv.tsv
* lemma-psdlemma-sdwposc-sdwposf_N.tsv
* lemma-psdlemma-sdwposc-sdwposf_V.tsv
* lemma-psdlemma-sdwposc-sdwposf_ninfl.tsv

These files contain lemmata from Söderwall's dictionary, and consist
of the following 4 columns of information:

  1. a lemma, taken from Sdw;

     In general, lemmata may contain spaces, just like in Sdw. In
     _V.tsv, lemmata may also contain a plus sign, '+' – not present
     in Sdw – to separate a verb from its separable particle. Lemmata
     for compounds, which are placed in sub-entries in Sdw, are given
     here as separate entries. They are not distinguished from main
     entry lemmata.

  2. a 'pseudolemma';

     In _Adj.tsv, _Adv.tsv, and _N.tsv this is just the lemma
     itself. In _V.tsv, if the lemma contains a separable particle,
     this field will contain only the (verbal) head. In _ninfl.tsv,
     this field may contain an inflected form, normalized to mimic
     Sdw's lemma-orthographic principles – these were added by us and
     not present in Sdw.

     We use this as a way to add full form information for certain
     lexemes, which may make lemmatization of irregularly inflected
     words an easier task. In total, we have added about 400 of these
     (normalized) inflected forms.

  3. a coarse-grained part-of-speech indication;

     These are groupings of Sdw's indications, and they are as follows:

       adj - adjective
       adv - adverb
       art - article
       f - feminine noun
       infinitivmärke - infinitival marker
       interj - interjection
       konj - conjunction
       m - masculine noun
       n - neuter noun
       prep - preposition
       pron - pronoun
       räkn - numeral
       v - verb
       x - (unspecified)

  4. a fine-grained part-of-speech indication.

     We refer to Sdw for an explanation of these.

     (The division into files _Adj.tsv, _Adv.tsv, etc., is only
     loosely related to Sdw's parts of speech, and should not be
     considered meaningful at this stage. As explained above, it is
     however relevant for the contents of the pseudolemma column.)
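
As an illustration of the 4-column layout described above, here is a
minimal Python sketch for reading such a file. It is not part of the
archive. The names Entry, read_entries and particles are hypothetical,
the sample rows are invented for the example (they are not copied from
the archive, and 'v.' is only a placeholder for a fine-grained
indication), and UTF-8 is an assumed file encoding.

```python
import csv
import io
from typing import Iterator, List, NamedTuple


class Entry(NamedTuple):
    """One row of a 4-column file (field names are our own choice)."""
    lemma: str
    pseudolemma: str
    pos_coarse: str
    pos_fine: str


def read_entries(stream) -> Iterator[Entry]:
    """Yield entries from a headerless TSV stream.

    Only tabs separate fields, so spaces inside a value are kept.
    QUOTE_NONE stops the csv module from treating quote characters
    in the data as field delimiters.
    """
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:  # skip empty lines
            continue
        if len(row) != 4:
            raise ValueError(f"expected 4 fields, got {len(row)}: {row!r}")
        yield Entry(*row)


def particles(entry: Entry) -> List[str]:
    """Return the '+'-separated parts of the lemma other than the head.

    The pseudolemma holds the verbal head, so no assumption is made
    about whether the particle precedes or follows the verb.
    """
    parts = [part.strip() for part in entry.lemma.split("+")]
    return [part for part in parts if part != entry.pseudolemma]


# Invented sample rows, for illustration only.
SAMPLE = "taka\ttaka\tv\tv.\ntaka+up\ttaka\tv\tv.\n"

entries = list(read_entries(io.StringIO(SAMPLE)))
for entry in entries:
    print(entry.lemma, entry.pseudolemma, particles(entry))
```

To read a real file, the stream could be opened with something like
open(path, encoding="utf-8", newline=""), where the encoding is again
an assumption to be checked against the files.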


* lemma-psdlemma-prnform-sdwposc-sdwposf_Adj.tsv
* lemma-psdlemma-prnform-sdwposc-sdwposf_Adv.tsv
* lemma-psdlemma-prnform-sdwposc-sdwposf_N.tsv
* lemma-psdlemma-prnform-sdwposc-sdwposf_V.tsv
* lemma-psdlemma-prnform-sdwposc-sdwposf_ninfl.tsv

A collection of lemmata with (almost) attested forms, as supplied in
Sdw. The columns are:

  1. lemma, as (1) above;

  2. pseudolemma, as (2) above;

  3. a parenthetical form;

     These are attested forms, or slightly generalized such, as
     supplied in the so-called 'form parentheses' in Sdw – hence the
     label 'prnform' in the file names. These forms show inflectional
     as well as spelling-related variation.

  4. coarse-grained part-of-speech, as (3) above;

  5. fine-grained part-of-speech, as (4) above.

Together, the files contain about 24k parenthetical forms for about
8.5k different lemmata.
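
One possible use of the 5-column files is a lookup table from
parenthetical forms to candidate lemmata. The Python sketch below
shows the idea under stated assumptions: build_form_index is a
hypothetical name, the sample rows are invented for the example and
not copied from the archive, and matching is by exact string only, so
it does not account for the forms being partly generalized.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Set, Tuple


def build_form_index(stream) -> Dict[str, Set[Tuple[str, str]]]:
    """Map each parenthetical form to a set of (lemma, coarse POS) pairs.

    Column order follows the description above: lemma, pseudolemma,
    parenthetical form, coarse-grained POS, fine-grained POS.
    """
    index: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:  # skip empty lines
            continue
        if len(row) != 5:
            raise ValueError(f"expected 5 fields, got {len(row)}: {row!r}")
        lemma, _pseudolemma, form, pos_coarse, _pos_fine = row
        index[form].add((lemma, pos_coarse))
    return index


# Invented sample rows, for illustration only.
SAMPLE = (
    "hus\thus\thws\tn\tn.\n"
    "hus\thus\thuus\tn\tn.\n"
)

index = build_form_index(io.StringIO(SAMPLE))
print(sorted(index))             # all forms seen
print(index.get("hws", set()))   # candidate lemmata for one form
```

A form may map to several lemmata, which is why each value is a set
rather than a single entry.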


ACKNOWLEDGEMENTS

These lexical resources were compiled as part of the project

  MAÞiR: Methods for the automatic Analysis of Text in digital
  Historical Resources (Marcus and Amalia Wallenberg Foundation;
  no. 2012.0146, to Gerlof Bouma and Yvonne Adesam),

with support from Språkbanken Text, Dept of Swedish, University of
Gothenburg. Språkbanken Text is part of

  Nationella Språkbanken (funded by 10 partner institutions and
  the Swedish Research Council; no. 2017-00626).
