This is a brief description of how POS tagging was evaluated (relevant for our model descriptions).

There are three measures: **exact match** (both POS and MSD are correct), **POS** (POS is correct, MSD does not matter), and **MSD**. The last measure is slightly more complicated: it is the Jaccard similarity index, i.e. the number of correctly identified category values divided by the total number of distinct categories present in the gold and test tags (the size of the union of the gold and test category sets). The metric does not depend on whether POS is correctly identified. Examples:
• (NN)NEU.-.-.SMS vs. (NN)UTR.-.-.SMS = 0.75 (four categories, three correct)
• (VB)PRS.SFO vs. (VB)INF.SFO = 0.33 (three categories: verb form (INF), tense (PRS), voice (SFO); one correct)
• (JJ)POS.UTR.SIN.IND.NOM vs. (PC)PRF.UTR.SIN.IND.NOM = 0.67 (six categories, four correct)
• (JJ)POS.UTR+NEU.PLU.IND+DEF.NOM vs. (NN)UTR.SIN.IND.NOM = 0.2 (five categories, one correct)
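The three measures can be sketched in a few lines of Python. This is only an illustration of the description above, not the evaluation script itself. It assumes that each MSD has already been parsed into a mapping from category name to value; the category names used here (`tense`, `voice`, `verb_form`, and so on) and the function names `msd_score` and `score_token` are hypothetical. Combined values such as `UTR+NEU` are compared as whole strings, which matches the last example above.

```python
def msd_score(gold_msd, test_msd):
    """Jaccard-style MSD score for one token.

    Both arguments are dicts mapping a category name to its value,
    e.g. {"tense": "PRS", "voice": "SFO"}. The category names are
    hypothetical; the source does not define them.
    """
    # Union of the categories present in either tag.
    categories = set(gold_msd) | set(test_msd)
    if not categories:
        # Assumption: two empty MSDs count as a perfect match.
        # The description above does not cover this case.
        return 1.0
    # A category is correct only if both tags have it with the same value.
    correct = sum(
        1
        for cat in categories
        if cat in gold_msd and cat in test_msd and gold_msd[cat] == test_msd[cat]
    )
    return correct / len(categories)


def score_token(gold_pos, gold_msd, test_pos, test_msd):
    """Return (exact match, POS match, MSD score) for one token."""
    pos_ok = gold_pos == test_pos
    exact = pos_ok and gold_msd == test_msd
    # The MSD score is computed regardless of whether POS is correct.
    return exact, pos_ok, msd_score(gold_msd, test_msd)


# (VB)PRS.SFO vs. (VB)INF.SFO
gold = {"tense": "PRS", "voice": "SFO"}
test = {"verb_form": "INF", "voice": "SFO"}
exact, pos_ok, msd = score_token("VB", gold, "VB", test)
print(exact, pos_ok, round(msd, 2))  # False True 0.33
```

Representing a tag as a dict keyed by category makes the union in the denominator explicit: a category that appears in only one of the two tags still counts toward the total but can never count as correct.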