Supported languages
The following languages are available using their ISO 639-1 code, with three-letter codes for Middle English (enm) and Serbo-Croatian (hbs):
Available languages (2022-09-05)
Code | Language | Forms (10³) | Acc. | Comments |
---|---|---|---|---|
bg | Bulgarian | 213 | | |
ca | Catalan | 579 | | |
cs | Czech | 187 | 0.88 | on UD CS-PDT |
cy | Welsh | 360 | | |
da | Danish | 554 | 0.92 | on UD DA-DDT, alternative: lemmy |
de | German | 682 | 0.95 | on UD DE-GSD, see also German-NLP list |
el | Greek | 183 | 0.88 | on UD EL-GDT |
en | English | 136 | 0.94 | on UD EN-GUM, alternative: LemmInflect |
enm | Middle English | 38 | | |
es | Spanish | 720 | 0.94 | on UD ES-GSD |
et | Estonian | 133 | | low coverage |
fa | Persian | 10 | | experimental |
fi | Finnish | 2,106 | | evaluation and alternatives: see this benchmark |
fr | French | 217 | 0.94 | on UD FR-GSD |
ga | Irish | 383 | | |
gd | Gaelic | 48 | | |
gl | Galician | 384 | | |
gv | Manx | 62 | | |
hbs | Serbo-Croatian | 838 | | Croatian and Serbian lists to be added later |
hi | Hindi | 58 | | experimental |
hu | Hungarian | 458 | | |
hy | Armenian | 323 | | |
id | Indonesian | 17 | 0.91 | on UD ID-CSUI |
is | Icelandic | 175 | | |
it | Italian | 333 | 0.93 | on UD IT-ISDT |
ka | Georgian | 65 | | |
la | Latin | 850 | | |
lb | Luxembourgish | 305 | | |
lt | Lithuanian | 247 | | |
lv | Latvian | 168 | | |
mk | Macedonian | 57 | | |
ms | Malay | 14 | | |
nb | Norwegian (Bokmål) | 617 | | |
nl | Dutch | 254 | 0.91 | on UD-NL-Alpino |
nn | Norwegian (Nynorsk) | | | |
pl | Polish | 3,733 | 0.91 | on UD-PL-PDB |
pt | Portuguese | 933 | 0.92 | on UD-PT-GSD |
ro | Romanian | 311 | | |
ru | Russian | 607 | | alternative: pymorphy2 |
se | Northern Sámi | 113 | | experimental |
sk | Slovak | 846 | 0.92 | on UD SK-SNK |
sl | Slovene | 136 | | |
sq | Albanian | 35 | | |
sv | Swedish | 658 | | alternative: lemmy |
sw | Swahili | 10 | | experimental |
tl | Tagalog | 33 | | experimental |
tr | Turkish | 1,333 | 0.88 | on UD-TR-Boun |
uk | Ukrainian | 190 | | alternative: pymorphy2 |
A "low coverage" mention means that a language-specific library would probably be a better choice, although simplemma will still work to a limited extent. Open-source alternatives for Python are referenced where possible.
An "experimental" mention indicates that the language remains untested or that there could be issues with the underlying data or the lemmatization process.
The scores are calculated on Universal Dependencies treebanks on single-word tokens (including some contractions but not merged prepositions). They describe to what extent simplemma can accurately map tokens to their lemma form. They can be reproduced by concatenating all available UD files and running the script udscore.py in the training/ folder.
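To make the accuracy figure more concrete, here is a minimal sketch of how such a score could be computed from CoNLL-U data. It is not the udscore.py script: the function names, the toy lemmatizer, and the filtering rule (skipping multiword token ranges and empty nodes) are assumptions made for illustration, and the actual script may select tokens differently.

```python
from typing import Callable, Iterable, Iterator, Tuple

# Tiny hand-made CoNLL-U sample (illustrative, not taken from a real treebank).
SAMPLE_CONLLU = """\
# text = The dogs didn't bark.
1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_
2\tdogs\tdog\tNOUN\t_\t_\t5\tnsubj\t_\t_
3-4\tdidn't\t_\t_\t_\t_\t_\t_\t_\t_
3\tdid\tdo\tAUX\t_\t_\t5\taux\t_\t_
4\tn't\tnot\tPART\t_\t_\t5\tadvmod\t_\t_
5\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_
6\t.\t.\tPUNCT\t_\t_\t5\tpunct\t_\t_
"""


def read_pairs(conllu_text: str) -> Iterator[Tuple[str, str]]:
    """Yield (form, gold lemma) pairs for single-word lines only."""
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank separator or sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        token_id, form, lemma = cols[0], cols[1], cols[2]
        # Assumption: skip multiword ranges ("3-4") and empty nodes ("5.1").
        if "-" in token_id or "." in token_id:
            continue
        yield form, lemma


def accuracy(pairs: Iterable[Tuple[str, str]],
             lemmatize: Callable[[str], str]) -> float:
    """Share of tokens whose predicted lemma equals the gold lemma."""
    total = correct = 0
    for form, gold in pairs:
        total += 1
        if lemmatize(form) == gold:
            correct += 1
    return correct / total if total else 0.0


def toy_lemmatize(token: str) -> str:
    """Hypothetical stand-in for a real lemmatizer."""
    table = {"The": "the", "dogs": "dog", "did": "do"}
    return table.get(token, token)


score = accuracy(read_pairs(SAMPLE_CONLLU), toy_lemmatize)
print(f"{score:.2f}")  # 5 of the 6 scored words match
```

In practice, `toy_lemmatize` would be replaced by a call into the lemmatizer under test, and the sample string by the concatenated treebank files.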
This library is particularly relevant for lemmatizing less frequent words. Its performance in this case is only incidentally captured by the benchmark above. In some languages, a fixed number of words such as pronouns can be further mapped by hand to enhance performance.
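The idea of mapping a small, closed set of words by hand can be sketched as a lookup that takes precedence over a general lemmatizer. This is only an illustration of the principle and says nothing about how simplemma stores or applies its own data: `MANUAL_OVERRIDES`, `with_overrides` and `strip_plural` are hypothetical names, and the pronoun mappings follow one possible lemma convention.

```python
from typing import Callable, Dict

# Hypothetical hand-written mappings for a closed word class (pronouns).
MANUAL_OVERRIDES: Dict[str, str] = {
    "me": "I",
    "him": "he",
    "them": "they",
}


def with_overrides(lemmatize: Callable[[str], str],
                   overrides: Dict[str, str]) -> Callable[[str], str]:
    """Wrap a lemmatizer so that hand-mapped words are resolved first."""
    def wrapped(token: str) -> str:
        key = token.lower()
        if key in overrides:
            return overrides[key]
        return lemmatize(token)
    return wrapped


def strip_plural(token: str) -> str:
    """Hypothetical, deliberately naive base lemmatizer."""
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


lemmatize = with_overrides(strip_plural, MANUAL_OVERRIDES)
print(lemmatize("Them"), lemmatize("cats"))
```

Because the override table is consulted before the general rule, the few irregular forms it lists are handled exactly, while every other token still goes through the base lemmatizer.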