Supported languages
The following languages are available, identified by their BCP 47 language tag, which typically corresponds to the ISO 639-1 code. If no such code exists, an ISO 639-3 code is used instead.
Available languages (2026-05-29):
The Forms column counts the inflected word forms stored in the dictionary, while Lemmata counts the distinct base forms they map to (both in thousands). A large gap between the two reflects rich morphology rather than a data error.
| Code | Language | Forms (10³) | Lemmata (10³) | Acc. | Comments |
|---|---|---|---|---|---|
| ast | Asturian | 154 | 36 | | |
| bg | Bulgarian | 215 | 18 | | |
| ca | Catalan | 640 | 63 | | |
| cs | Czech | 200 | 26 | 0.89 | on UD CS-PDT |
| cy | Welsh | 363 | 14 | | |
| da | Danish | 555 | 81 | 0.92 | on UD DA-DDT, alternative: lemmy |
| de | German | 730 | 246 | 0.95 | on UD DE-GSD, see also German-NLP list |
| el | Greek | 185 | 21 | 0.88 | on UD EL-GDT |
| en | English | 139 | 50 | 0.94 | on UD EN-GUM, alternative: LemmInflect |
| enm | Middle English | 43 | 6 | | |
| eo | Esperanto | 191 | 18 | | |
| es | Spanish | 666 | 72 | 0.95 | on UD ES-GSD |
| et | Estonian | 141 | 34 | | low coverage |
| fa | Persian | 13 | 4 | | experimental |
| fi | Finnish | 3,549 | 124 | | see this benchmark |
| fr | French | 248 | 37 | 0.94 | on UD FR-GSD |
| ga | Irish | 399 | 46 | | |
| gd | Gaelic | 59 | 12 | | |
| gl | Galician | 426 | 43 | | |
| gv | Manx | 76 | 13 | | |
| hbs | Serbo-Croatian | 674 | 52 | | Croatian and Serbian lists to be added later |
| hi | Hindi | 58 | 11 | | experimental |
| hu | Hungarian | 492 | 36 | | |
| hy | Armenian | 247 | 7 | | |
| id | Indonesian | 21 | 4 | 0.91 | on UD ID-CSUI |
| is | Icelandic | 177 | 15 | | |
| it | Italian | 357 | 28 | 0.93 | on UD IT-ISDT |
| ka | Georgian | 66 | 4 | | |
| la | Latin | 892 | 52 | | |
| lb | Luxembourgish | 306 | 79 | | |
| lt | Lithuanian | 268 | 25 | | |
| lv | Latvian | 166 | 14 | | |
| mk | Macedonian | 67 | 16 | | |
| ms | Malay | 18 | 4 | | |
| nb | Norwegian (Bokmål) | 618 | 134 | | |
| nl | Dutch | 366 | 124 | 0.92 | on UD NL-Alpino |
| nn | Norwegian (Nynorsk) | 68 | 18 | | |
| pl | Polish | 3,670 | 264 | 0.91 | on UD PL-PDB |
| pt | Portuguese | 924 | 94 | 0.92 | on UD PT-GSD |
| ro | Romanian | 342 | 36 | | |
| ru | Russian | 633 | 54 | | alternative: pymorphy2 |
| se | Northern Sámi | 115 | 7 | | |
| sk | Slovak | 889 | 71 | 0.92 | on UD SK-SNK |
| sl | Slovene | 165 | 30 | | |
| sq | Albanian | 38 | 5 | | |
| sv | Swedish | 745 | 93 | | alternative: lemmy |
| sw | Swahili | 4,870 | 4 | | experimental |
| tl | Tagalog | 39 | 8 | | experimental |
| tr | Turkish | 1,236 | 40 | 0.89 | on UD TR-Boun |
| uk | Ukrainian | 388 | 22 | | alternative: pymorphy2 |
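
To make the relationship between the Forms and Lemmata columns concrete, the short sketch below computes the average number of stored forms per lemma for a handful of rows copied from the table above. The names `ROWS` and `forms_per_lemma` are illustrative only and are not part of Simplemma.

```python
# Forms and lemmata (both in thousands), copied from a few rows of the table.
ROWS = {
    "en": (139, 50),
    "de": (730, 246),
    "fi": (3549, 124),
    "pl": (3670, 264),
    "tr": (1236, 40),
}


def forms_per_lemma(forms_k: float, lemmata_k: float) -> float:
    """Average number of stored word forms per lemma (hypothetical helper)."""
    return forms_k / lemmata_k


# Round to one decimal place for readability.
ratios = {code: round(forms_per_lemma(*counts), 1) for code, counts in ROWS.items()}

# Print from the lowest to the highest ratio.
for code, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{code}: {ratio} forms per lemma")
```

On these rows, English and German come out at roughly 3 forms per lemma, while Finnish and Turkish are close to 30, which is the kind of gap the note on rich morphology refers to.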
Languages marked as having low coverage may be better served by language-specific libraries, but Simplemma can still provide limited functionality. Where possible, open-source Python alternatives are referenced.
The "experimental" label indicates that the language remains untested or that there could be issues with the underlying data or the lemmatization process.
The scores are calculated on Universal Dependencies treebanks, using single-word tokens (including some contractions but not merged prepositions). They describe to what extent Simplemma can accurately map tokens to their lemma form. See the training/ folder of the code repository for more information.
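
As a rough illustration of what such a score measures, the sketch below computes token-level accuracy as the share of tokens whose predicted lemma equals the gold lemma. It is not the evaluation script from the training/ folder: the toy lookup table, the gold pairs, and the `lemmatize` and `accuracy` helpers are all assumptions made for the example.

```python
def lemmatize(token: str, table: dict[str, str]) -> str:
    """Toy lemmatizer: look the lowercased token up, else return it unchanged."""
    return table.get(token.lower(), token)


def accuracy(pairs: list[tuple[str, str]], table: dict[str, str]) -> float:
    """Share of (token, gold_lemma) pairs for which the prediction matches."""
    correct = sum(lemmatize(token, table) == gold for token, gold in pairs)
    return correct / len(pairs)


# Hypothetical form-to-lemma dictionary and gold annotations.
TOY_TABLE = {"masks": "mask", "went": "go", "better": "good"}
GOLD = [
    ("masks", "mask"),
    ("went", "go"),
    ("better", "well"),  # adverbial use: the dictionary answer "good" is wrong
    ("cat", "cat"),      # unknown token, returned unchanged
]

score = accuracy(GOLD, TOY_TABLE)
print(f"accuracy: {score:.2f}")
```

Three of the four toy tokens are mapped correctly, so the score is 0.75. The "better" case shows why a context-free lookup cannot reach a perfect score on ambiguous forms.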
This library is particularly relevant for the lemmatization of less frequent words. Its performance in this case is only incidentally captured by the benchmark above. In some languages, a fixed set of words such as pronouns can additionally be mapped by hand to enhance performance.