
Supported languages

The following languages are available, identified by their BCP 47 language tag, which typically corresponds to the ISO 639-1 code. If no such code exists, an ISO 639-3 code is used instead.

Available languages (2026-05-29):

The Forms column counts the inflected word forms stored in the dictionary, while Lemmata counts the distinct base forms they map to (both in thousands). A large gap between the two reflects rich morphology rather than a data error.

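| Code | Language | Forms (10³) | Lemmata (10³) | Acc. | Comments |
|------|----------|-------------|---------------|------|----------|
| ast | Asturian | 154 | 36 | | |
| bg | Bulgarian | 215 | 18 | | |
| ca | Catalan | 640 | 63 | | |
| cs | Czech | 200 | 26 | 0.89 | on UD CS-PDT |
| cy | Welsh | 363 | 14 | | |
| da | Danish | 555 | 81 | 0.92 | on UD DA-DDT, alternative: lemmy |
| de | German | 730 | 246 | 0.95 | on UD DE-GSD, see also German-NLP list |
| el | Greek | 185 | 21 | 0.88 | on UD EL-GDT |
| en | English | 139 | 50 | 0.94 | on UD EN-GUM, alternative: LemmInflect |
| enm | Middle English | 43 | 6 | | |
| eo | Esperanto | 191 | 18 | | |
| es | Spanish | 666 | 72 | 0.95 | on UD ES-GSD |
| et | Estonian | 141 | 34 | | low coverage |
| fa | Persian | 13 | 4 | | experimental |
| fi | Finnish | 3,549 | 124 | | see this benchmark |
| fr | French | 248 | 37 | 0.94 | on UD FR-GSD |
| ga | Irish | 399 | 46 | | |
| gd | Gaelic | 59 | 12 | | |
| gl | Galician | 426 | 43 | | |
| gv | Manx | 76 | 13 | | |
| hbs | Serbo-Croatian | 674 | 52 | | Croatian and Serbian lists to be added later |
| hi | Hindi | 58 | 11 | | experimental |
| hu | Hungarian | 492 | 36 | | |
| hy | Armenian | 247 | 7 | | |
| id | Indonesian | 21 | 4 | 0.91 | on UD ID-CSUI |
| is | Icelandic | 177 | 15 | | |
| it | Italian | 357 | 28 | 0.93 | on UD IT-ISDT |
| ka | Georgian | 66 | 4 | | |
| la | Latin | 892 | 52 | | |
| lb | Luxembourgish | 306 | 79 | | |
| lt | Lithuanian | 268 | 25 | | |
| lv | Latvian | 166 | 14 | | |
| mk | Macedonian | 67 | 16 | | |
| ms | Malay | 18 | 4 | | |
| nb | Norwegian (Bokmål) | 618 | 134 | | |
| nl | Dutch | 366 | 124 | 0.92 | on UD NL-Alpino |
| nn | Norwegian (Nynorsk) | 68 | 18 | | |
| pl | Polish | 3,670 | 264 | 0.91 | on UD PL-PDB |
| pt | Portuguese | 924 | 94 | 0.92 | on UD PT-GSD |
| ro | Romanian | 342 | 36 | | |
| ru | Russian | 633 | 54 | | alternative: pymorphy2 |
| se | Northern Sámi | 115 | 7 | | |
| sk | Slovak | 889 | 71 | 0.92 | on UD SK-SNK |
| sl | Slovene | 165 | 30 | | |
| sq | Albanian | 38 | 5 | | |
| sv | Swedish | 745 | 93 | | alternative: lemmy |
| sw | Swahili | 4,870 | 4 | | experimental |
| tl | Tagalog | 39 | 8 | | experimental |
| tr | Turkish | 1,236 | 40 | 0.89 | on UD TR-Boun |
| uk | Ukrainian | 388 | 22 | | alternative: pymorphy2 |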
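
The gap between the Forms and Lemmata columns can be made concrete by dividing one by the other. The following is a minimal sketch that reuses a few rows from the table above; the `ROWS` dictionary and the `forms_per_lemma` helper are illustrative names and not part of Simplemma.

```python
# Figures copied from the table above; both columns are in thousands.
ROWS = {
    "en": (139, 50),
    "de": (730, 246),
    "fi": (3549, 124),
    "pl": (3670, 264),
}


def forms_per_lemma(forms_k: float, lemmata_k: float) -> float:
    """Return the average number of stored word forms per lemma."""
    if lemmata_k <= 0:
        raise ValueError("lemmata_k must be positive")
    return forms_k / lemmata_k


# Print one ratio per language, ordered by language code.
for code, (forms_k, lemmata_k) in sorted(ROWS.items()):
    ratio = forms_per_lemma(forms_k, lemmata_k)
    print(f"{code}: {ratio:.1f} forms per lemma")
```

A high ratio, as for Finnish or Polish, points to a morphologically rich language rather than to a problem in the data.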

Languages marked as low coverage may be better served by language-specific libraries, although Simplemma can still provide limited functionality for them. Where possible, open-source Python alternatives are referenced.

The experimental label indicates that the language remains untested or that there may be issues with the underlying data or the lemmatization process.

The scores are calculated on Universal Dependencies treebanks, on single-word tokens (including some contractions but not merged prepositions). They describe to what extent Simplemma can accurately map tokens to their lemma form. See the training/ folder of the code repository for more information.
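
As an illustration of what such a score measures, the sketch below computes token-level accuracy as the share of tokens whose predicted lemma equals the gold lemma. It is not the evaluation script from the training/ folder: the function names, the toy dictionary, and the exact-match, case-sensitive comparison are all assumptions made for this example.

```python
from typing import Callable, Iterable, Tuple


def lemma_accuracy(
    pairs: Iterable[Tuple[str, str]],
    lemmatize: Callable[[str], str],
) -> float:
    """Share of (token, gold lemma) pairs for which lemmatize(token) == gold."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one (token, lemma) pair is required")
    correct = sum(1 for token, gold in pairs if lemmatize(token) == gold)
    return correct / len(pairs)


# Hypothetical toy lemmatizer: dictionary lookup, falling back to the token itself.
TOY_DICT = {"cats": "cat", "ran": "run", "mice": "mouse"}


def toy_lemmatize(token: str) -> str:
    return TOY_DICT.get(token.lower(), token)


# Hypothetical gold data standing in for treebank tokens.
GOLD = [("cats", "cat"), ("ran", "run"), ("mice", "mouse"), ("geese", "goose")]

print(lemma_accuracy(GOLD, toy_lemmatize))  # 3 of 4 correct -> 0.75
```

Any callable that maps a token to a lemma can be passed in place of `toy_lemmatize`.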

This library is particularly relevant for the lemmatization of less frequent words, a case that the benchmark above captures only incidentally. In some languages, a fixed set of words such as pronouns can additionally be mapped by hand to improve performance.
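
One way to picture hand-mapped words is as a small override table that is consulted before the main dictionary. The sketch below is purely illustrative and does not reproduce Simplemma's internals: `HAND_MAPPED`, `DICTIONARY`, and `lookup` are hypothetical names, and the entries are toy data.

```python
# Hypothetical hand-curated overrides for a closed class of words (pronouns).
HAND_MAPPED = {"me": "I", "him": "he", "them": "they"}

# Hypothetical automatically compiled dictionary of form -> lemma.
DICTIONARY = {"houses": "house", "went": "go", "him": "him"}


def lookup(token: str) -> str:
    """Prefer a hand-mapped lemma, then the dictionary, then the token itself."""
    if token in HAND_MAPPED:
        return HAND_MAPPED[token]
    return DICTIONARY.get(token, token)


for word in ("him", "houses", "zyzzyva"):
    print(word, "->", lookup(word))
```

Here the override wins for "him" even though the dictionary holds a different entry, while unknown words are returned unchanged.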