Supported languages

The following languages are available, identified by their BCP 47 language tag, which typically corresponds to the ISO 639-1 code. If no such code exists, an ISO 639-3 code is used instead.

Available languages (2026-05-29):

The Forms column counts the inflected word forms stored in the dictionary, while Lemmata counts the distinct base forms they map to (both in thousands). A large gap between the two reflects rich morphology rather than a data error.

Code	Language	Forms (10³)	Lemmata (10³)	Accuracy	Comments
`ast`	Asturian	154	36
`bg`	Bulgarian	215	18
`ca`	Catalan	640	63
`cs`	Czech	200	26	0.89	on UD CS-PDT
`cy`	Welsh	363	14
`da`	Danish	555	81	0.92	on UD DA-DDT, alternative: lemmy
`de`	German	730	246	0.95	on UD DE-GSD, see also German-NLP list
`el`	Greek	185	21	0.88	on UD EL-GDT
`en`	English	139	50	0.94	on UD EN-GUM, alternative: LemmInflect
`enm`	Middle English	43	6
`eo`	Esperanto	191	18
`es`	Spanish	666	72	0.95	on UD ES-GSD
`et`	Estonian	141	34		low coverage
`fa`	Persian	13	4		experimental
`fi`	Finnish	3,549	124		see this benchmark
`fr`	French	248	37	0.94	on UD FR-GSD
`ga`	Irish	399	46
`gd`	Gaelic	59	12
`gl`	Galician	426	43
`gv`	Manx	76	13
`hbs`	Serbo-Croatian	674	52		Croatian and Serbian lists to be added later
`hi`	Hindi	58	11		experimental
`hu`	Hungarian	492	36
`hy`	Armenian	247	7
`id`	Indonesian	21	4	0.91	on UD ID-CSUI
`is`	Icelandic	177	15
`it`	Italian	357	28	0.93	on UD IT-ISDT
`ka`	Georgian	66	4
`la`	Latin	892	52
`lb`	Luxembourgish	306	79
`lt`	Lithuanian	268	25
`lv`	Latvian	166	14
`mk`	Macedonian	67	16
`ms`	Malay	18	4
`nb`	Norwegian (Bokmål)	618	134
`nl`	Dutch	366	124	0.92	on UD NL-Alpino
`nn`	Norwegian (Nynorsk)	68	18
`pl`	Polish	3,670	264	0.91	on UD PL-PDB
`pt`	Portuguese	924	94	0.92	on UD PT-GSD
`ro`	Romanian	342	36
`ru`	Russian	633	54		alternative: pymorphy2
`se`	Northern Sámi	115	7
`sk`	Slovak	889	71	0.92	on UD SK-SNK
`sl`	Slovene	165	30
`sq`	Albanian	38	5
`sv`	Swedish	745	93		alternative: lemmy
`sw`	Swahili	4,870	4		experimental
`tl`	Tagalog	39	8		experimental
`tr`	Turkish	1,236	40	0.89	on UD TR-Boun
`uk`	Ukrainian	388	22		alternative: pymorphy2
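
The relationship between the Forms and Lemmata columns can be checked with a quick calculation. The sketch below copies a handful of rows from the table (an illustrative subset, not the full data) and computes the average number of stored forms per lemma; the function name `forms_per_lemma` is only a label chosen for this example.

```python
# Illustrative subset of the table above: (forms, lemmata), both in thousands.
counts = {
    "en": (139, 50),
    "de": (730, 246),
    "fi": (3549, 124),
    "pl": (3670, 264),
}


def forms_per_lemma(forms: float, lemmata: float) -> float:
    """Average number of stored word forms per base form."""
    return forms / lemmata


ratios = {code: forms_per_lemma(*pair) for code, pair in counts.items()}

# Print the languages from the lowest to the highest ratio.
for code, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{code}: {ratio:.1f} forms per lemma")
```

On these rows, Finnish and Polish come out with far more forms per lemma than English or German, which is the kind of gap that rich inflection would be expected to produce.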

Languages marked as having low coverage may be better served by language-specific libraries, although Simplemma can still provide limited functionality. Where possible, open-source Python alternatives are referenced in the Comments column.

The "experimental" label indicates that the language remains untested or that there may be issues with the underlying data or the lemmatization process.

The accuracy scores in the table are calculated on Universal Dependencies treebanks, using single-word tokens (including some contractions but not merged prepositions). They describe how accurately Simplemma maps tokens to their lemma. See the training/ folder of the code repository for more information.
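
As a rough illustration of what such a score measures, the sketch below computes the share of tokens whose predicted lemma exactly matches the gold lemma. It is not the evaluation code from the training/ folder: `lemmatization_accuracy`, `toy_lemmatize`, and the sample pairs are hypothetical, and the exact-match, case-sensitive comparison is an assumption. In practice, the callable passed in could wrap the library's own lemmatizer for a given language.

```python
from typing import Callable, Iterable, Tuple


def lemmatization_accuracy(
    pairs: Iterable[Tuple[str, str]],
    lemmatize: Callable[[str], str],
) -> float:
    """Return the fraction of (token, gold lemma) pairs predicted exactly."""
    total = 0
    correct = 0
    for token, gold in pairs:
        total += 1
        if lemmatize(token) == gold:
            correct += 1
    if total == 0:
        raise ValueError("no tokens to evaluate")
    return correct / total


# Hypothetical stand-in for a real lemmatizer: a tiny lookup table that
# returns the token unchanged when it is unknown.
toy_lexicon = {"cats": "cat", "was": "be", "running": "run"}


def toy_lemmatize(token: str) -> str:
    return toy_lexicon.get(token, token)


# Made-up gold data; "mice" is missing from the toy lexicon on purpose.
gold_pairs = [
    ("cats", "cat"),
    ("was", "be"),
    ("running", "run"),
    ("mice", "mouse"),
]

score = lemmatization_accuracy(gold_pairs, toy_lemmatize)
print(f"accuracy: {score:.2f}")
```

Three of the four toy tokens are mapped correctly, so the sketch reports 0.75; the unknown word falls through unchanged and counts as an error.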

This library is particularly useful for lemmatizing less frequent words, a strength that the benchmark above captures only incidentally. In some languages, a fixed set of words such as pronouns can additionally be mapped by hand to improve performance.