Supported languages

The following languages are available using their ISO 639-1 code:

Available languages (2022-09-05)

Code	Language	Forms (10³)	Acc.	Comments
`bg`	Bulgarian	213
`ca`	Catalan	579
`cs`	Czech	187	0.88	on UD CS-PDT
`cy`	Welsh	360
`da`	Danish	554	0.92	on UD DA-DDT, alternative: lemmy
`de`	German	682	0.95	on UD DE-GSD, see also German-NLP list
`el`	Greek	183	0.88	on UD EL-GDT
`en`	English	136	0.94	on UD EN-GUM, alternative: LemmInflect
`enm`	Middle English	38
`es`	Spanish	720	0.94	on UD ES-GSD
`et`	Estonian	133		low coverage
`fa`	Persian	10		experimental
`fi`	Finnish	2,106		evaluation and alternatives: see this benchmark
`fr`	French	217	0.94	on UD FR-GSD
`ga`	Irish	383
`gd`	Gaelic	48
`gl`	Galician	384
`gv`	Manx	62
`hbs`	Serbo-Croatian	838		Croatian and Serbian lists to be added later
`hi`	Hindi	58		experimental
`hu`	Hungarian	458
`hy`	Armenian	323
`id`	Indonesian	17	0.91	on UD ID-CSUI
`is`	Icelandic	175
`it`	Italian	333	0.93	on UD IT-ISDT
`ka`	Georgian	65
`la`	Latin	850
`lb`	Luxembourgish	305
`lt`	Lithuanian	247
`lv`	Latvian	168
`mk`	Macedonian	57
`ms`	Malay	14
`nb`	Norwegian (Bokmål)	617
`nl`	Dutch	254	0.91	on UD-NL-Alpino
`nn`	Norwegian (Nynorsk)
`pl`	Polish	3,733	0.91	on UD-PL-PDB
`pt`	Portuguese	933	0.92	on UD-PT-GSD
`ro`	Romanian	311
`ru`	Russian	607		alternative: pymorphy2
`se`	Northern Sámi	113		experimental
`sk`	Slovak	846	0.92	on UD SK-SNK
`sl`	Slovene	136
`sq`	Albanian	35
`sv`	Swedish	658		alternative: lemmy
`sw`	Swahili	10		experimental
`tl`	Tagalog	33		experimental
`tr`	Turkish	1,333	0.88	on UD-TR-Boun
`uk`	Ukrainian	190		alternative: pymorphy2

Low coverage mentions means one would probably be better off with a language-specific library, but simplemma will work to a limited extent. Open-source alternatives for Python are referenced if possible.

Experimental mentions indicate that the language remains untested or that there could be issues with the underlying data or lemmatization process.

The scores are calculated on Universal Dependencies treebanks on single word tokens (including some contractions but not merged prepositions), they describe to what extent simplemma can accurately map tokens to their lemma form. They can be reproduced by concatenating all available UD files and by using the script udscore.py in the training/ folder.

This library is particularly relevant as regards the lemmatization of less frequent words. Its performance in this case is only incidentally captured by the benchmark above. In some languages, a fixed number of words such as pronouns can be further mapped by hand to enhance performance.