Introduction

Purpose

Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. Unlike stemming, lemmatization outputs word units that are still valid linguistic forms.

In modern natural language processing (NLP), this task is often indirectly tackled by more complex systems encompassing a whole processing pipeline. However, it appears that there is no straightforward way to address lemmatization in Python although this task can be crucial in fields such as information retrieval and NLP.

Simplemma provides a simple and multilingual approach to look for base forms or lemmata. It may not be as powerful as full-fledged solutions but it is generic, easy to install and straightforward to use. In particular, it does not need morphosyntactic information and can process a raw series of tokens or even a text with its built-in tokenizer. By design it should be reasonably fast and work in a large majority of cases, without being perfect.

With its comparatively small footprint it is especially useful when speed and simplicity matter, in low-resource contexts, for educational purposes, or as a baseline system for lemmatization and morphological analysis.

Currently, 50 languages are partly or fully supported (see the list of supported languages).

Installation

The current library is written in pure Python with no dependencies: pip install simplemma

pip install -U simplemma for updates
pip install git+https://github.com/adbar/simplemma for the cutting-edge version

The last version supporting Python 3.6 and 3.7 is simplemma==1.0.0.

Usage

Word-by-word

Simplemma is used by selecting a language of interest and then applying the data on a list of words.

>>> import simplemma
# get a word
myword = 'masks'
# decide which language to use and apply it on a word form
>>> simplemma.lemmatize(myword, lang='en')
'mask'
# apply it on a list of tokens
>>> mytokens = ['Hier', 'sind', 'Vaccines']
>>> [simplemma.lemmatize(t, lang='de') for t in mytokens]
['hier', 'sein', 'Vaccines']

Chaining languages

Chaining several languages can improve coverage, they are used in sequence:

>>> from simplemma import lemmatize
>>> lemmatize('Vaccines', lang=('de', 'en'))
'vaccine'
>>> lemmatize('spaghettis', lang='it')
'spaghettis'
>>> lemmatize('spaghettis', lang=('it', 'fr'))
'spaghetti'
>>> lemmatize('spaghetti', lang=('it', 'fr'))
'spaghetto'

Greedier decomposition

For certain languages a greedier decomposition is activated by default as it can be beneficial, mostly due to a certain capacity to address affixes in an unsupervised way. This can be triggered manually by setting the greedy parameter to True.

This option also triggers a stronger reduction through an additional iteration of the search algorithm, e.g. "angekündigten" → "angekündigt" (standard) → "ankündigen" (greedy). In some cases it may be closer to stemming than to lemmatization.

# same example as before, comes to this result in one step
>>> simplemma.lemmatize('spaghettis', lang=('it', 'fr'), greedy=True)
'spaghetto'
# German case described above
>>> simplemma.lemmatize('angekündigten', lang='de', greedy=True)
'ankündigen' # 2 steps: reduction to infinitive verb
>>> simplemma.lemmatize('angekündigten', lang='de', greedy=False)
'angekündigt' # 1 step: reduction to past participle

is_known()

The additional function is_known() checks if a given word is present in the language data:

>>> from simplemma import is_known
>>> is_known('spaghetti', lang='it')
True

Tokenization

A simple tokenization function is provided for convenience:

>>> from simplemma import simple_tokenizer
>>> simple_tokenizer('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.')
['Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', 'adipiscing', 'elit', ',', 'sed', 'do', 'eiusmod', 'tempor', 'incididunt', 'ut', 'labore', 'et', 'dolore', 'magna', 'aliqua', '.']
# for an iterator instead of a list, use the RegexTokenizer directly
>>> from simplemma import RegexTokenizer
>>> RegexTokenizer().split_text('Lorem ipsum dolor sit amet')
<generator object ...>

The functions text_lemmatizer() and lemma_iterator() chain tokenization and lemmatization. They accept the same greedy argument as lemmatize():

>>> from simplemma import text_lemmatizer
>>> sentence = 'Sou o intervalo entre o que desejo ser e os outros me fizeram.'
>>> text_lemmatizer(sentence, lang='pt')
# caveat: desejo is also a noun, should be desejar here
['ser', 'o', 'intervalo', 'entre', 'o', 'que', 'desejo', 'ser', 'e', 'o', 'outro', 'me', 'fazer', '.']
# same principle, returns a generator and not a list
>>> from simplemma import lemma_iterator
>>> lemma_iterator(sentence, lang='pt')

Caveats

# don't expect too much though
# this diminutive form isn't in the model data
>>> simplemma.lemmatize('spaghettini', lang='it')
'spaghettini' # should read 'spaghettino'
# the algorithm cannot choose between valid alternatives yet
>>> simplemma.lemmatize('son', lang='es')
'son' # valid common name, but what about the verb form?

As the focus lies on overall coverage, some short frequent words (typically: pronouns and conjunctions) may need post-processing, this generally concerns a few dozens of tokens per language.

The current absence of morphosyntactic information is an advantage in terms of simplicity. However, it is also an impassable frontier regarding lemmatization accuracy, for example when it comes to disambiguating between past participles and adjectives derived from verbs in Germanic and Romance languages. In most cases, simplemma often does not change such input words.

The greedy algorithm seldom produces invalid forms. It is designed to work best in the low-frequency range, notably for compound words and neologisms. Aggressive decomposition is only useful as a general approach in the case of morphologically-rich languages, where it can also act as a linguistically motivated stemmer.

Bug reports over the issues page are welcome.

Language detection

Language detection works by providing a text and a tuple lang consisting of a series of languages of interest. Each score is a proportion between 0 and 1. The proportions are computed independently per language, so a token recognized in several languages counts towards each of them and the scores need not sum to 1.

The langdetect() function returns a list of language codes along with their corresponding scores, appending "unk" for the proportion of unknown or out-of-vocabulary tokens. The proportion of tokens that belong to the target language(s) can also be obtained directly with the in_target_language() function, which returns a single ratio.

# import necessary functions
>>> from simplemma import in_target_language, langdetect
# language detection
>>> langdetect('"Exoplaneta, též extrasolární planeta, je planeta obíhající kolem jiné hvězdy než kolem Slunce."', lang=("cs", "sk"))
[("cs", 0.75), ("sk", 0.125), ("unk", 0.25)]
# proportion of known words
>>> in_target_language("opera post physica posita (τὰ μετὰ τὰ φυσικά)", lang="la")
0.5

The greedy argument (extensive in past software versions) triggers use of the greedier decomposition algorithm described above, thus extending word coverage and recall of detection at the potential cost of a lesser accuracy.

Advanced usage via classes

The functions described above are suitable for simple usage, but you can have more control by instantiating Simplemma classes and calling their methods instead. Lemmatization is handled by the Lemmatizer class, while language detection is handled by the LanguageDetector class. These in turn rely on different lemmatization strategies, which are implementations of the LemmatizationStrategy protocol. The DefaultStrategy implementation uses a combination of different strategies, one of which is DictionaryLookupStrategy. It looks up tokens in a dictionary created by a DictionaryFactory.

For example, it is possible to conserve RAM by limiting the number of cached language dictionaries (default: 8) by creating a custom DefaultDictionaryFactory with a specific cache_max_size setting, creating a DefaultStrategy using that factory, and then creating a Lemmatizer and/or a LanguageDetector using that strategy:

# import necessary classes
>>> from simplemma import LanguageDetector, Lemmatizer
>>> from simplemma.strategies import DefaultStrategy
>>> from simplemma.strategies.dictionaries import DefaultDictionaryFactory

LANG_CACHE_SIZE = 5  # How many language dictionaries to keep in memory at once (max)
>>> dictionary_factory = DefaultDictionaryFactory(cache_max_size=LANG_CACHE_SIZE)
>>> lemmatization_strategy = DefaultStrategy(dictionary_factory=dictionary_factory)

# lemmatize using the above customized strategy
>>> lemmatizer = Lemmatizer(lemmatization_strategy=lemmatization_strategy)
>>> lemmatizer.lemmatize('doughnuts', lang='en')
'doughnut'

# detect languages using the above customized strategy
>>> language_detector = LanguageDetector('la', lemmatization_strategy=lemmatization_strategy)
>>> language_detector.proportion_in_target_languages("opera post physica posita (τὰ μετὰ τὰ φυσικά)")
0.5

For more information see the extended documentation.

Reducing memory usage

Simplemma provides an alternative solution for situations where low memory usage and fast initialization time are more important than lemmatization and language detection performance. This solution uses a DictionaryFactory that employs a trie as its underlying data structure, rather than a Python dict.

The TrieDictionaryFactory reduces memory usage by an average of 20x and initialization time by 100x, but this comes at the cost of potentially reducing performance by 50% or more, depending on the specific usage.

To use the TrieDictionaryFactory you have to install Simplemma with the marisa-trie extra dependency (available from version 1.1.0):

pip install simplemma[marisa-trie]

Then you have to create a custom strategy using the TrieDictionaryFactory and use that for Lemmatizer and LanguageDetector instances:

>>> from simplemma import LanguageDetector, Lemmatizer
>>> from simplemma.strategies import DefaultStrategy
>>> from simplemma.strategies.dictionaries import TrieDictionaryFactory

>>> lemmatization_strategy = DefaultStrategy(dictionary_factory=TrieDictionaryFactory())

>>> lemmatizer = Lemmatizer(lemmatization_strategy=lemmatization_strategy)
>>> lemmatizer.lemmatize('doughnuts', lang='en')
'doughnut'

>>> language_detector = LanguageDetector('la', lemmatization_strategy=lemmatization_strategy)
>>> language_detector.proportion_in_target_languages("opera post physica posita (τὰ μετὰ τὰ φυσικά)")
0.5

While memory usage and initialization time when using the TrieDictionaryFactory are significantly lower compared to the DefaultDictionaryFactory, that's only true if the trie dictionaries are available on disk. That's not the case when using the TrieDictionaryFactory for the first time, as Simplemma only ships the dictionaries as Python dicts. The trie dictionaries have to be generated once from the Python dicts. That happens on-the-fly when using the TrieDictionaryFactory for the first time for a language and will take a few seconds and use as much memory as loading the Python dicts for the language requires. For further invocations the trie dictionaries get cached on disk.

If the machine that will run Simplemma doesn't have enough memory to generate the trie dictionaries, they can also be generated on another computer with the same CPU architecture and copied over to the cache directory.

For the full set of examples (language chaining, greedy decomposition, tokenization, language detection and class-based usage), see the project README and the API Reference.