Greedy Dictionary Lookup Strategy

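This module defines the GreedyDictionaryLookupStrategy class, a concrete implementation of the LemmatizationStrategy protocol. The strategy repeatedly looks a token up in the language dictionary and accepts each candidate only if it is no longer than the current token and within a maximum Levenshtein distance of it.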

Classes

GreedyDictionaryLookupStrategy

Bases: LemmatizationStrategy

This class represents a lemmatization strategy that performs lemmatization using a greedy dictionary lookup strategy.

Source code in simplemma/strategies/greedy_dictionary_lookup.py
class GreedyDictionaryLookupStrategy(LemmatizationStrategy):
    """
    This class represents a lemmatization strategy that performs lemmatization using a greedy dictionary lookup strategy.
    """

    __slots__ = ["_dictionary_factory", "_distance", "_steps"]

    def __init__(
        self,
        dictionary_factory: DictionaryFactory = DefaultDictionaryFactory(),
        steps: int = 1,
        distance: int = 5,
    ):
        """
        Initialize the Greedy Dictionary Lookup Strategy.

        Args:
            dictionary_factory (DictionaryFactory): The dictionary factory used to obtain language dictionaries.
                Defaults to [`DefaultDictionaryFactory()`][simplemma.strategies.dictionaries.dictionary_factory.DefaultDictionaryFactory].
            steps (int): The maximum number of lemmatization steps to perform. Defaults to `1`.
            distance (int): The maximum allowed Levenshtein distance between the current token and a candidate lemma. Defaults to `5`.

        """
        self._dictionary_factory = dictionary_factory
        self._steps = steps
        self._distance = distance

    def get_lemma(self, token: str, lang: str) -> str:
        """
        Get Lemma using Greedy Dictionary Lookup Strategy

        This method lemmatizes the token by greedy lookup in the language-specific dictionary.
        Tokens no longer than a language-dependent length limit are returned unchanged.
        Otherwise the lookup is repeated for at most `steps` iterations and stops early as soon as no candidate is found,
        the candidate is longer than the current token, or the Levenshtein distance between the two exceeds `distance`.

        Args:
            token (str): The input token to lemmatize.
            lang (str): The language code for the token's language.

        Returns:
            str: The lemma for the token.

        """
        limit = 6 if lang in SHORTER_GREEDY else 8
        if len(token) <= limit:
            return token

        dictionary = self._dictionary_factory.get_dictionary(lang)

        for _ in range(self._steps):
            candidate = dictionary.get(token)

            if (
                not candidate
                or len(candidate) > len(token)
                or levenshtein_dist(candidate, token) > self._distance
            ):
                break

            token = candidate

        return token

Functions

__init__(dictionary_factory=DefaultDictionaryFactory(), steps=1, distance=5)

Initialize the Greedy Dictionary Lookup Strategy.

Parameters:

dictionary_factory (DictionaryFactory): The dictionary factory used to obtain language dictionaries. Defaults to DefaultDictionaryFactory().

steps (int): The maximum number of lemmatization steps to perform. Defaults to 1.

distance (int): The maximum allowed Levenshtein distance between the current token and a candidate lemma. Defaults to 5.

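
The source on this page only ever calls `get_dictionary(lang)` on the factory. Assuming that is all the strategy needs, a custom factory can be sketched as below. `InMemoryDictionaryFactory` and its sample entry are hypothetical names and data invented for illustration; they are not part of simplemma.

```python
from typing import Dict


class InMemoryDictionaryFactory:
    """Hypothetical factory that serves small hand-written dictionaries."""

    def __init__(self, dictionaries: Dict[str, Dict[str, str]]) -> None:
        # Maps a language code to its {inflected form: lemma} dictionary.
        self._dictionaries = dictionaries

    def get_dictionary(self, lang: str) -> Dict[str, str]:
        # Unknown languages get an empty mapping, so every lookup simply misses.
        return self._dictionaries.get(lang, {})


# Toy data for illustration only, not real dictionary content.
factory = InMemoryDictionaryFactory({"en": {"lemmatizations": "lemmatization"}})
english = factory.get_dictionary("en")
print(english.get("lemmatizations"))
```

An object like this could presumably be passed as the `dictionary_factory` argument, provided it also satisfies whatever else the `DictionaryFactory` type requires.
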
Source code in simplemma/strategies/greedy_dictionary_lookup.py
def __init__(
    self,
    dictionary_factory: DictionaryFactory = DefaultDictionaryFactory(),
    steps: int = 1,
    distance: int = 5,
):
    """
    Initialize the Greedy Dictionary Lookup Strategy.

    Args:
        dictionary_factory (DictionaryFactory): The dictionary factory used to obtain language dictionaries.
            Defaults to [`DefaultDictionaryFactory()`][simplemma.strategies.dictionaries.dictionary_factory.DefaultDictionaryFactory].
        steps (int): The maximum number of lemmatization steps to perform. Defaults to `1`.
        distance (int): The maximum allowed Levenshtein distance between the current token and a candidate lemma. Defaults to `5`.

    """
    self._dictionary_factory = dictionary_factory
    self._steps = steps
    self._distance = distance
get_lemma(token, lang)

Get Lemma using Greedy Dictionary Lookup Strategy

This method lemmatizes the token by greedy lookup in the language-specific dictionary. Tokens no longer than a language-dependent length limit are returned unchanged. Otherwise the lookup is repeated for at most `steps` iterations and stops early as soon as no candidate is found, the candidate is longer than the current token, or the Levenshtein distance between the two exceeds `distance`.

Parameters:

token (str): The input token to lemmatize. Required.

lang (str): The language code for the token's language. Required.

Returns:

str: The lemma for the token.
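
The loop described above can be tried out without the library. The following is a minimal standalone sketch, not the library's implementation: `greedy_lookup`, `levenshtein`, `SHORTER_GREEDY_SKETCH`, and the toy dictionary are all hypothetical stand-ins, and the edit distance is assumed to be the classic Levenshtein distance.

```python
from typing import Dict

# Placeholder for the library's SHORTER_GREEDY constant, whose contents
# are not shown on this page; "xx" is an invented language code.
SHORTER_GREEDY_SKETCH = {"xx"}


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution
                )
            )
        previous = current
    return previous[-1]


def greedy_lookup(
    token: str,
    lang: str,
    dictionary: Dict[str, str],
    steps: int = 1,
    distance: int = 5,
) -> str:
    # Short tokens are returned unchanged.
    limit = 6 if lang in SHORTER_GREEDY_SKETCH else 8
    if len(token) <= limit:
        return token

    for _ in range(steps):
        candidate = dictionary.get(token)
        # Stop on a miss, a longer candidate, or a candidate too far away.
        if (
            not candidate
            or len(candidate) > len(token)
            or levenshtein(candidate, token) > distance
        ):
            break
        token = candidate
    return token


# Toy entries for illustration only, not real dictionary content.
toy = {
    "lemmatizations": "lemmatization",
    "lemmatization": "lemmatize",
    "walking": "walk",
}

print(greedy_lookup("lemmatizations", "en", toy, steps=1))  # lemmatization
print(greedy_lookup("lemmatizations", "en", toy, steps=2))  # lemmatize
print(greedy_lookup("lemmatizations", "en", toy, steps=2, distance=4))  # lemmatization
```

With `steps=2` the second lookup is accepted because the two forms are exactly 5 edits apart, which does not exceed the default `distance`. Lowering `distance` to 4 stops the chain after the first step. The 7-character token "walking" is returned unchanged for "en" because it falls under the 8-character limit.
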

Source code in simplemma/strategies/greedy_dictionary_lookup.py
def get_lemma(self, token: str, lang: str) -> str:
    """
    Get Lemma using Greedy Dictionary Lookup Strategy

    This method lemmatizes the token by greedy lookup in the language-specific dictionary.
    Tokens no longer than a language-dependent length limit are returned unchanged.
    Otherwise the lookup is repeated for at most `steps` iterations and stops early as soon as no candidate is found,
    the candidate is longer than the current token, or the Levenshtein distance between the two exceeds `distance`.

    Args:
        token (str): The input token to lemmatize.
        lang (str): The language code for the token's language.

    Returns:
        str: The lemma for the token.

    """
    limit = 6 if lang in SHORTER_GREEDY else 8
    if len(token) <= limit:
        return token

    dictionary = self._dictionary_factory.get_dictionary(lang)

    for _ in range(self._steps):
        candidate = dictionary.get(token)

        if (
            not candidate
            or len(candidate) > len(token)
            or levenshtein_dist(candidate, token) > self._distance
        ):
            break

        token = candidate

    return token

Functions