Prefix Decomposition Strategy

This module defines the PrefixDecompositionStrategy class, which is a concrete implementation of the LemmatizationStrategy protocol. It provides lemmatization by performing subword decomposition using pre-defined prefixes.

Classes

PrefixDecompositionStrategy

Bases: LemmatizationStrategy

This class implements the LemmatizationStrategy protocol. It lemmatizes a token by splitting off a pre-defined prefix and looking up the remaining subword in the dictionary.

Source code in simplemma/strategies/prefix_decomposition.py
class PrefixDecompositionStrategy(LemmatizationStrategy):
    """
    This class represents a lemmatization strategy that performs lemmatization by performing subword decomposition using pre-defined prefixes.
    It implements the `LemmatizationStrategy` protocol.
    """

    __slots__ = ["_known_prefixes", "_dictionary_lookup"]

    def __init__(
        self,
        known_prefixes: Dict[str, Pattern[str]] = DEFAULT_KNOWN_PREFIXES,
        dictionary_lookup: DictionaryLookupStrategy = DictionaryLookupStrategy(),
    ):
        """
        Initialize the Prefix Decomposition Strategy.

        Args:
            known_prefixes (Dict[str, Pattern[str]]): A dictionary of known prefixes for various languages.
                Defaults to `DEFAULT_KNOWN_PREFIXES`.
            dictionary_lookup (DictionaryLookupStrategy): The dictionary lookup strategy used to find dictionary forms.
                Defaults to `DictionaryLookupStrategy()`.

        """
        self._known_prefixes = known_prefixes
        self._dictionary_lookup = dictionary_lookup

    def get_lemma(self, token: str, lang: str) -> Optional[str]:
        """
        Get Lemma using Prefix Decomposition Strategy

        This method performs lemmatization by performing subword decomposition using pre-defined prefixes.
        It checks if the language has known prefixes defined.
        If a known prefix is found at the start of the token, it extracts the prefix and performs dictionary lookup on the remaining subword.
        If a lemma is found for the subword, it returns the concatenation of the prefix and the lowercase subword.
        If no known prefix is found or no lemma is found for the subword, None is returned.

        Args:
            token (str): The input token to lemmatize.
            lang (str): The language code for the token's language.

        Returns:
            Optional[str]: The lemma for the token, or None if no lemma is found.

        """
        if lang not in self._known_prefixes:
            return None

        prefix_match = self._known_prefixes[lang].match(token)
        if not prefix_match:
            return None
        prefix = prefix_match[1]

        if prefix == token:
            return None

        subword = self._dictionary_lookup.get_lemma(token[len(prefix) :], lang)
        if subword is None:
            return None

        return prefix + subword.lower()

Functions

__init__(known_prefixes=DEFAULT_KNOWN_PREFIXES, dictionary_lookup=DictionaryLookupStrategy())

Initialize the Prefix Decomposition Strategy.

Parameters:

Name | Type | Description | Default
known_prefixes | Dict[str, Pattern[str]] | Maps each language code to a compiled pattern that matches that language's known prefixes. | DEFAULT_KNOWN_PREFIXES
dictionary_lookup | DictionaryLookupStrategy | The dictionary lookup strategy used to find dictionary forms. | DictionaryLookupStrategy()

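To illustrate the shape that known_prefixes is expected to have, here is a minimal sketch of a custom prefix table. The name CUSTOM_PREFIXES and the prefix lists are hypothetical and chosen only for illustration; the actual contents of DEFAULT_KNOWN_PREFIXES may differ. The one constraint taken from the source is that get_lemma reads prefix_match[1], so each pattern needs a first capturing group that holds the prefix.

```python
import re
from typing import Dict, Pattern

# Hypothetical prefix table for illustration; not the library's defaults.
CUSTOM_PREFIXES: Dict[str, Pattern[str]] = {
    # Group 1 must capture the prefix, because get_lemma reads prefix_match[1].
    "de": re.compile(r"^(ab|an|auf|aus|ein|vor|zu)"),
    # Longer alternatives come first so "under" is tried before "un".
    "en": re.compile(r"^(over|under|re|un)"),
}

token = "aufmachen"
match = CUSTOM_PREFIXES["de"].match(token)
prefix = match[1] if match else None
# The remainder is what would be passed on to the dictionary lookup.
remainder = token[len(prefix):] if prefix else None
```

A table like this could then be passed as the known_prefixes argument when constructing the strategy.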
Source code in simplemma/strategies/prefix_decomposition.py
def __init__(
    self,
    known_prefixes: Dict[str, Pattern[str]] = DEFAULT_KNOWN_PREFIXES,
    dictionary_lookup: DictionaryLookupStrategy = DictionaryLookupStrategy(),
):
    """
    Initialize the Prefix Decomposition Strategy.

    Args:
        known_prefixes (Dict[str, Pattern[str]]): A dictionary of known prefixes for various languages.
            Defaults to `DEFAULT_KNOWN_PREFIXES`.
        dictionary_lookup (DictionaryLookupStrategy): The dictionary lookup strategy used to find dictionary forms.
            Defaults to `DictionaryLookupStrategy()`.

    """
    self._known_prefixes = known_prefixes
    self._dictionary_lookup = dictionary_lookup
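One consequence of the signature above is worth noting: Python evaluates default argument values once, when the function is defined. The following sketch uses hypothetical stand-in classes (ToyLookup and ToyStrategy are not part of simplemma) to show that instances built without an explicit lookup argument share the same default object, while passing a fresh object keeps them separate.

```python
class ToyLookup:
    """Hypothetical stand-in for a dictionary lookup strategy."""


class ToyStrategy:
    """Hypothetical stand-in that mirrors the default-argument pattern."""

    def __init__(self, lookup: ToyLookup = ToyLookup()):
        # The default ToyLookup() above is created once, at definition time.
        self.lookup = lookup


a = ToyStrategy()
b = ToyStrategy()
c = ToyStrategy(ToyLookup())

shared = a.lookup is b.lookup        # both use the single default object
separate = c.lookup is not a.lookup  # an explicit argument is independent
```

Whether sharing matters in practice depends on how much state the lookup object holds, which this page does not describe.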
get_lemma(token, lang)

Get Lemma using Prefix Decomposition Strategy

This method lemmatizes a token by prefix decomposition. It first checks whether known prefixes are defined for the language and returns None if they are not. If a known prefix matches at the start of the token, the method strips the prefix and looks up the remaining subword with the dictionary lookup strategy. If that lookup yields a lemma, the method returns the prefix concatenated with the lowercased lemma. It returns None when no known prefix matches, when the token consists only of the prefix, or when no lemma is found for the subword.
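The control flow described above can be sketched as a standalone function. This is a simplified illustration, not the library's code: prefix_decompose, TOY_PREFIXES, TOY_LEMMAS, and toy_lookup are hypothetical names, and the toy dictionary stands in for the real DictionaryLookupStrategy.

```python
import re
from typing import Callable, Dict, Optional, Pattern

LookupFn = Callable[[str, str], Optional[str]]


def prefix_decompose(
    token: str,
    lang: str,
    known_prefixes: Dict[str, Pattern[str]],
    lookup: LookupFn,
) -> Optional[str]:
    """Mirror the steps of get_lemma with an injectable lookup function."""
    pattern = known_prefixes.get(lang)
    if pattern is None:  # no prefixes defined for this language
        return None
    prefix_match = pattern.match(token)
    if not prefix_match:  # token does not start with a known prefix
        return None
    prefix = prefix_match[1]
    if prefix == token:  # nothing left to look up
        return None
    lemma = lookup(token[len(prefix):], lang)
    if lemma is None:  # remainder is not in the dictionary
        return None
    return prefix + lemma.lower()


# Hypothetical toy data, for illustration only.
TOY_PREFIXES = {"de": re.compile(r"^(ab|an|auf|aus|ein|vor|zu)")}
TOY_LEMMAS = {("de", "gemacht"): "machen", ("de", "gegangen"): "gehen"}


def toy_lookup(word: str, lang: str) -> Optional[str]:
    return TOY_LEMMAS.get((lang, word))


result = prefix_decompose("aufgemacht", "de", TOY_PREFIXES, toy_lookup)
```

With this toy data, "aufgemacht" splits into the prefix "auf" and the subword "gemacht", whose toy lemma "machen" is appended to the prefix. The other branches (unknown language, no prefix, prefix-only token, unknown subword) all return None.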

Parameters:

Name | Type | Description | Default
token | str | The input token to lemmatize. | required
lang | str | The language code for the token's language. | required

Returns:

Type | Description
Optional[str] | The lemma for the token, or None if no lemma is found.

Source code in simplemma/strategies/prefix_decomposition.py
def get_lemma(self, token: str, lang: str) -> Optional[str]:
    """
    Get Lemma using Prefix Decomposition Strategy

    This method performs lemmatization by performing subword decomposition using pre-defined prefixes.
    It checks if the language has known prefixes defined.
    If a known prefix is found at the start of the token, it extracts the prefix and performs dictionary lookup on the remaining subword.
    If a lemma is found for the subword, it returns the concatenation of the prefix and the lowercase subword.
    If no known prefix is found or no lemma is found for the subword, None is returned.

    Args:
        token (str): The input token to lemmatize.
        lang (str): The language code for the token's language.

    Returns:
        Optional[str]: The lemma for the token, or None if no lemma is found.

    """
    if lang not in self._known_prefixes:
        return None

    prefix_match = self._known_prefixes[lang].match(token)
    if not prefix_match:
        return None
    prefix = prefix_match[1]

    if prefix == token:
        return None

    subword = self._dictionary_lookup.get_lemma(token[len(prefix) :], lang)
    if subword is None:
        return None

    return prefix + subword.lower()