
Hyphen Removal Strategy

This module defines the HyphenRemovalStrategy class, which is a concrete implementation of the LemmatizationStrategy protocol. It provides lemmatization by removing hyphens from tokens and attempting to find dictionary forms.

Classes

HyphenRemovalStrategy

Bases: LemmatizationStrategy

This class represents a lemmatization strategy that performs lemmatization by removing hyphens from tokens and attempting to find dictionary forms. It implements the LemmatizationStrategy protocol.
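The behavior described above can be sketched without the library. This is a minimal illustration, not the library's implementation: the `HYPHENS` set, the `HYPHEN_REGEX` pattern, the `TOY_DICTIONARY` mapping, and the `toy_hyphen_removal` function are all assumptions made up for this example, with a plain dict standing in for the dictionary lookup strategy.

```python
import re
from typing import Dict, Optional

# Assumptions for illustration only: a two-character hyphen set, a
# capturing-group pattern, and a plain dict standing in for the
# dictionary lookup strategy.
HYPHENS = {"-", "_"}
HYPHEN_REGEX = re.compile(r"([-_])")
TOY_DICTIONARY: Dict[str, str] = {"email": "email", "houses": "house"}


def toy_hyphen_removal(token: str) -> Optional[str]:
    """Mirror the two lookup stages described for get_lemma."""
    parts = HYPHEN_REGEX.split(token)
    if len(parts) <= 1 or not parts[-1]:
        return None  # no hyphen, or nothing after the last hyphen

    # Stage 1: look up the token with all hyphens removed.
    candidate = "".join(p for p in parts if p not in HYPHENS).lower()
    if token[0].isupper():
        candidate = candidate.capitalize()
    if candidate in TOY_DICTIONARY:
        return TOY_DICTIONARY[candidate]

    # Stage 2: lemmatize only the last part and keep the prefix as written.
    last = TOY_DICTIONARY.get(parts[-1])
    if last is not None:
        return "".join(parts[:-1] + [last])
    return None


print(toy_hyphen_removal("e-mail"))       # email
print(toy_hyphen_removal("tree-houses"))  # tree-house
print(toy_hyphen_removal("houses"))       # None
print(toy_hyphen_removal("tree-"))        # None
```

In the toy dictionary, "e-mail" resolves in the first stage, while "tree-houses" only resolves in the second stage and keeps its hyphen.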

Source code in simplemma/strategies/hyphen_removal.py
class HyphenRemovalStrategy(LemmatizationStrategy):
    """
    This class represents a lemmatization strategy that performs lemmatization by removing hyphens from tokens
    and attempting to find dictionary forms.
    It implements the `LemmatizationStrategy` protocol.
    """

    __slots__ = ["_dictionary_lookup"]

    def __init__(
        self, dictionary_lookup: DictionaryLookupStrategy = DictionaryLookupStrategy()
    ):
        """
        Initialize the Hyphen Removal Strategy.

        Args:
            dictionary_lookup (DictionaryLookupStrategy): The dictionary lookup strategy used to find dictionary forms.
                Defaults to `DictionaryLookupStrategy()`.

        """
        self._dictionary_lookup = dictionary_lookup

    def get_lemma(self, token: str, lang: str) -> Optional[str]:
        """
        Get Lemma using Hyphen Removal Strategy

        This method performs lemmatization by removing hyphens from the token and attempting to find a dictionary form.
        It splits the token based on hyphen characters, removes hyphens, and forms a candidate lemma for lookup.
        If a dictionary form is found, it is returned as the lemma.
        If not found, it attempts to decompose the token by looking up the last part (after the last hyphen) in the dictionary.
        If a lemma is found for the last part, it replaces the last part in the token and returns the modified token as the lemma.
        If no dictionary form is found, None is returned.

        Args:
            token (str): The input token to lemmatize.
            lang (str): The language code for the token's language.

        Returns:
            Optional[str]: The lemma for the token, or None if no lemma is found.

        """
        token_parts = HYPHEN_REGEX.split(token)
        if len(token_parts) <= 1 or not token_parts[-1]:
            return None

        # try to find a word form without hyphen
        candidate = "".join([t for t in token_parts if t not in HYPHENS]).lower()
        if token[0].isupper():
            candidate = candidate.capitalize()

        lemma = self._dictionary_lookup.get_lemma(candidate, lang)
        if lemma is not None:
            return lemma

        # decompose
        last_part_lemma = self._dictionary_lookup.get_lemma(token_parts[-1], lang)
        if last_part_lemma is not None:
            return "".join(token_parts[:-1] + [last_part_lemma])

        return None

Functions

__init__(dictionary_lookup=DictionaryLookupStrategy())

Initialize the Hyphen Removal Strategy.

Parameters:

Name               Type                      Description                                                    Default
dictionary_lookup  DictionaryLookupStrategy  The dictionary lookup strategy used to find dictionary forms.  DictionaryLookupStrategy()

Source code in simplemma/strategies/hyphen_removal.py
def __init__(
    self, dictionary_lookup: DictionaryLookupStrategy = DictionaryLookupStrategy()
):
    """
    Initialize the Hyphen Removal Strategy.

    Args:
        dictionary_lookup (DictionaryLookupStrategy): The dictionary lookup strategy used to find dictionary forms.
            Defaults to `DictionaryLookupStrategy()`.

    """
    self._dictionary_lookup = dictionary_lookup
get_lemma(token, lang)

Get Lemma using Hyphen Removal Strategy

This method lemmatizes a hyphenated token in two stages. It first splits the token on hyphen characters; if the token contains no hyphen, or nothing follows the last hyphen, it returns None. Otherwise it joins the parts without the hyphens, lowercases the result (capitalizing it again if the original token starts with an uppercase letter), and looks up that candidate in the dictionary. If a dictionary form is found, it is returned as the lemma. If not, the method looks up only the last part (after the last hyphen); if that part has a lemma, the lemma replaces the last part and the otherwise unchanged token, hyphens included, is returned. If neither lookup succeeds, None is returned.
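The splitting and candidate-building steps can be looked at in isolation. The sketch below assumes a hyphen set of `-` and `_` and a pattern with a capturing group, so that the separators stay in the split result; the library's real `HYPHENS` and `HYPHEN_REGEX` definitions are not shown on this page, and `build_candidate` is a hypothetical helper name.

```python
import re
from typing import List

# Assumed definitions; the real constants are defined elsewhere in the library.
HYPHENS = {"-", "_"}
HYPHEN_REGEX = re.compile(r"([-_])")


def build_candidate(token: str) -> str:
    """Build the hyphen-free lookup candidate for a token."""
    # The capturing group keeps each hyphen as its own list element.
    parts: List[str] = HYPHEN_REGEX.split(token)
    # Drop the hyphen elements and normalize to lowercase.
    candidate = "".join(p for p in parts if p not in HYPHENS).lower()
    # Restore an initial capital; token[:1] tolerates an empty string,
    # since this helper has no earlier guard like get_lemma does.
    if token[:1].isupper():
        candidate = candidate.capitalize()
    return candidate


print(HYPHEN_REGEX.split("well-known"))  # ['well', '-', 'known']
print(build_candidate("well-known"))     # wellknown
print(build_candidate("Nord-Süd"))       # Nordsüd
print(HYPHEN_REGEX.split("pre-"))        # ['pre', '-', '']
```

The last line shows why the method checks for an empty final part: a trailing hyphen leaves nothing to look up in the decomposition stage.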

Parameters:

Name   Type  Description                                  Default
token  str   The input token to lemmatize.                required
lang   str   The language code for the token's language.  required

Returns:

Type           Description
Optional[str]  The lemma for the token, or None if no lemma is found.
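Because the method returns None when it finds nothing, a caller has to decide what happens next. One possible pattern, sketched here with hypothetical names (`LemmaFn`, `first_lemma`, and the two stand-in functions are not part of the library), is to try several callables with the same `(token, lang)` shape in order and fall back to the token itself.

```python
from typing import Callable, Optional, Sequence

# Hypothetical type alias: any callable with get_lemma's (token, lang) shape.
LemmaFn = Callable[[str, str], Optional[str]]


def first_lemma(token: str, lang: str, strategies: Sequence[LemmaFn]) -> str:
    """Return the first non-None result, or the token itself as a last resort."""
    for strategy in strategies:
        lemma = strategy(token, lang)
        if lemma is not None:
            return lemma
    return token


def never_matches(token: str, lang: str) -> Optional[str]:
    """Stand-in for a strategy that finds nothing."""
    return None


def strip_plural_s(token: str, lang: str) -> Optional[str]:
    """Deliberately naive stand-in; not a real lemmatization rule."""
    return token[:-1] if token.endswith("s") and len(token) > 1 else None


print(first_lemma("cats", "en", [never_matches, strip_plural_s]))  # cat
print(first_lemma("dog", "en", [never_matches, strip_plural_s]))   # dog
```

Comparing against None explicitly, rather than testing truthiness, keeps an empty-string result distinguishable from "no lemma found".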

Source code in simplemma/strategies/hyphen_removal.py
def get_lemma(self, token: str, lang: str) -> Optional[str]:
    """
    Get Lemma using Hyphen Removal Strategy

    This method performs lemmatization by removing hyphens from the token and attempting to find a dictionary form.
    It splits the token based on hyphen characters, removes hyphens, and forms a candidate lemma for lookup.
    If a dictionary form is found, it is returned as the lemma.
    If not found, it attempts to decompose the token by looking up the last part (after the last hyphen) in the dictionary.
    If a lemma is found for the last part, it replaces the last part in the token and returns the modified token as the lemma.
    If no dictionary form is found, None is returned.

    Args:
        token (str): The input token to lemmatize.
        lang (str): The language code for the token's language.

    Returns:
        Optional[str]: The lemma for the token, or None if no lemma is found.

    """
    token_parts = HYPHEN_REGEX.split(token)
    if len(token_parts) <= 1 or not token_parts[-1]:
        return None

    # try to find a word form without hyphen
    candidate = "".join([t for t in token_parts if t not in HYPHENS]).lower()
    if token[0].isupper():
        candidate = candidate.capitalize()

    lemma = self._dictionary_lookup.get_lemma(candidate, lang)
    if lemma is not None:
        return lemma

    # decompose
    last_part_lemma = self._dictionary_lookup.get_lemma(token_parts[-1], lang)
    if last_part_lemma is not None:
        return "".join(token_parts[:-1] + [last_part_lemma])

    return None