To Lowercase Strategy

This module defines the ToLowercaseFallbackStrategy class, which is a concrete implementation of the LemmatizationFallbackStrategy protocol. It represents a fallback strategy that converts tokens to lowercase for specific languages.

Classes

ToLowercaseFallbackStrategy

Bases: LemmatizationFallbackStrategy

ToLowercaseFallbackStrategy is a concrete implementation of the LemmatizationFallbackStrategy protocol. It is used when no other lemmatization strategy can determine a lemma: the token is returned in lowercase if its language is in a configured set of languages, and unchanged otherwise.
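
To make the role of a fallback concrete, here is a minimal, self-contained sketch rather than the library's actual wiring. `FallbackStrategy`, `LowercaseFallback`, `lemmatize`, and `toy_lexicon` are hypothetical stand-ins invented for illustration; only the one-line `get_lemma` logic is taken from the source listing on this page, and the assumed protocol shape is a single `get_lemma(token, lang)` method.

```python
from typing import Dict, Protocol, Set


class FallbackStrategy(Protocol):
    """Hypothetical stand-in for LemmatizationFallbackStrategy (assumed shape)."""

    def get_lemma(self, token: str, lang: str) -> str: ...


class LowercaseFallback:
    """Hypothetical class mirroring the get_lemma logic in the source listing."""

    def __init__(self, langs_to_lower: Set[str]) -> None:
        self._langs_to_lower = langs_to_lower

    def get_lemma(self, token: str, lang: str) -> str:
        return token.lower() if lang in self._langs_to_lower else token


def lemmatize(
    token: str, lang: str, lexicon: Dict[str, str], fallback: FallbackStrategy
) -> str:
    """Hypothetical driver: try a lexicon lookup first, then defer to the fallback."""
    lemma = lexicon.get(token)
    return lemma if lemma is not None else fallback.get_lemma(token, lang)


toy_lexicon = {"casas": "casa"}       # illustrative data, not from simplemma
fallback = LowercaseFallback({"es"})  # hypothetical language set

known = lemmatize("casas", "es", toy_lexicon, fallback)       # found in the lexicon
unknown_es = lemmatize("Xyzzy", "es", toy_lexicon, fallback)  # fallback lowercases it
unknown_de = lemmatize("Xyzzy", "de", toy_lexicon, fallback)  # fallback leaves it as-is
```

The fallback only runs when the primary lookup fails, and it never invents a lemma: it either normalizes the case or passes the token through.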

Source code in simplemma/strategies/fallback/to_lowercase.py
class ToLowercaseFallbackStrategy(LemmatizationFallbackStrategy):
    """
    ToLowercaseFallbackStrategy is a concrete implementation of the LemmatizationFallbackStrategy protocol.
    It represents a fallback strategy that converts tokens to lowercase for specific languages.
    """

    __slots__ = ["_langs_to_lower"]

    def __init__(self, langs_to_lower: Set[str] = BETTER_LOWER):
        """
        Initialize the ToLowercaseFallbackStrategy with the specified set of languages to convert to lowercase.

        Args:
            langs_to_lower (Set[str]): The set of languages for which tokens should be converted to lowercase.
                Defaults to `BETTER_LOWER`.

        """
        self._langs_to_lower = langs_to_lower

    def get_lemma(self, token: str, lang: str) -> str:
        """
        Convert the token to lowercase if the language is in the set of languages to convert.

        This method is called when the lemma of a token cannot be determined using other lemmatization strategies.
        It converts the token to lowercase if the language is in the set of languages specified during initialization.

        Args:
            token (str): The token for which the lemma could not be determined.
            lang (str): The language of the token.

        Returns:
            str: The lowercase version of the token if the language is in the set of languages to convert,
                 otherwise returns the original token.

        """
        return token.lower() if lang in self._langs_to_lower else token

Functions

__init__(langs_to_lower=BETTER_LOWER)

Initialize the ToLowercaseFallbackStrategy with the specified set of languages to convert to lowercase.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| langs_to_lower | Set[str] | The set of languages for which tokens should be converted to lowercase. | BETTER_LOWER |

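
One consequence of the constructor shown in the source is that it stores the set it is given rather than copying it. The sketch below illustrates this with a hypothetical `LowercaseFallback` class that repeats the same assignment; the language codes are arbitrary examples, not the contents of `BETTER_LOWER`.

```python
from typing import Set


class LowercaseFallback:
    """Hypothetical class repeating the constructor and get_lemma from the listing."""

    __slots__ = ["_langs_to_lower"]

    def __init__(self, langs_to_lower: Set[str]) -> None:
        self._langs_to_lower = langs_to_lower  # keeps a reference, no copy

    def get_lemma(self, token: str, lang: str) -> str:
        return token.lower() if lang in self._langs_to_lower else token


custom_langs = {"es", "pt"}  # arbitrary example codes
strategy = LowercaseFallback(custom_langs)

before = strategy.get_lemma("Haus", "de")  # "de" is not in the set yet
custom_langs.add("de")                     # mutate the caller's set afterwards
after = strategy.get_lemma("Haus", "de")   # the strategy sees the change

# Passing a copy decouples the strategy from later changes to the original set.
isolated = LowercaseFallback(set(custom_langs))
custom_langs.discard("de")
still_lowered = isolated.get_lemma("Haus", "de")
```

If that coupling is not wanted, passing a copy of the set, as in the last lines, is one way to avoid it.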
Source code in simplemma/strategies/fallback/to_lowercase.py
def __init__(self, langs_to_lower: Set[str] = BETTER_LOWER):
    """
    Initialize the ToLowercaseFallbackStrategy with the specified set of languages to convert to lowercase.

    Args:
        langs_to_lower (Set[str]): The set of languages for which tokens should be converted to lowercase.
            Defaults to `BETTER_LOWER`.

    """
    self._langs_to_lower = langs_to_lower
get_lemma(token, lang)

Convert the token to lowercase if the language is in the set of languages to convert.

This method is called when the lemma of a token cannot be determined using other lemmatization strategies. It converts the token to lowercase if the language is in the set of languages specified during initialization.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| token | str | The token for which the lemma could not be determined. | required |
| lang | str | The language of the token. | required |

Returns:

| Type | Description |
| --- | --- |
| str | The lowercase version of the token if the language is in the set of languages to convert, otherwise the original token. |
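
The behavior described above can be checked with a small standalone function. `lowercase_fallback` is a hypothetical helper that reuses the conditional expression from `get_lemma`, and the language set is an arbitrary example. It shows three edge cases that follow from that expression: the result is only case-normalized, not lemmatized; non-ASCII letters are lowercased by `str.lower()`; and language codes are matched exactly, so `"ES"` does not match `"es"`.

```python
from typing import Set


def lowercase_fallback(token: str, lang: str, langs_to_lower: Set[str]) -> str:
    """Hypothetical helper using the same expression as get_lemma in the listing."""
    return token.lower() if lang in langs_to_lower else token


langs = {"es", "pt"}  # arbitrary example codes

plural = lowercase_fallback("Casas", "es", langs)      # lowercased, still plural
accented = lowercase_fallback("ÁRBOL", "es", langs)    # non-ASCII letters are handled
other_lang = lowercase_fallback("Haus", "de", langs)   # "de" not in the set: unchanged
wrong_case = lowercase_fallback("Casas", "ES", langs)  # set membership is case-sensitive
```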

Source code in simplemma/strategies/fallback/to_lowercase.py
def get_lemma(self, token: str, lang: str) -> str:
    """
    Convert the token to lowercase if the language is in the set of languages to convert.

    This method is called when the lemma of a token cannot be determined using other lemmatization strategies.
    It converts the token to lowercase if the language is in the set of languages specified during initialization.

    Args:
        token (str): The token for which the lemma could not be determined.
        lang (str): The language of the token.

    Returns:
        str: The lowercase version of the token if the language is in the set of languages to convert,
             otherwise returns the original token.

    """
    return token.lower() if lang in self._langs_to_lower else token