Prefix Decomposition Strategy
This module defines the PrefixDecompositionStrategy class, a concrete implementation of the LemmatizationStrategy protocol.
It lemmatizes tokens by subword decomposition using predefined prefixes.
Classes
PrefixDecompositionStrategy
Bases: LemmatizationStrategy
A lemmatization strategy that lemmatizes tokens by subword decomposition using predefined prefixes.
It implements the LemmatizationStrategy protocol.
Source code in simplemma/strategies/prefix_decomposition.py
Functions
__init__(known_prefixes=DEFAULT_KNOWN_PREFIXES, dictionary_lookup=DictionaryLookupStrategy())
Initialize the Prefix Decomposition Strategy.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
known_prefixes | Dict[str, Pattern[str]] | A dictionary of known prefixes for various languages. | DEFAULT_KNOWN_PREFIXES |
dictionary_lookup | DictionaryLookupStrategy | The dictionary lookup strategy used to find dictionary forms. | DictionaryLookupStrategy() |
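As a minimal sketch of what a known_prefixes mapping of type Dict[str, Pattern[str]] might look like, the example below builds one with compiled regular expressions. The language codes and prefix lists are illustrative assumptions, not the contents of DEFAULT_KNOWN_PREFIXES.

```python
import re
from typing import Dict, Pattern

# Hypothetical prefix patterns keyed by language code; the real
# DEFAULT_KNOWN_PREFIXES may cover different languages and prefixes.
example_known_prefixes: Dict[str, Pattern[str]] = {
    "de": re.compile(r"^(?:ab|an|auf|aus|ein|vor|zu)"),
    "ru": re.compile(r"^(?:за|на|по|пере)"),
}

# Each pattern is anchored, so it only matches a prefix at the start of a token.
match = example_known_prefixes["de"].match("aufmachen")
prefix = match.group(0) if match else None
```

Given the signature above, such a mapping would presumably be passed as PrefixDecompositionStrategy(known_prefixes=example_known_prefixes).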
Source code in simplemma/strategies/prefix_decomposition.py
get_lemma(token, lang)
Get Lemma using Prefix Decomposition Strategy
This method lemmatizes a token by subword decomposition using predefined prefixes. It first checks whether the language has known prefixes defined. If a known prefix is found at the start of the token, the method splits off the prefix and performs a dictionary lookup on the remaining subword. If a lemma is found for the subword, it returns the prefix concatenated with the lowercased subword. If the language has no known prefixes, no known prefix matches, or no lemma is found for the subword, it returns None.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
token | str | The input token to lemmatize. | required |
lang | str | The language code for the token's language. | required |
Returns:
Type | Description |
---|---|
Optional[str] | The lemma for the token, or None if no lemma is found. |
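The behaviour described for get_lemma can be sketched roughly as follows. This is a self-contained approximation under stated assumptions, not the library's source: KNOWN_PREFIXES, TOY_DICTIONARY, toy_lookup, and get_lemma_sketch are hypothetical stand-ins for the strategy's prefix table and its dictionary lookup strategy.

```python
import re
from typing import Dict, Optional, Pattern

# Hypothetical stand-ins, for illustration only.
KNOWN_PREFIXES: Dict[str, Pattern[str]] = {
    "de": re.compile(r"^(?:ab|an|auf|aus|ein|vor|zu)"),
}
TOY_DICTIONARY: Dict[str, Dict[str, str]] = {
    "de": {"machen": "machen", "gemacht": "machen"},
}


def toy_lookup(token: str, lang: str) -> Optional[str]:
    """Stand-in for the dictionary lookup strategy (assumed interface)."""
    return TOY_DICTIONARY.get(lang, {}).get(token)


def get_lemma_sketch(token: str, lang: str) -> Optional[str]:
    # 1. The language must have known prefixes defined.
    pattern = KNOWN_PREFIXES.get(lang)
    if pattern is None:
        return None
    # 2. A known prefix must occur at the start of the token.
    match = pattern.match(token)
    if match is None:
        return None
    prefix = match.group(0)
    subword = token[len(prefix):]
    # 3. The remaining subword must have a lemma in the dictionary.
    if toy_lookup(subword, lang) is None:
        return None
    # 4. Return the prefix plus the lowercased subword, as described above.
    return prefix + subword.lower()
```

In this sketch, get_lemma_sketch("aufmachen", "de") splits off "auf", finds "machen" in the toy dictionary, and returns "aufmachen"; a token with no known prefix, or a language without prefix patterns, yields None.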
Source code in simplemma/strategies/prefix_decomposition.py