Hyphen Removal Strategy
This module defines the HyphenRemovalStrategy class, which is a concrete implementation of the LemmatizationStrategy protocol.
It provides lemmatization by removing hyphens from tokens and attempting to find dictionary forms.
Classes
HyphenRemovalStrategy
Bases: LemmatizationStrategy
A lemmatization strategy that removes hyphens from tokens and attempts to find dictionary forms for the result.
It implements the LemmatizationStrategy protocol.
Source code in simplemma/strategies/hyphen_removal.py
Functions
__init__(dictionary_lookup=DictionaryLookupStrategy())
Initialize the Hyphen Removal Strategy.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dictionary_lookup | DictionaryLookupStrategy | The dictionary lookup strategy used to find dictionary forms. | DictionaryLookupStrategy() |
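The signature above creates its default with a call expression. As a minimal sketch of what that implies in plain Python, the example below uses two hypothetical stand-in classes, `ToyLookup` and `ToyStrategy`, which are not part of simplemma. It only illustrates that a default written this way is evaluated once and shared, and that passing a lookup explicitly overrides it.

```python
from typing import Dict, Optional, Tuple


class ToyLookup:
    """Hypothetical stand-in for a dictionary lookup strategy."""

    def __init__(self, table: Optional[Dict[Tuple[str, str], str]] = None) -> None:
        # Maps (lang, token) pairs to lemmas; empty when nothing is given.
        self._table = table or {}

    def get_lemma(self, token: str, lang: str) -> Optional[str]:
        return self._table.get((lang, token))


class ToyStrategy:
    """Hypothetical strategy that only stores the lookup it is given."""

    # The default ToyLookup() is created once, when the function is defined,
    # so every instance built without an argument shares the same object.
    def __init__(self, dictionary_lookup: ToyLookup = ToyLookup()) -> None:
        self.dictionary_lookup = dictionary_lookup


default_a = ToyStrategy()
default_b = ToyStrategy()
custom = ToyStrategy(ToyLookup({("en", "emails"): "email"}))

# Both default instances hold the very same lookup object.
print(default_a.dictionary_lookup is default_b.dictionary_lookup)
# The custom instance uses its own table.
print(custom.dictionary_lookup.get_lemma("emails", "en"))
```

Passing a lookup explicitly is the safer choice whenever instances should not share lookup state.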
get_lemma(token, lang)
Get the lemma using the Hyphen Removal Strategy.
This method lemmatizes a hyphenated token in two steps. First, it splits the token on hyphen characters, drops the hyphens, and joins the remaining parts into a candidate for dictionary lookup; if a dictionary form is found, it is returned as the lemma. Otherwise, it looks up only the last part (the text after the last hyphen) in the dictionary; if a lemma is found for that part, it replaces the last part in the token and the modified token is returned as the lemma. If neither lookup succeeds, None is returned.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
token | str | The input token to lemmatize. | required |
lang | str | The language code for the token's language. | required |
Returns:
Type | Description |
---|---|
Optional[str] | The lemma for the token, or None if no lemma is found. |
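The two-step behaviour described for `get_lemma` can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in `hyphen_removal.py`: `TOY_DICTIONARY`, `toy_lookup`, and `hyphen_removal_sketch` are hypothetical names, only the ASCII hyphen is handled, and returning None for tokens without a hyphen or with a trailing hyphen is an assumption.

```python
from typing import Dict, Optional

# Assumption: only the ASCII hyphen is treated as a hyphen character here.
HYPHEN = "-"

# Hypothetical toy dictionary: language code -> {word form: lemma}.
TOY_DICTIONARY: Dict[str, Dict[str, str]] = {
    "en": {"cooperate": "cooperate", "emails": "email", "friends": "friend"},
}


def toy_lookup(token: str, lang: str) -> Optional[str]:
    """Stand-in for a dictionary lookup; returns None when nothing is found."""
    return TOY_DICTIONARY.get(lang, {}).get(token)


def hyphen_removal_sketch(token: str, lang: str) -> Optional[str]:
    parts = token.split(HYPHEN)
    # Assumption: tokens without a hyphen, or ending in one, are not handled.
    if len(parts) < 2 or not parts[-1]:
        return None

    # Step 1: drop the hyphens and look up the joined candidate.
    candidate = "".join(parts)
    lemma = toy_lookup(candidate, lang)
    if lemma is not None:
        return lemma

    # Step 2: look up only the last part and put its lemma back in the token.
    last_lemma = toy_lookup(parts[-1], lang)
    if last_lemma is not None:
        return HYPHEN.join(parts[:-1] + [last_lemma])

    return None


print(hyphen_removal_sketch("e-mails", "en"))         # email
print(hyphen_removal_sketch("school-friends", "en"))  # school-friend
print(hyphen_removal_sketch("foo-bar", "en"))         # None
```

In the second case the joined form "schoolfriends" is not in the toy dictionary, so only the last part is lemmatized and the rest of the token is left unchanged.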