Tokenizer
Tokenizers module. Provides classes and functions for text tokenization.
- Tokenizer: The Protocol class for all tokenizers.
- RegexTokenizer: A tokenizer based on a regular expression.
- simple_tokenizer(): A legacy function that wraps the RegexTokenizer's split_text method.
- TOKREGEX: The regular expression used by default by RegexTokenizer.
Attributes
TOKREGEX = re.compile('(?:(?:[€$¥£+-]?[0-9][0-9.,:%/-]*|St\\.)[\\w_€-]+|https?://[^ ]+|[€$¥£@#§]?\\w[\\w*_-]*|[,;:\\.?!¿¡‽⸮…()\\[\\]–{}—―/‒_“„”⹂‚‘’‛′″‟\'\\"«»‹›<>=+−×÷•·]+)')
module-attribute
The regular expression used by default by RegexTokenizer.
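A quick sketch of what the default pattern matches, assuming `TOKREGEX` can be imported from `simplemma.tokenizer` as the source path suggests; the exact tokens depend on the pattern above:

```python
from simplemma.tokenizer import TOKREGEX  # assumed import path

sample = "A price of €3.50 is listed on https://example.org today."
# The pattern uses only non-capturing groups, so findall() yields full matches.
for token in TOKREGEX.findall(sample):
    print(token)
# Currency amounts like '€3.50' and the URL should come out as single tokens,
# while trailing punctuation such as '.' is split off separately.
```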
Classes
RegexTokenizer
Bases: Tokenizer
Tokenizer that uses a regular expression to split text into tokens. The input text is split with the configured regex pattern, which defaults to TOKREGEX.
Source code in `simplemma/tokenizer.py`, lines 49–71.
Functions
split_text(text)
Split the input text using the specified regex pattern.
Parameters:

Name | Type | Description | Default
--- | --- | --- | ---
`text` | `str` | The input text to tokenize. | *required*
Returns:

Type | Description
--- | ---
`Iterator[str]` | An iterator yielding the individual tokens.
Source code in `simplemma/tokenizer.py`, lines 60–71.
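A minimal usage sketch, assuming the class is importable from `simplemma.tokenizer` and that its constructor accepts an optional compiled pattern defaulting to `TOKREGEX`, as the attribute docs above suggest:

```python
import re

from simplemma.tokenizer import RegexTokenizer  # assumed import path

# Default construction: assumed to fall back to TOKREGEX.
tokenizer = RegexTokenizer()
print(list(tokenizer.split_text("Hello, world!")))
# Expected: words and punctuation as separate tokens, e.g. ['Hello', ',', 'world', '!']

# A custom compiled pattern can be supplied (assumed constructor parameter).
word_only = RegexTokenizer(re.compile(r"\w+"))
print(list(word_only.split_text("Hello, world!")))  # ['Hello', 'world']
```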
Tokenizer
Bases: Protocol
The Protocol class that all tokenizers implement. Tokenizers split a text into individual tokens.
Source code in `simplemma/tokenizer.py`, lines 28–46.
Functions
split_text(text)
abstractmethod
Split the input text into tokens.
Parameters:

Name | Type | Description | Default
--- | --- | --- | ---
`text` | `str` | The input text to tokenize. | *required*
Returns:

Type | Description
--- | ---
`Iterator[str]` | An iterator yielding the individual tokens.
Source code in `simplemma/tokenizer.py`, lines 34–46.
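Since Tokenizer is a Protocol, any class providing a compatible split_text method can be used wherever a tokenizer is expected. A minimal sketch; `WhitespaceTokenizer` and `count_tokens` are hypothetical names used only for illustration:

```python
from typing import Iterator

from simplemma.tokenizer import Tokenizer  # assumed import path

class WhitespaceTokenizer:
    """Hypothetical tokenizer that splits on whitespace only."""

    def split_text(self, text: str) -> Iterator[str]:
        # Yield each whitespace-delimited chunk as a token.
        return iter(text.split())

def count_tokens(tokenizer: Tokenizer, text: str) -> int:
    # Hypothetical helper: accepts any object satisfying the Tokenizer protocol.
    return sum(1 for _ in tokenizer.split_text(text))

print(count_tokens(WhitespaceTokenizer(), "one two three"))  # 3
```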
Functions
simple_tokenizer(text)
Simple regular expression tokenizer. This legacy function takes a string as input and returns a list of tokens; it wraps the RegexTokenizer's split_text method.
Parameters:

Name | Type | Description | Default
--- | --- | --- | ---
`text` | `str` | The input text to tokenize. | *required*
`splitting_regex` | `Pattern[str]` | The regular expression pattern used for tokenization. | `TOKREGEX`
Returns:

Type | Description
--- | ---
`List[str]` | The list of tokens extracted from the input text.
Source code in `simplemma/tokenizer.py`, lines 77–92.
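A usage sketch, assuming the function is importable from `simplemma.tokenizer` as the source path suggests; the tokens shown are illustrative:

```python
from simplemma.tokenizer import simple_tokenizer  # assumed import path

tokens = simple_tokenizer("Hello, world!")
print(tokens)
# Expected: a list such as ['Hello', ',', 'world', '!'], i.e. the same tokens
# RegexTokenizer().split_text() yields, collected into a list.
```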