TokenSampler
Token Sampler module. Provides classes for sampling tokens from text.
- TokenSampler: The Protocol class for all token samplers.
- BaseTokenSampler: An abstract base class that implements tokenization via a Tokenizer, so subclasses only have to implement the sampling strategy.
- MostCommonTokenSampler: A token sampler that selects the most common tokens.
- RelaxedMostCommonTokenSampler: A relaxed version of the most common token sampler.
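As a quick orientation, the sketch below samples the most frequent tokens from a short text. It assumes the classes are importable from simplemma.token_sampler, as the source path shown on this page suggests.

```python
from simplemma.token_sampler import MostCommonTokenSampler

# Keep the five most frequent tokens of the input text.
sampler = MostCommonTokenSampler(sample_size=5)
print(sampler.sample_text("the cat sat on the mat while the dog slept"))
```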
Classes
BaseTokenSampler
Bases: ABC, TokenSampler
BaseTokenSampler is the base class for token samplers. It uses the given Tokenizer to convert a text into tokens. Classes inheriting from BaseTokenSampler only have to implement sample_tokens.
Source code in simplemma/token_sampler.py
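Since subclasses only need to provide sample_tokens, a minimal custom sampler can be sketched as follows. LongTokenSampler is a hypothetical example class, not part of the library.

```python
from typing import Iterable, List

from simplemma.token_sampler import BaseTokenSampler

class LongTokenSampler(BaseTokenSampler):
    """Hypothetical sampler that keeps only tokens of four or more characters."""

    def sample_tokens(self, tokens: Iterable[str]) -> List[str]:
        return [token for token in tokens if len(token) >= 4]

sampler = LongTokenSampler()  # uses the default RegexTokenizer(SPLIT_INPUT)
print(sampler.sample_text("a few considerably longer words"))
```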
Functions
__init__(tokenizer=RegexTokenizer(SPLIT_INPUT))
Initialize the BaseTokenSampler.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| tokenizer | Tokenizer | The tokenizer to use for splitting text into tokens. | RegexTokenizer(SPLIT_INPUT) |
Source code in simplemma/token_sampler.py
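To change how the text is split, pass a different tokenizer. The sketch below reuses the hypothetical LongTokenSampler from above and assumes that RegexTokenizer lives in simplemma.tokenizer and accepts a compiled pattern; both are assumptions, not confirmed API details.

```python
import re

from simplemma.tokenizer import RegexTokenizer  # import path assumed

# Split on bare word characters instead of the default SPLIT_INPUT pattern.
word_tokenizer = RegexTokenizer(re.compile(r"\w+"))
sampler = LongTokenSampler(tokenizer=word_tokenizer)
```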
sample_text(text)
Sample tokens from the input text.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The input text to sample tokens from. | required |

Returns:

| Type | Description |
|---|---|
| List[str] | The sampled tokens. |
Source code in simplemma/token_sampler.py
sample_tokens(tokens)
abstractmethod
Sample tokens from the given iterable of tokens.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| tokens | Iterable[str] | The iterable of tokens to sample from. | required |

Returns:

| Type | Description |
|---|---|
| List[str] | The sampled tokens. |
Source code in simplemma/token_sampler.py
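The two methods differ only in their input: sample_text tokenizes raw text first, while sample_tokens works on an already tokenized iterable. Reusing the hypothetical LongTokenSampler from the sketch above:

```python
sampler = LongTokenSampler()

# Equivalent results from raw text and from pre-tokenized input.
print(sampler.sample_text("some reasonably long words"))
print(sampler.sample_tokens(["some", "reasonably", "long", "words"]))
```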
MostCommonTokenSampler
Bases: BaseTokenSampler
Token sampler that selects the most common tokens.
Source code in simplemma/token_sampler.py
Functions
__init__(tokenizer=RegexTokenizer(SPLIT_INPUT), sample_size=100, capitalized_threshold=0.8)
Initialize the MostCommonTokenSampler.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| tokenizer | Tokenizer | The tokenizer to use for splitting text into tokens. | RegexTokenizer(SPLIT_INPUT) |
| sample_size | int | The number of tokens to sample. | 100 |
| capitalized_threshold | float | The threshold for removing capitalized tokens. Tokens with a frequency greater than this threshold will be removed. | 0.8 |
Source code in simplemma/token_sampler.py
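A construction sketch with the documented parameters; the values are illustrative, not recommendations.

```python
from simplemma.token_sampler import MostCommonTokenSampler

# Sample at most 50 tokens and set the documented default
# capitalization filter explicitly, for illustration.
sampler = MostCommonTokenSampler(sample_size=50, capitalized_threshold=0.8)
```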
sample_tokens(tokens)
Sample tokens from the given iterable of tokens.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| tokens | Iterable[str] | The iterable of tokens to sample from. | required |

Returns:

| Type | Description |
|---|---|
| List[str] | The sampled tokens. |
Source code in simplemma/token_sampler.py
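For example, feeding a token iterable directly returns the most frequent tokens first. Assuming ties are broken by first occurrence, the call below would likely yield ['rose', 'is'].

```python
tokens = ["rose", "is", "a", "rose", "is", "a", "rose"]
sampler = MostCommonTokenSampler(sample_size=2)
print(sampler.sample_tokens(tokens))
```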
RelaxedMostCommonTokenSampler
Bases: MostCommonTokenSampler
Relaxed version of the most common token sampler. This sampler uses a relaxed splitting regex pattern and allows for a larger sample size.
Source code in simplemma/token_sampler.py
Functions
__init__(tokenizer=RegexTokenizer(RELAXED_SPLIT_INPUT), sample_size=1000, capitalized_threshold=0)
Initialize the RelaxedMostCommonTokenSampler.
This is just a MostCommonTokenSampler with a more relaxed regex pattern.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| tokenizer | Tokenizer | The tokenizer to use for splitting text into tokens. | RegexTokenizer(RELAXED_SPLIT_INPUT) |
| sample_size | int | The number of tokens to sample. | 1000 |
| capitalized_threshold | float | The threshold for removing capitalized tokens. Tokens with a frequency greater than this threshold will be removed. | 0 |
Source code in simplemma/token_sampler.py
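Usage mirrors MostCommonTokenSampler; the defaults simply cast a wider net. A sketch under the same import-path assumption as above:

```python
from simplemma.token_sampler import RelaxedMostCommonTokenSampler

# Defaults: relaxed splitting pattern, up to 1000 tokens,
# and no capitalization filter (threshold 0).
sampler = RelaxedMostCommonTokenSampler()
tokens = sampler.sample_text("A short example text with a few words.")
```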
TokenSampler
Bases: Protocol
Abstract base class for token samplers.
Token samplers are used to sample tokens from text.
Source code in simplemma/token_sampler.py
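Because TokenSampler is a Protocol, any class providing these two methods conforms structurally, without inheriting from BaseTokenSampler. FirstNTokenSampler below is a hypothetical illustration.

```python
from typing import Iterable, List

class FirstNTokenSampler:
    """Hypothetical sampler that keeps the first n whitespace-split tokens."""

    def __init__(self, n: int = 10) -> None:
        self._n = n

    def sample_text(self, text: str) -> List[str]:
        return self.sample_tokens(text.split())

    def sample_tokens(self, tokens: Iterable[str]) -> List[str]:
        return list(tokens)[: self._n]
```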
Functions
sample_text(text)
abstractmethod
Sample tokens from the input text.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The input text to sample tokens from. | required |

Returns:

| Type | Description |
|---|---|
| List[str] | The sampled tokens. |
Source code in simplemma/token_sampler.py
sample_tokens(tokens)
abstractmethod
Sample tokens from the given iterable of tokens.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| tokens | Iterable[str] | The iterable of tokens to sample from. | required |

Returns:

| Type | Description |
|---|---|
| List[str] | The sampled tokens. |
Source code in simplemma/token_sampler.py