Tokenizer

Tokenizers module. Provides classes for text tokenization.

Attributes

TOKREGEX = re.compile('(?:(?:[€$¥£+-]?[0-9][0-9.,:%/-]*|St\\.)[\\w_€-]+|https?://[^ ]+|[€$¥£@#§]?\\w[\\w*_-]*|[,;:\\.?!¿¡‽⸮…()\\[\\]–{}—―/‒_“„”⹂‚‘’‛′″‟\'\\"«»‹›<>=+−×÷•·]+)') module-attribute

The regular expression used by default by RegexTokenizer.
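
For illustration, a minimal sketch of applying TOKREGEX directly (assuming it is imported from simplemma.tokenizer; the sample input is arbitrary):

from simplemma.tokenizer import TOKREGEX

# findall returns the full match for each token, since the pattern
# uses only non-capturing groups.
print(TOKREGEX.findall("It costs €3.50, right?"))
# ['It', 'costs', '€3.50', ',', 'right', '?']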

Classes

RegexTokenizer

Bases: Tokenizer

Tokenizer that uses regular expressions to split a text into tokens. This tokenizer splits the input text using the specified regex pattern.

Source code in simplemma/tokenizer.py
class RegexTokenizer(Tokenizer):
    """
    Tokenizer that uses regular expressions to split a text into tokens.
    This tokenizer splits the input text using the specified regex pattern.
    """

    __slots__ = ["_splitting_regex"]

    def __init__(self, splitting_regex: Pattern[str] = TOKREGEX) -> None:
        self._splitting_regex = splitting_regex

    def split_text(self, text: str) -> Iterator[str]:
        """
        Split the input text using the specified regex pattern.

        Args:
            text (str): The input text to tokenize.

        Returns:
            Iterator[str]: An iterator yielding the individual tokens.

        """
        return (match[0] for match in self._splitting_regex.finditer(text))
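
A usage sketch: a RegexTokenizer can be built with the default pattern or with a custom one (the word-only pattern below is purely an illustrative assumption, not part of simplemma):

import re
from simplemma.tokenizer import RegexTokenizer

# Default pattern (TOKREGEX)
tokenizer = RegexTokenizer()
print(list(tokenizer.split_text("Hello, world!")))
# ['Hello', ',', 'world', '!']

# Custom pattern: a simple word-only regex, as an example
word_tokenizer = RegexTokenizer(splitting_regex=re.compile(r"\w+"))
print(list(word_tokenizer.split_text("Hello, world!")))
# ['Hello', 'world']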

Functions

split_text(text)

Split the input text using the specified regex pattern.

Parameters:

    text (str): The input text to tokenize. Required.

Returns:

    Iterator[str]: An iterator yielding the individual tokens.

Source code in simplemma/tokenizer.py
def split_text(self, text: str) -> Iterator[str]:
    """
    Split the input text using the specified regex pattern.

    Args:
        text (str): The input text to tokenize.

    Returns:
        Iterator[str]: An iterator yielding the individual tokens.

    """
    return (match[0] for match in self._splitting_regex.finditer(text))
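
Note that split_text returns a generator, so tokens are produced lazily; a short sketch:

from simplemma.tokenizer import RegexTokenizer

tokenizer = RegexTokenizer()
tokens = tokenizer.split_text("One sentence. Another one.")
for token in tokens:  # tokens are yielded one at a time
    print(token)

# Materialize explicitly when a list is needed:
# list(tokenizer.split_text("One sentence. Another one."))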

Tokenizer

Bases: Protocol

Abstract base class for Tokenizers. Tokenizers are used to split a text into individual tokens.

Source code in simplemma/tokenizer.py
class Tokenizer(Protocol):
    """
    Abstract base class for Tokenizers.
    Tokenizers are used to split a text into individual tokens.
    """

    @abstractmethod
    def split_text(self, text: str) -> Iterator[str]:
        """
        Split the input text into tokens.

        Args:
            text (str): The input text to tokenize.

        Returns:
            Iterator[str]: An iterator yielding the individual tokens.

        """
        raise NotImplementedError
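
Because Tokenizer is a Protocol, any class exposing a compatible split_text method conforms to it without inheriting from it. A minimal sketch (the WhitespaceTokenizer and count_tokens helpers below are hypothetical, not part of simplemma):

from typing import Iterator
from simplemma.tokenizer import Tokenizer

class WhitespaceTokenizer:
    """Hypothetical tokenizer that splits on whitespace only."""

    def split_text(self, text: str) -> Iterator[str]:
        return iter(text.split())

def count_tokens(tokenizer: Tokenizer, text: str) -> int:
    # Any object with a matching split_text method is accepted.
    return sum(1 for _ in tokenizer.split_text(text))

print(count_tokens(WhitespaceTokenizer(), "a b c"))  # 3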

Functions

split_text(text) abstractmethod

Split the input text into tokens.

Parameters:

    text (str): The input text to tokenize. Required.

Returns:

    Iterator[str]: An iterator yielding the individual tokens.

Source code in simplemma/tokenizer.py
@abstractmethod
def split_text(self, text: str) -> Iterator[str]:
    """
    Split the input text into tokens.

    Args:
        text (str): The input text to tokenize.

    Returns:
        Iterator[str]: An iterator yielding the individual tokens.

    """
    raise NotImplementedError

Functions

simple_tokenizer(text)

Simple regular expression tokenizer.

This function takes a string as input and returns a list of tokens.

Parameters:

    text (str): The input text to tokenize. Required.
    splitting_regex (Pattern[str], optional): The regular expression pattern used for tokenization. Defaults to TOKREGEX.

Returns:

    List[str]: The list of tokens extracted from the input text.

Source code in simplemma/tokenizer.py
def simple_tokenizer(text: str) -> List[str]:
    """
    Simple regular expression tokenizer.

    This function takes a string as input and returns a list of tokens.

    Args:
        text (str): The input text to tokenize.
        splitting_regex (Pattern[str], optional): The regular expression pattern used for tokenization.
            Defaults to `TOKREGEX`.

    Returns:
        List[str]: The list of tokens extracted from the input text.

    """
    return list(_legacy_tokenizer.split_text(text))
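
A usage sketch of simple_tokenizer (the input is arbitrary; the comment shows the expected token list for it):

from simplemma.tokenizer import simple_tokenizer

print(simple_tokenizer("Lemmatization is fun!"))
# ['Lemmatization', 'is', 'fun', '!']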