Tokenizer

Tokenizers module. Provides classes for text tokenization.

Attributes

TOKREGEX = re.compile('(?:(?:[€$¥£+-]?[0-9][0-9.,:%/-]*|St\\.)[\\w_€-]+|https?://[^ ]+|[€$¥£@#§]?\\w[\\w*_-]*|[,;:\\.?!¿¡‽⸮…()\\[\\]–{}—―/‒_“„”⹂‚‘’‛′″‟\'\\"«»‹›<>=+−×÷•·]+)') module-attribute

The regular expression that RegexTokenizer uses by default.
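The pattern is an alternation, so the order of its branches decides which one claims a given stretch of text. The sketch below illustrates this with a reduced stand-in pattern (`REDUCED_TOKREGEX` is a hypothetical name, not part of simplemma) that keeps only the URL, word, and punctuation branches in their original order, with a shortened punctuation class.

```python
import re

# Reduced stand-in for TOKREGEX (an assumption made for readability):
# only three of the documented alternatives are kept, in the same order.
REDUCED_TOKREGEX = re.compile(
    r"https?://[^ ]+"          # URLs, up to the next space
    r"|[€$¥£@#§]?\w[\w*_-]*"   # words, optionally prefixed by one symbol
    r"|[,;:.?!()]+"            # runs of a few punctuation marks
)

text = "See https://example.org for #nlp tips!"

# Each non-overlapping match is one token; unmatched text (here, spaces)
# is skipped.
tokens = [match[0] for match in REDUCED_TOKREGEX.finditer(text)]
print(tokens)
# ['See', 'https://example.org', 'for', '#nlp', 'tips', '!']
```

Because the URL branch comes before the word branch, the address is kept as a single token rather than being cut at the colon.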

Classes

RegexTokenizer

Bases: Tokenizer

Tokenizer that uses a regular expression to extract tokens from a text. Each non-overlapping match of the pattern becomes one token; text that falls between matches is discarded.

Source code in simplemma/tokenizer.py
class RegexTokenizer(Tokenizer):
    """
    Tokenizer that uses regular expressions to split a text into tokens.
    This tokenizer splits the input text using the specified regex pattern.
    """

    __slots__ = ["_splitting_regex"]

    def __init__(self, splitting_regex: re.Pattern[str] = TOKREGEX) -> None:
        self._splitting_regex = splitting_regex

    def split_text(self, text: str) -> Iterator[str]:
        """
        Split the input text using the specified regex pattern.

        Args:
            text (str): The input text to tokenize.

        Returns:
            Iterator[str]: An iterator yielding the individual tokens.

        """
        return (match[0] for match in self._splitting_regex.finditer(text))

Functions

split_text(text)

Split the input text using the specified regex pattern.

Parameters:

Name  Type  Description                  Default
text  str   The input text to tokenize.  required

Returns:

Type           Description
Iterator[str]  An iterator yielding the individual tokens.

Source code in simplemma/tokenizer.py
def split_text(self, text: str) -> Iterator[str]:
    """
    Split the input text using the specified regex pattern.

    Args:
        text (str): The input text to tokenize.

    Returns:
        Iterator[str]: An iterator yielding the individual tokens.

    """
    return (match[0] for match in self._splitting_regex.finditer(text))
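As a rough illustration of how the constructor argument changes the result, the sketch below passes a custom pattern to the tokenizer. To keep it runnable without simplemma installed, it uses a local stand-in class that copies the two methods shown above; `WORDS_ONLY` is a hypothetical pattern chosen for the example.

```python
import re
from typing import Iterator


class RegexTokenizer:
    """Local stand-in that mirrors the class shown above."""

    __slots__ = ["_splitting_regex"]

    def __init__(self, splitting_regex: re.Pattern[str]) -> None:
        self._splitting_regex = splitting_regex

    def split_text(self, text: str) -> Iterator[str]:
        # One token per non-overlapping match of the pattern.
        return (match[0] for match in self._splitting_regex.finditer(text))


# Hypothetical pattern: keep runs of word characters, drop everything else.
WORDS_ONLY = re.compile(r"\w+")

tokenizer = RegexTokenizer(WORDS_ONLY)
words = list(tokenizer.split_text("Hello, world!"))
print(words)  # ['Hello', 'world']
```

Since the pattern has no branch for punctuation, the comma and the exclamation mark are dropped instead of being returned as tokens.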

Tokenizer

Bases: Protocol

Abstract base class for Tokenizers. Tokenizers are used to split a text into individual tokens.

Source code in simplemma/tokenizer.py
class Tokenizer(Protocol):
    """
    Abstract base class for Tokenizers.
    Tokenizers are used to split a text into individual tokens.
    """

    __slots__ = ()

    @abstractmethod
    def split_text(self, text: str) -> Iterator[str]:
        """
        Split the input text into tokens.

        Args:
            text (str): The input text to tokenize.

        Returns:
            Iterator[str]: An iterator yielding the individual tokens.

        """
        raise NotImplementedError

Functions

split_text(text) abstractmethod

Split the input text into tokens.

Parameters:

Name  Type  Description                  Default
text  str   The input text to tokenize.  required

Returns:

Type           Description
Iterator[str]  An iterator yielding the individual tokens.

Source code in simplemma/tokenizer.py
@abstractmethod
def split_text(self, text: str) -> Iterator[str]:
    """
    Split the input text into tokens.

    Args:
        text (str): The input text to tokenize.

    Returns:
        Iterator[str]: An iterator yielding the individual tokens.

    """
    raise NotImplementedError
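Because the base is a `typing.Protocol`, any object with a matching `split_text` method should satisfy it structurally, without inheriting from it. The sketch below shows the idea with a simplified local copy of the protocol; `WhitespaceTokenizer` and `count_tokens` are hypothetical names introduced only for this example.

```python
from typing import Iterator, Protocol


class Tokenizer(Protocol):
    """Simplified local copy of the protocol shown above."""

    def split_text(self, text: str) -> Iterator[str]:
        ...


class WhitespaceTokenizer:
    """Hypothetical tokenizer: splits on runs of whitespace."""

    def split_text(self, text: str) -> Iterator[str]:
        return iter(text.split())


def count_tokens(tokenizer: Tokenizer, text: str) -> int:
    # Accepts anything that provides split_text(text) -> Iterator[str].
    return sum(1 for _ in tokenizer.split_text(text))


n_tokens = count_tokens(WhitespaceTokenizer(), "one two  three")
print(n_tokens)  # 3
```

A static type checker would accept `WhitespaceTokenizer` wherever a `Tokenizer` is expected, since the method signatures match.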

Functions

simple_tokenizer(text)

Simple regular expression tokenizer.

This function takes a string as input and returns a list of tokens.

Parameters:

Name  Type  Description                  Default
text  str   The input text to tokenize.  required

Returns:

Type       Description
list[str]  The list of tokens extracted from the input text.

Source code in simplemma/tokenizer.py
def simple_tokenizer(text: str) -> list[str]:
    """
    Simple regular expression tokenizer.

    This function takes a string as input and returns a list of tokens.

    Args:
        text (str): The input text to tokenize.

    Returns:
        list[str]: The list of tokens extracted from the input text.

    """
    return list(_legacy_tokenizer.split_text(text))
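The practical difference from `split_text` is that the result is a fully built list rather than a one-pass iterator. The sketch below contrasts the two using local stand-ins: the function names mirror those above, but `_PATTERN` is a hypothetical, much simpler pattern than TOKREGEX, and nothing here is imported from simplemma.

```python
import re
from typing import Iterator

# Hypothetical stand-in pattern: word runs, or runs of other
# non-space characters.
_PATTERN = re.compile(r"\w+|[^\w\s]+")


def split_text(text: str) -> Iterator[str]:
    # Lazy: tokens are produced one at a time, on demand.
    return (match[0] for match in _PATTERN.finditer(text))


def simple_tokenizer(text: str) -> list[str]:
    # Eager: consume the iterator and return every token at once.
    return list(split_text(text))


sentence = "Lorem ipsum, dolor sit amet."

tokens = simple_tokenizer(sentence)
print(len(tokens), tokens[-1])  # a list supports len() and indexing

lazy = split_text(sentence)
first = next(lazy)       # pulls only the first token
remaining = list(lazy)   # the iterator continues where it left off
```

The list form is convenient for short inputs; the iterator form avoids holding every token of a long text in memory at the same time.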