ailia_tokenizer package

Classes

class ailia_tokenizer.AiliaTokenizerResult(input_ids, attention_mask, sequence_ids, word_ids, char_starts, char_ends)

Bases: object

char_to_token(batch_or_char_index: int, char_index=None)

Return the token index corresponding to a character position.

Equivalent to HuggingFace’s EncodingFast.char_to_token().

Parameters:
  • batch_or_char_index (int) – If char_index is None, interpreted as char_index (batch=0). Otherwise, interpreted as batch index.

  • char_index (int, optional) – Character position within the text.

Returns:

Token index corresponding to the specified character position.

Return type:

int or None

Raises:

AiliaTokenizerError – If the tokenizer does not support word mappings.

Examples

>>> res = tok.encode_plus("Hello world!")
>>> res.char_to_token(6)
2
items()

Return available key-value pairs as a dictionary view.

Returns:

Iterable object containing key-value pairs (excluding internal fields).

Return type:

dict_items

Examples

>>> res = tok.encode_plus("hello world")
>>> dict(res.items())
{'input_ids': array([...]), 'attention_mask': array([...])}
keys()

Return available key names excluding internal attributes.

Returns:

Iterable object containing available field names, excluding _sequence_ids, _word_ids, _char_starts, and _char_ends.

Return type:

dict_keys

Examples

>>> res = tok.encode_plus("This is a test.")
>>> list(res.keys())
['input_ids', 'attention_mask']
sequence_ids(batch)

Return sequence group IDs for a given batch.

Parameters:

batch (int) – Index of the batch instance.

Returns:

Sequence IDs for each token. 0 indicates tokens from text, 1 from text_pair.

Return type:

list of int or None

Examples

>>> res = tok("Hello", "World", return_tensors="np")
>>> res.sequence_ids(0)
[0, 0, 0, 1, 1]
token_to_word(batch_or_token_index, token_index=None)

Return the word index corresponding to a token index.

Equivalent to HuggingFace’s EncodingFast.token_to_word().

Parameters:
  • batch_or_token_index (int) – If token_index is None, interpreted as token index (batch 0). Otherwise, interpreted as batch index.

  • token_index (int, optional) – Token index to map to a word index.

Returns:

Word index corresponding to token index or None if not applicable.

Return type:

int or None

Raises:

AiliaTokenizerError – If the tokenizer does not support word mapping.

Examples

>>> res = tok.encode_plus("beautiful sunshine")
>>> res.token_to_word(3)
1
word_ids(batch_index=0)

Return mapping from token index to word index.

Parameters:

batch_index (int, optional, default=0) – Batch element index to retrieve mapping for.

Returns:

A list in which each element is the word index associated with the corresponding token. May contain None for special or non-word tokens.

Return type:

list[Optional[int]]

Raises:

AiliaTokenizerError – If word ID mapping is not supported for this tokenizer.

Examples

>>> res = tok.encode_plus("The quick brown fox.")
>>> res.word_ids(0)
[0, 1, 2, 3, None]  # (None for special token like [CLS]/[SEP])
word_to_chars(batch_or_word_index, word_index=None, sequence_index=None)

Return the character span (start, end) for a given word index.

Equivalent to HuggingFace’s EncodingFast.word_to_chars().

Parameters:
  • batch_or_word_index (int) – If word_index is None, interpreted as word index (batch 0). Otherwise, interpreted as batch index.

  • word_index (int, optional) – Word index for which to retrieve character range.

  • sequence_index (int, optional) – Sequence group ID (0 for first sequence, 1 for pair).

Returns:

Character start/end positions for the specified word. Returns None if the word is not found.

Return type:

CharSpan namedtuple with fields ‘start’ and ‘end’, or None

Raises:

AiliaTokenizerError – If the tokenizer does not support word mappings.

Examples

>>> res = tok.encode_plus("Nice weather today")
>>> res.word_to_chars(2)
CharSpan(start=12, end=17)
class ailia_tokenizer.PreTrainedTokenizer

Bases: object

Base class compatible with the HuggingFace Transformers Tokenizer API.

This class provides common preprocessing, encoding, decoding, padding, and truncation behaviors for various tokenizer models implemented using the ailia Tokenizer backend.
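
Typical usage, sketched under the assumption that a BERT vocabulary is available in a local ./tokenizer/ directory (the path is illustrative; see the concrete subclasses below for the files each loader expects):

>>> import ailia_tokenizer
>>> # Use a concrete subclass; PreTrainedTokenizer is not instantiated directly.
>>> tok = ailia_tokenizer.BertTokenizer.from_pretrained("./tokenizer/")
>>> ids = tok.encode("This is a test.")  # list[int] when return_tensors is None
>>> text = tok.decode(ids, skip_special_tokens=True)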

add_special_tokens(special_tokens_dict)

Add or configure special tokens (e.g. [PAD], additional tokens).

Equivalent to HuggingFace’s add_special_tokens().

Parameters:

special_tokens_dict (dict) –

Dictionary describing special tokens. Supported keys:
  • "pad_token" : str

  • "additional_special_tokens" : list[str]

Return type:

None

Raises:

AiliaTokenizerError – If an unsupported token type is provided.

Examples

>>> tok = T5Tokenizer.from_pretrained("t5-small")
>>> tok.add_special_tokens({"pad_token": "<pad>"})
>>> tok.add_special_tokens({"additional_special_tokens": ["<extra_id_0>", "<extra_id_1>"]})
batch_decode(sequences: List[List[int]], skip_special_tokens=False) List[str]

Batch decode a list of token ID sequences into strings.

Equivalent to HuggingFace’s Tokenizer.batch_decode().

Parameters:
  • sequences (list[list[int]]) – Sequences of token IDs to decode.

  • skip_special_tokens (bool, default=False) – Remove special tokens from decoded outputs.

Returns:

List of decoded UTF-8 text strings.

Return type:

list[str]
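
Examples

A minimal sketch that round-trips encode() output (the "bert-base-uncased" path follows the other examples in this reference and is assumed to point at locally available tokenizer files):

>>> tok = BertTokenizer.from_pretrained("bert-base-uncased")
>>> ids = [tok.encode("Hello world!"), tok.encode("Another example.")]
>>> tok.batch_decode(ids, skip_special_tokens=True)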

batch_encode_plus(text: str | List[str] | List[List[str]], text_pair=None, padding=True, truncation=True, return_tensors=None, max_length=None, split_special_tokens=False, return_token_type_ids=None, add_special_tokens=True)

Batch encode multiple texts or text pairs.

Parameters:
  • text (list[str] or list[list[str]]) – List of input strings or list of [text, text_pair] pairs.

  • text_pair (list[str], optional) – Optional list of second input sequences.

  • padding (bool or str, default=True) – Padding mode (‘longest’, ‘max_length’, True, or False).

  • truncation (bool or str, default=True) – Truncation strategy to apply if sequences exceed max_length.

  • return_tensors (str, optional) – If ‘np’, returns NumPy arrays with proper shape.

  • max_length (int, optional) – Maximum token length.

  • split_special_tokens (bool, optional) – Whether to split special tokens separately.

  • return_token_type_ids (bool, optional) – Whether to return token type ids.

  • add_special_tokens (bool, default=True) – Include special tokens ([CLS], [SEP], etc.).

Returns:

Structured result with batched encodings.

Return type:

AiliaTokenizerResult or AiliaTokenizerResultWithTokenTypeIds

Notes

  • When text is a single string, each character is treated as a separate batch item. To encode a single sentence as one sequence, use encode() or encode_plus() instead.
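
Examples

A minimal sketch of batched encoding (the model path is an assumption consistent with the other examples in this reference; the resulting shape depends on the inputs and padding settings):

>>> tok = BertTokenizer.from_pretrained("bert-base-uncased")
>>> res = tok.batch_encode_plus(["This is a test.", "Another example."],
...                             padding=True, return_tensors="np")
>>> res["input_ids"].shape  # (batch_size, padded_sequence_length)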

convert_ids_to_tokens(ids: int | List[int]) str | List[str]

Convert token IDs to token strings using loaded vocabulary.

Equivalent to HuggingFace’s convert_ids_to_tokens().

Parameters:

ids (int or list[int]) – Token ID or list of token IDs.

Returns:

Corresponding token(s) from the tokenizer vocabulary.

Return type:

str or list[str]

Raises:

AiliaTokenizerError – If the model is not yet initialized or the vocabulary is not loaded.

Examples

>>> tok = BertTokenizer.from_pretrained("bert-base-uncased")
>>> tok.convert_ids_to_tokens([101, 2023, 2003, 1037, 3231, 102])
['[CLS]', 'this', 'is', 'a', 'test', '[SEP]']
convert_tokens_to_ids(tokens: str | List[str]) int | List[int]

Convert string token(s) to integer token IDs.

Equivalent to HuggingFace’s convert_tokens_to_ids().

Parameters:

tokens (str or list[str]) – Token or list of tokens to convert.

Returns:

Token ID or list of token IDs corresponding to given tokens.

Return type:

int or list[int]

Raises:

AiliaTokenizerError – If the tokenizer vocabulary is not loaded or a token is not found.

Examples

>>> tok = BertTokenizer.from_pretrained("bert-base-uncased")
>>> tok.convert_tokens_to_ids("hello")
7592
>>> tok.convert_tokens_to_ids(["[CLS]", "hello", "[SEP]"])
[101, 7592, 102]
decode(input_ids: List[int], skip_special_tokens=False) str

Decodes a sequence of token IDs into text.

Equivalent to HuggingFace’s Tokenizer.decode().

Parameters:
  • input_ids (list[int]) – Token IDs to decode.

  • skip_special_tokens (bool, default=False) – If True, special tokens (e.g. [CLS], [SEP]) are removed from the output.

Returns:

Decoded UTF-8 text corresponding to token IDs.

Return type:

str

Raises:

AiliaTokenizerError – If the tokenizer is not initialized or decoding fails.
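
Examples

A minimal round-trip sketch (the model path is an assumption consistent with the other examples in this reference):

>>> tok = BertTokenizer.from_pretrained("bert-base-uncased")
>>> ids = tok.encode("This is a test.")
>>> tok.decode(ids, skip_special_tokens=True)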

encode(text: str, text_pair=None, padding=True, truncation=True, return_tensors=None, max_length=None, split_special_tokens=False, return_token_type_ids=None, add_special_tokens=True)

Encodes text into token IDs.

Equivalent to HuggingFace’s Tokenizer.encode().

Parameters:
  • text (str) – Single input string to encode.

  • text_pair (str, optional) – Second input string for paired encoding.

  • padding (bool or str, default=True) – Padding strategy. True/’longest’ for dynamic padding, ‘max_length’ for fixed length, False for no padding.

  • truncation (bool or str, default=True) – Truncation strategy. ‘longest_first’, ‘only_first’, etc.

  • return_tensors (str, optional) – Specify tensor format (‘np’ for NumPy array).

  • max_length (int, optional) – Maximum allowed sequence length for truncation/padding.

  • split_special_tokens (bool, optional, default=False) – Whether to split out special tokens explicitly.

  • return_token_type_ids (bool, optional) – Whether to return token type IDs.

  • add_special_tokens (bool, optional, default=True) – Add special tokens (e.g., [CLS], [SEP]).

Returns:

Encoded integer token IDs representing the input text.

Return type:

list[int] or numpy.ndarray

Raises:

AiliaTokenizerError – If the tokenizer has not been initialized by from_pretrained or invalid parameters are provided.
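
Examples

A minimal sketch (the model path is an assumption; exact IDs depend on the loaded vocabulary):

>>> tok = BertTokenizer.from_pretrained("bert-base-uncased")
>>> ids = tok.encode("This is a test.")  # Python list of token IDs
>>> arr = tok.encode("This is a test.", return_tensors="np",
...                  padding="max_length", max_length=16)  # NumPy array padded to max_length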

encode_plus(text: str, text_pair=None, padding=True, truncation=True, return_tensors=None, max_length=None, split_special_tokens=False, return_token_type_ids=None, add_special_tokens=True)

Encodes a single text or text pair into dictionary-style results.

Equivalent to HuggingFace’s Tokenizer.encode_plus().

Parameters:
  • text (str) – Input sentence text.

  • text_pair (str, optional) – Second sentence text for paired encoding.

  • padding (bool or str, default=True) – Padding strategy. Examples: True, False, ‘max_length’, ‘longest’, ‘do_not_pad’.

  • truncation (bool or str, default=True) – Truncation strategy. Examples: True, ‘longest_first’, ‘only_first’, ‘only_second’.

  • return_tensors (str, optional) – Tensor type to return. ‘np’ returns numpy.ndarray outputs. If None, returns Python list type.

  • max_length (int, optional) – Maximum allowed sequence length.

  • split_special_tokens (bool, default=False) – Whether to split out special tokens during encoding.

  • return_token_type_ids (bool, optional) – Return token_type_ids for paired encodings (default: automatically enabled for some models).

  • add_special_tokens (bool, default=True) – Add model-specific special tokens ([CLS], [SEP], etc.) automatically.

Returns:

Object containing:
  • input_ids : list[int] or ndarray

  • attention_mask : list[int] or ndarray

  • token_type_ids : list[int] or ndarray (optional)

  • sequence_ids : sequence group ids

  • word_ids : corresponding word indices

  • char_starts/ends : character start/end positions

Return type:

AiliaTokenizerResult or AiliaTokenizerResultWithTokenTypeIds

Raises:

AiliaTokenizerError – If invalid arguments are provided or the tokenizer is not properly initialized.

Examples

>>> tok = BertTokenizer.from_pretrained("bert-base-uncased")
>>> res = tok.encode_plus("This is a test.", "Another example.")
>>> res["input_ids"]
array([101, 2023, 2003, 1037, 3231, 1012, 102, ..., 102])
tokenize(text: str) List[str]

Tokenizes an input string into subword string tokens.

Equivalent to HuggingFace’s Tokenizer.tokenize().

Parameters:

text (str) – Input text to tokenize.

Returns:

List of subword tokens.

Return type:

list[str]

Raises:

AiliaTokenizerError – If non-string input is provided or the tokenizer is not loaded.

Examples

>>> tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
>>> tok.tokenize("Hello world!")
['hello', 'world', '!']
class ailia_tokenizer.WhisperTokenizer

Bases: PreTrainedTokenizer

classmethod from_pretrained(pretrained_model_name_or_path=None)

Load pretrained Whisper tokenizer from local path.

Parameters:

pretrained_model_name_or_path (str, optional) – Directory path to the tokenizer configuration files. Expected files:
  • vocab.json

  • merges.txt

  • added_tokens.json

Returns:

Initialized Whisper tokenizer ready for encoding/decoding.

Return type:

WhisperTokenizer

Notes

This method loads the vocabulary and merge configuration compatible with Whisper models. The PAD token uses EOS (id=50257).

Examples

>>> tok = WhisperTokenizer.from_pretrained()
>>> tok.tokenize("This is a test.")
class ailia_tokenizer.CLIPTokenizer

Bases: PreTrainedTokenizer

classmethod from_pretrained(pretrained_model_name_or_path=None)

Load a pretrained CLIP tokenizer.

Parameters:

pretrained_model_name_or_path (str, optional) – Directory path to the tokenizer configuration files, if available. CLIP does not require external files.

Returns:

Initialized CLIP-compatible tokenizer instance.

Return type:

CLIPTokenizer

Notes

  • PAD token uses EOS (ID 49407).

  • Both SOT and EOT tokens are retained.

  • When _retain_sot_replace_to_eot=True, text pairs are concatenated with the SOT token of the pair sequence replaced by EOT.

Examples

>>> tok = CLIPTokenizer.from_pretrained()
>>> tok.tokenize("This is a test.")
['this', 'is', 'a', 'test', '.']
class ailia_tokenizer.XLMRobertaTokenizer

Bases: PreTrainedTokenizer

classmethod from_pretrained(pretrained_model_name_or_path)

Load a pretrained XLM-Roberta tokenizer.

Parameters:

pretrained_model_name_or_path (str) – Directory path that contains ‘sentencepiece.bpe.model’.

Returns:

Initialized tokenizer compatible with XLM-Roberta architecture.

Return type:

XLMRobertaTokenizer

Notes

  • SentencePiece model required (sentencepiece.bpe.model).

  • The PAD token ID is set to 1, as in fairseq.

  • If added_tokens.json exists in the tokenizer folder, it is loaded as well.

Examples

>>> tok = XLMRobertaTokenizer.from_pretrained("./tokenizer/")
>>> ids = tok.encode("This is multilingual.")
class ailia_tokenizer.MarianTokenizer

Bases: PreTrainedTokenizer

classmethod from_pretrained(pretrained_model_name_or_path)

Load a pretrained Marian tokenizer.

Parameters:

pretrained_model_name_or_path (str) – Directory path containing ‘source.spm’.

Returns:

Tokenizer for Marian machine translation models.

Return type:

MarianTokenizer

Notes

  • Uses SentencePiece (‘source.spm’) vocabulary.

  • Does not use SOT tokens (SOT offset = 0).

Examples

>>> tok = MarianTokenizer.from_pretrained("./tokenizer/")
>>> tok.encode("Translate this sentence.")
class ailia_tokenizer.BertJapaneseWordPieceTokenizer

Bases: PreTrainedTokenizer

classmethod from_pretrained(pretrained_model_name_or_path, dict_path)

Load Japanese WordPiece BERT tokenizer.

Parameters:
  • pretrained_model_name_or_path (str) – Directory path containing ‘vocab.txt’.

  • dict_path (str) – Path to MeCab-compatible dictionary file.

Returns:

Fully initialized tokenizer compatible with Japanese BERT WordPiece.

Return type:

BertJapaneseWordPieceTokenizer

Notes

  • Retains EOT, not SOT.

  • Supports word_ids.

Examples

>>> tok = BertJapaneseWordPieceTokenizer.from_pretrained(
...     "./tokenizer/", dict_path="./ipadic/")
>>> tok.tokenize("日本語のテキストを分かち書きします。")
['日本', '語', 'の', 'テキスト', 'を', '分', 'か', 'ち', '書', 'き', 'します', '。']
class ailia_tokenizer.BertJapaneseCharacterTokenizer

Bases: PreTrainedTokenizer

classmethod from_pretrained(pretrained_model_name_or_path, dict_path)

Load Japanese Character BERT tokenizer.

Parameters:
  • pretrained_model_name_or_path (str) – Path containing ‘vocab.txt’.

  • dict_path (str) – Path to character dictionary.

Returns:

Character-level tokenizer for Japanese BERT variants.

Return type:

BertJapaneseCharacterTokenizer

Examples

>>> tok = BertJapaneseCharacterTokenizer.from_pretrained(
...     "./tokenizer/", dict_path="./ipadic/")
class ailia_tokenizer.T5Tokenizer

Bases: PreTrainedTokenizer

classmethod from_pretrained(pretrained_model_name_or_path)

Load pretrained T5 tokenizer.

Parameters:

pretrained_model_name_or_path (str) – Directory path containing ‘spiece.model’.

Returns:

Initialized tokenizer for T5 seq2seq models.

Return type:

T5Tokenizer

Examples

>>> tok = T5Tokenizer.from_pretrained("./tokenizer/")
>>> tok.encode("Translate English to German: Hello world")
class ailia_tokenizer.RobertaTokenizer

Bases: PreTrainedTokenizer

classmethod from_pretrained(pretrained_model_name_or_path)

Load pretrained RoBERTa tokenizer.

Parameters:

pretrained_model_name_or_path (str) – Path containing ‘vocab.json’ and ‘merges.txt’.

Returns:

Fully initialized RoBERTa tokenizer.

Return type:

RobertaTokenizer

Notes

  • Retains both SOT and EOT tokens.

  • Supports word-level positions.

Examples

>>> tok = RobertaTokenizer.from_pretrained("./tokenizer/")
>>> tok.encode("This is a RoBERTa-style sentence.")
class ailia_tokenizer.BertTokenizer

Bases: PreTrainedTokenizer

classmethod from_pretrained(pretrained_model_name_or_path)

Load pretrained BERT tokenizer.

Parameters:

pretrained_model_name_or_path (str) – Path containing ‘vocab.txt’ and ‘tokenizer_config.json’.

Returns:

Fully initialized BERT tokenizer with word boundary support.

Return type:

BertTokenizer

Notes

  • The PAD token ID is determined by encoding "[PAD]".

  • Supports token_type_ids and word_ids.

Examples

>>> tok = BertTokenizer.from_pretrained("./tokenizer/")
>>> ids = tok.encode("A test sentence.")
class ailia_tokenizer.GPT2Tokenizer

Bases: PreTrainedTokenizer

classmethod from_pretrained(pretrained_model_name_or_path)

Load pretrained GPT-2 tokenizer.

Parameters:

pretrained_model_name_or_path (str) – Path containing ‘vocab.json’ and ‘merges.txt’.

Returns:

Fully initialized GPT-2 BPE tokenizer.

Return type:

GPT2Tokenizer

Notes

  • Uses EOS (ID 50256) for padding.

  • GPT-2 architecture does not use SOT/EOT markers.

Examples

>>> tok = GPT2Tokenizer.from_pretrained("./tokenizer/")
>>> ids = tok.encode("Hello GPT-2 world!")
class ailia_tokenizer.LlamaTokenizer

Bases: PreTrainedTokenizer

classmethod from_pretrained(pretrained_model_name_or_path)

Load pretrained LLaMA tokenizer.

Parameters:

pretrained_model_name_or_path (str) – Path containing ‘tokenizer.model’.

Returns:

Initialized SentencePiece-based LLaMA tokenizer.

Return type:

LlamaTokenizer

Notes

  • Uses SentencePiece model.

  • Only the SOT (start-of-text) marker is used; EOT is omitted.

Examples

>>> tok = LlamaTokenizer.from_pretrained("./tokenizer/")
>>> tok.encode("Generate text with LLaMA model.")