ailia_tokenizer package¶
Classes¶
- class ailia_tokenizer.AiliaTokenizerResult(input_ids, attention_mask, sequence_ids, word_ids, char_starts, char_ends)¶
Bases: object
- char_to_token(batch_or_char_index: int, char_index=None)¶
Return the token index corresponding to a character position.
Equivalent to HuggingFace’s EncodingFast.char_to_token().
- Parameters:
batch_or_char_index (int) – If char_index is None, interpreted as char_index (batch=0). Otherwise, interpreted as batch index.
char_index (int, optional) – Character position within the text.
- Returns:
Token index corresponding to the specified character position.
- Return type:
int or None
- Raises:
AiliaTokenizerError – If the tokenizer does not support word mappings.
Examples
>>> res = tok.encode_plus("Hello world!") >>> res.char_to_token(6) 2
- items()¶
Return available key-value pairs as a dictionary view.
- Returns:
Iterable object containing key-value pairs (excluding internal fields).
- Return type:
dict_items
Examples
>>> res = tok.encode_plus("hello world") >>> dict(res.items()) {'input_ids': array([...]), 'attention_mask': array([...])}
- keys()¶
Return available key names excluding internal attributes.
- Returns:
Iterable object containing available field names, excluding _sequence_ids, _word_ids, _char_starts, and _char_ends.
- Return type:
dict_keys
Examples
>>> res = tok.encode_plus("This is a test.") >>> list(res.keys()) ['input_ids', 'attention_mask']
- sequence_ids(batch)¶
Return sequence group IDs for a given batch.
- Parameters:
batch (int) – Index of the batch instance.
- Returns:
Sequence IDs for each token. 0 indicates tokens from text, 1 from text_pair.
- Return type:
list of int or None
Examples
>>> res = tok("Hello", "World", return_tensors="np") >>> res.sequence_ids(0) [0, 0, 0, 1, 1]
- token_to_word(batch_or_token_index, token_index=None)¶
Return the word index corresponding to a token index.
Equivalent to HuggingFace’s EncodingFast.token_to_word().
- Parameters:
batch_or_token_index (int) – If token_index is None, interpreted as token index (batch 0). Otherwise, interpreted as batch index.
token_index (int, optional) – Token index to map to a word index.
- Returns:
Word index corresponding to token index or None if not applicable.
- Return type:
int or None
- Raises:
AiliaTokenizerError – If the tokenizer does not support word mapping.
Examples
>>> res = tok.encode_plus("beautiful sunshine") >>> res.token_to_word(3) 1
- word_ids(batch_index=0)¶
Return mapping from token index to word index.
- Parameters:
batch_index (int, optional, default=0) – Batch element index to retrieve mapping for.
- Returns:
Each element corresponds to the word index associated with that token. May contain None for special or non-word tokens.
- Return type:
list[Optional[int]]
- Raises:
AiliaTokenizerError – If word ID mapping is not supported for this tokenizer.
Examples
>>> res = tok.encode_plus("The quick brown fox.") >>> res.word_ids(0) [0, 1, 2, 3, None] # (None for special token like [CLS]/[SEP])
- word_to_chars(batch_or_word_index, word_index=None, sequence_index=None)¶
Return the character span (start, end) for a given word index.
Equivalent to HuggingFace’s EncodingFast.word_to_chars().
- Parameters:
batch_or_word_index (int) – If word_index is None, interpreted as word index (batch 0). Otherwise, interpreted as batch index.
word_index (int, optional) – Word index for which to retrieve character range.
sequence_index (int, optional) – Sequence group ID (0 for first sequence, 1 for pair).
- Returns:
Character start/end positions for the specified word. Returns None if the word is not found.
- Return type:
namedtuple(CharSpan, [‘start’, ‘end’]) or None
- Raises:
AiliaTokenizerError – If the tokenizer does not support word mappings.
Examples
>>> res = tok.encode_plus("Nice weather today") >>> res.word_to_chars(2) CharSpan(start=12, end=17)
- class ailia_tokenizer.PreTrainedTokenizer¶
Bases: object
Base class compatible with the HuggingFace Transformers Tokenizer API.
This class provides common preprocessing, encoding, decoding, padding, and truncation behaviors for various tokenizer models implemented using the ailia Tokenizer backend.
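Examples
A minimal usage sketch, using BertTokenizer (one of the concrete subclasses documented below) as an illustration; exact token IDs depend on the vocabulary files that are loaded.
>>> tok = BertTokenizer.from_pretrained("bert-base-uncased")
>>> ids = tok.encode("This is a test.")               # list of token IDs
>>> text = tok.decode(ids, skip_special_tokens=True)  # back to plain text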
- add_special_tokens(special_tokens_dict)¶
Add or configure special tokens (e.g. [PAD], additional tokens).
Equivalent to HuggingFace’s add_special_tokens().
- Parameters:
special_tokens_dict (dict) –
- Dictionary describing special tokens. Supported keys:
"pad_token" : str
"additional_special_tokens" : list[str]
- Return type:
None
- Raises:
AiliaTokenizerError – If unsupported token type is provided.
Examples
>>> tok = T5Tokenizer.from_pretrained("t5-small")
>>> tok.add_special_tokens({"pad_token": "<pad>"})
>>> tok.add_special_tokens({"additional_special_tokens": ["<extra_id_0>", "<extra_id_1>"]})
- batch_decode(sequences: List[List[int]], skip_special_tokens=False) → List[str]¶
Batch decode a list of token ID sequences into strings.
Equivalent to HuggingFace’s Tokenizer.batch_decode().
- Parameters:
sequences (list[list[int]]) – Sequences of token IDs to decode.
skip_special_tokens (bool, default=False) – Remove special tokens from decoded outputs.
- Returns:
List of decoded UTF-8 text strings.
- Return type:
list[str]
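Examples
A minimal sketch; the token IDs are taken from the convert_tokens_to_ids and convert_ids_to_tokens examples below, and the decoded strings depend on the loaded vocabulary.
>>> tok = BertTokenizer.from_pretrained("bert-base-uncased")
>>> tok.batch_decode([[101, 7592, 102], [101, 3231, 102]], skip_special_tokens=True)
['hello', 'test']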
- batch_encode_plus(text: str | List[str] | List[List[str]], text_pair=None, padding=True, truncation=True, return_tensors=None, max_length=None, split_special_tokens=False, return_token_type_ids=None, add_special_tokens=True)¶
Batch encode multiple texts or text pairs.
- Parameters:
text (list[str] or list[list[str]]) – List of input strings or list of [text, text_pair] pairs.
text_pair (list[str], optional) – Optional list of second input sequences.
padding (bool or str, default=True) – Padding mode (‘longest’, ‘max_length’, True, or False).
truncation (bool or str, default=True) – Truncation strategy to apply if sequences exceed max_length.
return_tensors (str, optional) – If ‘np’, returns NumPy arrays with proper shape.
max_length (int, optional) – Maximum token length.
split_special_tokens (bool, optional) – Whether to split special tokens separately.
return_token_type_ids (bool, optional) – Whether to return token type ids.
add_special_tokens (bool, default=True) – Include special tokens ([CLS], [SEP], etc.).
- Returns:
Structured result with batched encodings.
- Return type:
AiliaTokenizerResult or AiliaTokenizerResultWithTokenTypeIds
Notes
When text is a single string, each character is treated as a separate item. To encode a single sentence as one sequence, prefer encode() or encode_plus().
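Examples
An illustrative sketch with a BERT tokenizer; the padded shape depends on how the vocabulary splits the inputs, so the exact result may differ.
>>> tok = BertTokenizer.from_pretrained("bert-base-uncased")
>>> res = tok.batch_encode_plus(["hello world", "a test"], return_tensors="np")
>>> res["input_ids"].shape  # (batch size, padded sequence length)
(2, 4)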
- convert_ids_to_tokens(ids: int | List[int]) → str | List[str]¶
Convert token IDs to token strings using loaded vocabulary.
Equivalent to HuggingFace’s convert_ids_to_tokens().
- Parameters:
ids (int or list[int]) – Token ID or list of token IDs.
- Returns:
Corresponding token(s) from the tokenizer vocabulary.
- Return type:
str or list[str]
- Raises:
AiliaTokenizerError – If model not yet initialized or vocab not loaded.
Examples
>>> tok = BertTokenizer.from_pretrained("bert-base-uncased")
>>> tok.convert_ids_to_tokens([101, 2023, 2003, 1037, 3231, 102])
['[CLS]', 'this', 'is', 'a', 'test', '[SEP]']
- convert_tokens_to_ids(tokens: str | List[str]) → int | List[int]¶
Convert string token(s) to integer token IDs.
Equivalent to HuggingFace’s convert_tokens_to_ids().
- Parameters:
tokens (str or list[str]) – Token or list of tokens to convert.
- Returns:
Token ID or list of token IDs corresponding to given tokens.
- Return type:
int or list[int]
- Raises:
AiliaTokenizerError – If tokenizer vocabulary not loaded or token not found.
Examples
>>> tok = BertTokenizer.from_pretrained("bert-base-uncased")
>>> tok.convert_tokens_to_ids("hello")
7592
>>> tok.convert_tokens_to_ids(["[CLS]", "hello", "[SEP]"])
[101, 7592, 102]
- decode(input_ids: List[int], skip_special_tokens=False) → str¶
Decodes a sequence of token IDs into text.
Equivalent to HuggingFace’s Tokenizer.decode().
- Parameters:
input_ids (list[int]) – Token IDs to decode.
skip_special_tokens (bool, default=False) – If True, special tokens (e.g. [CLS], [SEP]) are removed from the output.
- Returns:
Decoded UTF-8 text corresponding to token IDs.
- Return type:
str
- Raises:
AiliaTokenizerError – If tokenizer not initialized or decoding fails.
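Examples
A sketch using the token IDs from the convert_ids_to_tokens example above; output spacing may vary between tokenizers.
>>> tok = BertTokenizer.from_pretrained("bert-base-uncased")
>>> tok.decode([101, 2023, 2003, 1037, 3231, 102], skip_special_tokens=True)
'this is a test'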
- encode(text: str, text_pair=None, padding=True, truncation=True, return_tensors=None, max_length=None, split_special_tokens=False, return_token_type_ids=None, add_special_tokens=True)¶
Encodes text into token IDs.
Equivalent to HuggingFace’s Tokenizer.encode().
- Parameters:
text (str) – Single input string to encode.
text_pair (str, optional) – Second input string for paired encoding.
padding (bool or str, default=True) – Padding strategy. True/’longest’ for dynamic padding, ‘max_length’ for fixed length, False for no padding.
truncation (bool or str, default=True) – Truncation strategy. ‘longest_first’, ‘only_first’, etc.
return_tensors (str, optional) – Specify tensor format (‘np’ for NumPy array).
max_length (int, optional) – Maximum allowed sequence length for truncation/padding.
split_special_tokens (bool, optional, default=False) – Whether to split out special tokens explicitly.
return_token_type_ids (bool, optional) – Whether to return token type IDs.
add_special_tokens (bool, optional, default=True) – Add special tokens (e.g., [CLS], [SEP]).
- Returns:
Encoded integer token IDs representing the input text.
- Return type:
list[int] or numpy.ndarray
- Raises:
AiliaTokenizerError – If the tokenizer has not been initialized by from_pretrained or invalid parameters are provided.
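Examples
A sketch with a BERT tokenizer; the IDs shown assume the bert-base-uncased vocabulary (compare the convert_ids_to_tokens example above).
>>> tok = BertTokenizer.from_pretrained("bert-base-uncased")
>>> tok.encode("this is a test")
[101, 2023, 2003, 1037, 3231, 102]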
- encode_plus(text: str, text_pair=None, padding=True, truncation=True, return_tensors=None, max_length=None, split_special_tokens=False, return_token_type_ids=None, add_special_tokens=True)¶
Encodes a single text or text pair into dictionary-style results.
Equivalent to HuggingFace’s Tokenizer.encode_plus().
- Parameters:
text (str) – Input sentence text.
text_pair (str, optional) – Second sentence text for paired encoding.
padding (bool or str, default=True) – Padding strategy. Examples: True, False, ‘max_length’, ‘longest’, ‘do_not_pad’.
truncation (bool or str, default=True) – Truncation strategy. Examples: True, ‘longest_first’, ‘only_first’, ‘only_second’.
return_tensors (str, optional) – Tensor type to return. ‘np’ returns numpy.ndarray outputs. If None, returns Python list type.
max_length (int, optional) – Maximum allowed sequence length.
split_special_tokens (bool, default=False) – Whether to split out special tokens during encoding.
return_token_type_ids (bool, optional) – Return token_type_ids for paired encodings (default: automatically enabled for some models).
add_special_tokens (bool, default=True) – Add model-specific special tokens ([CLS], [SEP], etc.) automatically.
- Returns:
- Object containing:
input_ids : list[int] or ndarray
attention_mask : list[int] or ndarray
token_type_ids : list[int] or ndarray (optional)
sequence_ids : sequence group ids
word_ids : corresponding word indices
char_starts/ends : character start/end positions
- Return type:
AiliaTokenizerResult or AiliaTokenizerResultWithTokenTypeIds
- Raises:
AiliaTokenizerError – If invalid arguments or tokenizer not properly initialized.
Examples
>>> tok = BertTokenizer.from_pretrained("bert-base-uncased")
>>> res = tok.encode_plus("This is a test.", "Another example.")
>>> res["input_ids"]
array([101, 2023, 2003, 1037, 3231, 1012, 102, ..., 102])
- tokenize(text: str) → List[str]¶
Tokenizes an input string into subword string tokens.
Equivalent to HuggingFace’s Tokenizer.tokenize().
- Parameters:
text (str) – Input text to tokenize.
- Returns:
List of subword tokens.
- Return type:
list[str]
- Raises:
AiliaTokenizerError – If non-string input is provided or tokenizer is not loaded.
Examples
>>> tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
>>> tok.tokenize("Hello world!")
['hello', 'world', '!']
- class ailia_tokenizer.WhisperTokenizer¶
Bases: PreTrainedTokenizer
- classmethod from_pretrained(pretrained_model_name_or_path=None)¶
Load pretrained Whisper tokenizer from local path.
- Parameters:
pretrained_model_name_or_path (str, optional) – Directory path to the tokenizer configuration files. Expected files: vocab.json, merges.txt, added_tokens.json.
- Returns:
Initialized Whisper tokenizer ready for encoding/decoding.
- Return type:
WhisperTokenizer
Notes
This method loads the vocabulary and merges configuration compatible with Whisper models. The PAD token uses EOS (ID 50257).
Examples
>>> tok = WhisperTokenizer.from_pretrained()
>>> tok.tokenize("This is a test.")
- class ailia_tokenizer.CLIPTokenizer¶
Bases: PreTrainedTokenizer
- classmethod from_pretrained(pretrained_model_name_or_path=None)¶
Load a pretrained CLIP tokenizer.
- Parameters:
pretrained_model_name_or_path (str, optional) – Directory path to the tokenizer configuration files, if available. CLIP does not require external files.
- Returns:
Initialized CLIP-compatible tokenizer instance.
- Return type:
CLIPTokenizer
Notes
PAD token uses EOS (ID 49407).
Both SOT and EOT tokens are retained.
When encoding text pairs, the two sequences are concatenated; if _retain_sot_replace_to_eot=True, the SOT token of the pair sequence is replaced with EOT.
Examples
>>> tok = CLIPTokenizer.from_pretrained()
>>> tok.tokenize("This is a test.")
['this', 'is', 'a', 'test', '.']
- class ailia_tokenizer.XLMRobertaTokenizer¶
Bases: PreTrainedTokenizer
- classmethod from_pretrained(pretrained_model_name_or_path)¶
Load a pretrained XLM-Roberta tokenizer.
- Parameters:
pretrained_model_name_or_path (str) – Directory path that contains ‘sentencepiece.bpe.model’.
- Returns:
Initialized tokenizer compatible with XLM-Roberta architecture.
- Return type:
XLMRobertaTokenizer
Notes
SentencePiece model required (sentencepiece.bpe.model).
PAD token ID is set to 1 as used in fairseq.
If added_tokens.json exists in the tokenizer folder, it is loaded as well.
Examples
>>> tok = XLMRobertaTokenizer.from_pretrained("./tokenizer/")
>>> ids = tok.encode("This is multilingual.")
- class ailia_tokenizer.MarianTokenizer¶
Bases: PreTrainedTokenizer
- classmethod from_pretrained(pretrained_model_name_or_path)¶
Load a pretrained Marian tokenizer.
- Parameters:
pretrained_model_name_or_path (str) – Directory path containing ‘source.spm’.
- Returns:
Tokenizer for Marian machine translation models.
- Return type:
MarianTokenizer
Notes
Uses SentencePiece (‘source.spm’) vocabulary.
Does not use SOT tokens (SOT offset = 0).
Examples
>>> tok = MarianTokenizer.from_pretrained("./tokenizer/")
>>> tok.encode("Translate this sentence.")
- class ailia_tokenizer.BertJapaneseWordPieceTokenizer¶
Bases: PreTrainedTokenizer
- classmethod from_pretrained(pretrained_model_name_or_path, dict_path)¶
Load Japanese WordPiece BERT tokenizer.
- Parameters:
pretrained_model_name_or_path (str) – Directory path containing ‘vocab.txt’.
dict_path (str) – Path to MeCab-compatible dictionary file.
- Returns:
Fully initialized tokenizer compatible with Japanese BERT WordPiece.
- Return type:
BertJapaneseWordPieceTokenizer
Notes
Retains EOT, not SOT.
Supports word_ids.
Examples
>>> tok = BertJapaneseWordPieceTokenizer.from_pretrained(
...     "./tokenizer/", dict_path="./ipadic/")
>>> tok.tokenize("日本語のテキストを分かち書きします。")
['日本', '語', 'の', 'テキスト', 'を', '分', 'か', 'ち', '書', 'き', 'します', '。']
- class ailia_tokenizer.BertJapaneseCharacterTokenizer¶
Bases: PreTrainedTokenizer
- classmethod from_pretrained(pretrained_model_name_or_path, dict_path)¶
Load Japanese Character BERT tokenizer.
- Parameters:
pretrained_model_name_or_path (str) – Path containing ‘vocab.txt’.
dict_path (str) – Path to character dictionary.
- Returns:
Character-level tokenizer for Japanese BERT variants.
- Return type:
BertJapaneseCharacterTokenizer
Examples
>>> tok = BertJapaneseCharacterTokenizer.from_pretrained(
...     "./tokenizer/", dict_path="./ipadic/")
- class ailia_tokenizer.T5Tokenizer¶
Bases: PreTrainedTokenizer
- classmethod from_pretrained(pretrained_model_name_or_path)¶
Load pretrained T5 tokenizer.
- Parameters:
pretrained_model_name_or_path (str) – Directory path containing ‘spiece.model’.
- Returns:
Initialized tokenizer for T5 seq2seq models.
- Return type:
T5Tokenizer
Examples
>>> tok = T5Tokenizer.from_pretrained("./tokenizer/")
>>> tok.encode("Translate English to German: Hello world")
- class ailia_tokenizer.RobertaTokenizer¶
Bases: PreTrainedTokenizer
- classmethod from_pretrained(pretrained_model_name_or_path)¶
Load pretrained RoBERTa tokenizer.
- Parameters:
pretrained_model_name_or_path (str) – Path containing ‘vocab.json’ and ‘merges.txt’.
- Returns:
Fully initialized RoBERTa tokenizer.
- Return type:
RobertaTokenizer
Notes
Retains both SOT and EOT tokens.
Supports word-level positions.
Examples
>>> tok = RobertaTokenizer.from_pretrained("./tokenizer/")
>>> tok.encode("This is a RoBERTa-style sentence.")
- class ailia_tokenizer.BertTokenizer¶
Bases: PreTrainedTokenizer
- classmethod from_pretrained(pretrained_model_name_or_path)¶
Load pretrained BERT tokenizer.
- Parameters:
pretrained_model_name_or_path (str) – Path containing ‘vocab.txt’ and ‘tokenizer_config.json’.
- Returns:
Fully initialized BERT tokenizer with word boundary support.
- Return type:
BertTokenizer
Notes
PAD token is determined from “[PAD]” encoding.
Supports token_type_ids and word_ids.
Examples
>>> tok = BertTokenizer.from_pretrained("./tokenizer/")
>>> ids = tok.encode("A test sentence.")
- class ailia_tokenizer.GPT2Tokenizer¶
Bases: PreTrainedTokenizer
- classmethod from_pretrained(pretrained_model_name_or_path)¶
Load pretrained GPT-2 tokenizer.
- Parameters:
pretrained_model_name_or_path (str) – Path containing ‘vocab.json’ and ‘merges.txt’.
- Returns:
Fully initialized GPT-2 BPE tokenizer.
- Return type:
GPT2Tokenizer
Notes
Uses EOS (ID 50256) for padding.
GPT-2 architecture does not use SOT/EOT markers.
Examples
>>> tok = GPT2Tokenizer.from_pretrained("./tokenizer/")
>>> ids = tok.encode("Hello GPT-2 world!")
- class ailia_tokenizer.LlamaTokenizer¶
Bases: PreTrainedTokenizer
- classmethod from_pretrained(pretrained_model_name_or_path)¶
Load pretrained LLaMA tokenizer.
- Parameters:
pretrained_model_name_or_path (str) – Path containing ‘tokenizer.model’.
- Returns:
Initialized SentencePiece-based LLaMA tokenizer.
- Return type:
LlamaTokenizer
Notes
Uses SentencePiece model.
Only the SOT (start of text) marker is used; EOT is omitted.
Examples
>>> tok = LlamaTokenizer.from_pretrained("./tokenizer/")
>>> tok.encode("Generate text with LLaMA model.")