ailia_speech package

Classes

class ailia_speech.AiliaSpeechModel(env_id=-1, num_thread=0, memory_mode=11, task=0, flags=0, callback=None)

Bases: object

set_silent_threshold(silent_threshold, speech_sec, no_speech_sec)

Set the silence threshold. Once the buffer contains at least speech_sec seconds of voiced audio and a silent section then lasts no_speech_sec seconds or longer, the remaining buffer is processed immediately, without waiting for 30 seconds.

Parameters:

silent_threshold (float) – volume threshold, standard value 0.5
speech_sec (float) – speech time, standard value 1.0
no_speech_sec (float) – no_speech time, standard value 1.0
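
The rule above can be illustrated with a small, library-independent sketch. It is a simplified model of the described behavior, not the library's implementation: the per-frame `levels` list, the `frame_sec` frame length, and the function name `should_flush` are all hypothetical, and the scale on which the library measures volume is an assumption.

```python
def should_flush(levels, frame_sec, silent_threshold=0.5,
                 speech_sec=1.0, no_speech_sec=1.0):
    """Return True when the buffered frames satisfy the early-flush rule.

    levels: per-frame volume values (hypothetical scale, compared to the threshold).
    frame_sec: duration of each frame in seconds (hypothetical).
    """
    voiced_frames = 0    # frames at or above the threshold
    trailing_silent = 0  # length of the current run of silent frames
    for level in levels:
        if level >= silent_threshold:
            voiced_frames += 1
            trailing_silent = 0
        else:
            trailing_silent += 1
    enough_speech = voiced_frames * frame_sec >= speech_sec
    enough_silence = trailing_silent * frame_sec >= no_speech_sec
    return enough_speech and enough_silence
```

With the standard values, one second of voiced audio followed by one second of silence triggers the flush; raising no_speech_sec makes the model wait longer before treating a pause as the end of an utterance.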

transcribe(audio_waveform, sampling_rate, lang=None)

Perform speech recognition. Processes the entire audio at once.

Parameters:

audio_waveform (np.ndarray) – PCM data, formatted as either (num_samples) or (channels, num_samples)
sampling_rate (int) – Sampling rate (Hz)
lang (str, optional, default: None) – Language code (ja, en, etc.) (automatic detection if None)

Yields:

dict –

text (str): Recognized speech text
time_stamp_begin (float): Start time (seconds)
time_stamp_end (float): End time (seconds)
speaker_id (int or None): Speaker ID (when diarization is enabled)
language (str): Language code
confidence (float): Confidence level
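
Because transcribe yields one dict per recognized segment, results are typically consumed in a loop. The following is a minimal sketch that uses only the keys listed above. The helper names `format_segment` and `print_transcript` are hypothetical, and `model` is assumed to be an already initialized AiliaSpeechModel subclass instance.

```python
def format_segment(segment):
    """Render one result dict as a single transcript line."""
    speaker = segment.get("speaker_id")
    prefix = "" if speaker is None else f"[speaker {speaker}] "
    return (f"{prefix}{segment['time_stamp_begin']:.2f}-"
            f"{segment['time_stamp_end']:.2f} "
            f"({segment['language']}, {segment['confidence']:.2f}): "
            f"{segment['text']}")


def print_transcript(model, audio_waveform, sampling_rate, lang=None):
    """Iterate over model.transcribe() and print each yielded segment."""
    for segment in model.transcribe(audio_waveform, sampling_rate, lang=lang):
        print(format_segment(segment))
```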

transcribe_step(audio_waveform, sampling_rate, complete, lang=None)

Perform speech recognition. Processes the audio sequentially.

Parameters:

audio_waveform (np.ndarray) – PCM data, formatted as either (num_samples) or (channels, num_samples)
sampling_rate (int) – Sampling rate (Hz)
complete (bool) – True if this is the final audio input. Call transcribe_step once for each new block of input (for example, each block of microphone input), and pass complete=True on the last call to flush the buffer.
lang (str, optional, default: None) – Language code (ja, en, etc.) (automatic detection if None)

Yields:

dict –

text (str): Recognized speech text
time_stamp_begin (float): Start time (seconds)
time_stamp_end (float): End time (seconds)
speaker_id (int or None): Speaker ID (when diarization is enabled)
language (str): Language code
confidence (float): Confidence level
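
The sequential calling pattern can be sketched as follows: the audio is split into fixed-size chunks, each chunk is passed to transcribe_step, and complete is True only for the last one. The helper names `iter_chunks` and `transcribe_in_steps` and the `chunk_sec` parameter are hypothetical. The sketch assumes a mono waveform of shape (num_samples) and an already initialized `model`.

```python
def iter_chunks(waveform, chunk_size):
    """Yield (chunk, complete) pairs; complete is True only for the last chunk."""
    total = len(waveform)
    for start in range(0, total, chunk_size):
        end = start + chunk_size
        yield waveform[start:end], end >= total


def transcribe_in_steps(model, waveform, sampling_rate, chunk_sec=1.0, lang=None):
    """Feed a mono waveform to model.transcribe_step() chunk by chunk."""
    chunk_size = max(1, int(sampling_rate * chunk_sec))
    segments = []
    for chunk, complete in iter_chunks(waveform, chunk_size):
        # Each call may yield zero or more result dicts.
        for segment in model.transcribe_step(chunk, sampling_rate, complete, lang=lang):
            segments.append(segment)
    return segments
```

In a live microphone setting the chunks would come from the capture device rather than from slicing, but the handling of complete stays the same. An empty waveform produces no calls at all in this sketch.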

class ailia_speech.Whisper(env_id=-1, num_thread=0, memory_mode=11, task=0, flags=0, callback=None)

Bases: AiliaSpeechModel

__init__(env_id=-1, num_thread=0, memory_mode=11, task=0, flags=0, callback=None)

Constructor of ailia Speech model instance.

Parameters:

env_id (int, optional, default: ENVIRONMENT_AUTO(-1)) – Environment ID for ailia execution. To retrieve an env_id value, use the ailia.get_environment_count() / ailia.get_environment() pair or ailia.get_gpu_environment_id().
num_thread (int, optional, default: MULTITHREAD_AUTO(0)) – Number of threads. Valid values: MULTITHREAD_AUTO=0 (the system's logical processor count), or 1 to 32.
memory_mode (int, optional, default: 11 (reuse interstage)) – Memory management mode for ailia execution. To retrieve a memory_mode value, use ailia.get_memory_mode().
task (int, optional, default: AILIA_SPEECH_TASK_TRANSCRIBE) – AILIA_SPEECH_TASK_TRANSCRIBE or AILIA_SPEECH_TASK_TRANSLATE
flags (int, optional, default: AILIA_SPEECH_FLAG_NONE) – Reserved
callback (func or None, optional, default: None) – Callback for receiving intermediate result text.

Examples
```
>>> def f_callback(text):
...     print(text)
```

initialize_model(model_path='./', model_type=0, vad_type=0, vad_version='4', diarization_type=None, is_fp16=False)

Initialize and download the model.

Parameters:

model_path (string, optional, default: "./") – Destination for saving the model file
model_type (int, optional, default: AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_TINY) – Type of model. Can be set to AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_TINY, AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_BASE, AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_SMALL, AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_MEDIUM, AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_LARGE, AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_LARGE_V3 or AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_LARGE_V3_TURBO.
vad_type (int, optional, default: AILIA_SPEECH_VAD_TYPE_SILERO) – Type of VAD. Can be set to None or AILIA_SPEECH_VAD_TYPE_SILERO.
vad_version (string, optional, default: "4") – Versions 4, 5, and 6.2 of SileroVAD can be specified.
diarization_type (int, optional, default: None) – Type of diarization. Can be set to None or AILIA_SPEECH_DIARIZATION_TYPE_PYANNOTE_AUDIO. By specifying AILIA_SPEECH_DIARIZATION_TYPE_PYANNOTE_AUDIO, speaker diarization can be enabled. The results of the speaker diarization are stored in speaker_id.
is_fp16 (bool, optional, default: False) – Whether to use an FP16 model.
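
Putting the constructor and initialize_model together, a minimal setup sketch might look like this. The function name `create_whisper` is hypothetical, and the sketch assumes that the AILIA_SPEECH_MODEL_TYPE_* constants are exposed as attributes of the ailia_speech module. The import is placed inside the function so that the file can be loaded even when the package is not installed.

```python
def create_whisper(model_dir="./"):
    """Build a Whisper recognizer, leaving all other arguments at their defaults."""
    import ailia_speech  # imported lazily; requires the ailia_speech package

    model = ailia_speech.Whisper()  # env_id, num_thread, etc. keep their defaults
    model.initialize_model(
        model_path=model_dir,
        # Assumption: the constant is available as a module attribute.
        model_type=ailia_speech.AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_TINY,
    )
    return model
```

The returned object can then be passed to transcribe or transcribe_step. Larger model types trade speed for accuracy, so the tiny model is a reasonable choice for a first smoke test.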

class ailia_speech.SenseVoice(env_id=-1, num_thread=0, memory_mode=11, task=0, flags=0, callback=None)

Bases: AiliaSpeechModel

__init__(env_id=-1, num_thread=0, memory_mode=11, task=0, flags=0, callback=None)

Constructor of ailia Speech model instance.

Parameters:

env_id (int, optional, default: ENVIRONMENT_AUTO(-1)) – Environment ID for ailia execution. To retrieve an env_id value, use the ailia.get_environment_count() / ailia.get_environment() pair or ailia.get_gpu_environment_id().
num_thread (int, optional, default: MULTITHREAD_AUTO(0)) – Number of threads. Valid values: MULTITHREAD_AUTO=0 (the system's logical processor count), or 1 to 32.
memory_mode (int, optional, default: 11 (reuse interstage)) – Memory management mode for ailia execution. To retrieve a memory_mode value, use ailia.get_memory_mode().
task (int, optional, default: AILIA_SPEECH_TASK_TRANSCRIBE) – AILIA_SPEECH_TASK_TRANSCRIBE or AILIA_SPEECH_TASK_TRANSLATE
flags (int, optional, default: AILIA_SPEECH_FLAG_NONE) – Reserved
callback (func or None, optional, default: None) – Callback for receiving intermediate result text.

Examples
```
>>> def f_callback(text):
...     print(text)
```

initialize_model(model_path='./', model_type=10, vad_type=0, vad_version='4', diarization_type=None, is_fp16=False)

Initialize and download the model.

Parameters:

model_path (string, optional, default: "./") – Destination for saving the model file
model_type (int, optional, default: AILIA_SPEECH_MODEL_TYPE_SENSEVOICE_SMALL) – Type of model. Can be set to AILIA_SPEECH_MODEL_TYPE_SENSEVOICE_SMALL.
vad_type (int, optional, default: AILIA_SPEECH_VAD_TYPE_SILERO) – Type of VAD. Can be set to None or AILIA_SPEECH_VAD_TYPE_SILERO.
vad_version (string, optional, default: "4") – Versions 4, 5, and 6.2 of SileroVAD can be specified.
diarization_type (int, optional, default: None) – Type of diarization. Can be set to None or AILIA_SPEECH_DIARIZATION_TYPE_PYANNOTE_AUDIO. By specifying AILIA_SPEECH_DIARIZATION_TYPE_PYANNOTE_AUDIO, speaker diarization can be enabled. The results of the speaker diarization are stored in speaker_id.
is_fp16 (bool, optional, default: False) – Whether to use an FP16 model.
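
When diarization_type is set to AILIA_SPEECH_DIARIZATION_TYPE_PYANNOTE_AUDIO, each yielded dict carries a speaker_id, which makes it straightforward to regroup a transcript by speaker. The following sketch operates on plain result dicts and does not call the library. The helper name `group_by_speaker` is hypothetical.

```python
from collections import defaultdict


def group_by_speaker(segments):
    """Map each speaker_id (or None) to the list of texts attributed to it."""
    grouped = defaultdict(list)
    for segment in segments:
        # speaker_id is None when diarization is disabled.
        grouped[segment.get("speaker_id")].append(segment["text"])
    return dict(grouped)
```

For example, passing the segments collected from transcribe returns a dict keyed by speaker ID. With diarization disabled, all texts end up under the None key.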