ailia_speech package

Classes

class ailia_speech.AiliaSpeechModel(env_id=-1, num_thread=0, memory_mode=11, task=0, flags=0, callback=None)

Bases: object

set_silent_threshold(silent_threshold, speech_sec, no_speech_sec)

Set the silence threshold. Once at least speech_sec seconds of voiced audio have been buffered and a silent section then lasts no_speech_sec seconds or longer, the remaining buffer is processed immediately instead of waiting for the full 30 seconds.

Parameters:
  • silent_threshold (float) – volume threshold, standard value 0.5

  • speech_sec (float) – speech time, standard value 1.0

  • no_speech_sec (float) – no_speech time, standard value 1.0
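
As a minimal sketch, the standard values listed above can be applied through a small helper. The helper name apply_standard_silence_settings is hypothetical; it only assumes that model exposes set_silent_threshold with the parameter names documented here.

```python
def apply_standard_silence_settings(model, silent_threshold=0.5,
                                    speech_sec=1.0, no_speech_sec=1.0):
    """Apply the documented standard values to an already constructed model.

    `model` is assumed to be an instance of an AiliaSpeechModel subclass
    (for example Whisper or SenseVoice). This helper is hypothetical and
    not part of ailia_speech.
    """
    model.set_silent_threshold(
        silent_threshold=silent_threshold,
        speech_sec=speech_sec,
        no_speech_sec=no_speech_sec,
    )
    return model
```

Passing the values by keyword keeps the call readable, since all three arguments are floats and are easy to swap by accident.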

transcribe(audio_waveform, sampling_rate, lang=None)

Perform speech recognition. Processes the entire audio at once.

Parameters:
  • audio_waveform (np.ndarray) – PCM data, formatted as either (num_samples) or (channels, num_samples)

  • sampling_rate (int) – Sampling rate (Hz)

  • lang (str, optional, default: None) – Language code (ja, en, etc.) (automatic detection if None)

Yields:
  dict with the following keys:

  • text (str) – Recognized speech text

  • time_stamp_begin (float) – Start time (seconds)

  • time_stamp_end (float) – End time (seconds)

  • speaker_id (int or None) – Speaker ID (when diarization is enabled)

  • language (str) – Language code

  • confidence (float) – Confidence level
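
The yielded dicts can be consumed with an ordinary for loop. The following is a sketch only: format_segment and print_transcript are hypothetical helper names, and model is assumed to be an initialized Whisper or SenseVoice instance.

```python
def format_segment(segment):
    """Render one yielded dict as a single line of text."""
    line = "[{:.2f} - {:.2f}] ".format(
        segment["time_stamp_begin"], segment["time_stamp_end"]
    )
    # speaker_id is None unless diarization is enabled.
    if segment.get("speaker_id") is not None:
        line += "speaker {}: ".format(segment["speaker_id"])
    return line + segment["text"]


def print_transcript(model, audio_waveform, sampling_rate, lang=None):
    """Run transcribe() over the whole waveform and print each result."""
    for segment in model.transcribe(audio_waveform, sampling_rate, lang=lang):
        print(format_segment(segment))
```

Because transcribe yields its results, nothing is recognized until the loop starts iterating.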

transcribe_step(audio_waveform, sampling_rate, complete, lang=None)

Perform speech recognition. Processes the audio sequentially.

Parameters:
  • audio_waveform (np.ndarray) – PCM data, formatted as either (num_samples) or (channels, num_samples)

  • sampling_rate (int) – Sampling rate (Hz)

  • complete (bool) – True if this is the final audio input. Call transcribe_step each time new microphone input arrives, and set complete to True on the last call to flush the buffer.

  • lang (str, optional, default: None) – Language code (ja, en, etc.) (automatic detection if None)

Yields:
  dict with the following keys:

  • text (str) – Recognized speech text

  • time_stamp_begin (float) – Start time (seconds)

  • time_stamp_end (float) – End time (seconds)

  • speaker_id (int or None) – Speaker ID (when diarization is enabled)

  • language (str) – Language code

  • confidence (float) – Confidence level
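
One way to drive transcribe_step is to feed it chunk by chunk and mark only the last chunk as complete. This is a sketch under assumptions: transcribe_chunks is a hypothetical helper, and chunks is assumed to be a list of waveform arrays that is already available (a live microphone loop would instead decide when to send complete=True).

```python
def transcribe_chunks(model, chunks, sampling_rate, lang=None):
    """Feed audio chunks to transcribe_step and yield every result dict.

    `chunks` is assumed to be a list of waveforms in the format that
    transcribe_step accepts. Only the final chunk is sent with
    complete=True, which flushes the remaining buffer.
    """
    last_index = len(chunks) - 1
    for index, chunk in enumerate(chunks):
        complete = index == last_index
        yield from model.transcribe_step(
            chunk, sampling_rate, complete, lang=lang
        )
```

Results may arrive on any step, not only the last one, so the caller should handle segments as they are yielded.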

class ailia_speech.Whisper(env_id=-1, num_thread=0, memory_mode=11, task=0, flags=0, callback=None)

Bases: AiliaSpeechModel

__init__(env_id=-1, num_thread=0, memory_mode=11, task=0, flags=0, callback=None)

Constructor of ailia Speech model instance.

Parameters:
  • env_id (int, optional, default: ENVIRONMENT_AUTO(-1)) – environment id of ailia execution. To retrieve an env_id value, use the ailia.get_environment_count() / ailia.get_environment() pair or ailia.get_gpu_environment_id().

  • num_thread (int, optional, default: MULTITHREAD_AUTO(0)) – number of threads. Valid values: MULTITHREAD_AUTO=0 (the system's logical processor count), or 1 to 32.

  • memory_mode (int, optional, default: 11 (reuse interstage)) – memory management mode of ailia execution. To retrieve a memory_mode value, use ailia.get_memory_mode().

  • task (int, optional, default: AILIA_SPEECH_TASK_TRANSCRIBE) – AILIA_SPEECH_TASK_TRANSCRIBE or AILIA_SPEECH_TASK_TRANSLATE

  • flags (int, optional, default: AILIA_SPEECH_FLAG_NONE) – Reserved

  • callback (func or None, optional, default: None) – Callback for receiving intermediate result text.

    Examples:

    >>> def f_callback(text):
    ...     print(text)

initialize_model(model_path='./', model_type=0, vad_type=0, vad_version='4', diarization_type=None, is_fp16=False)

Initialize and download the model.

Parameters:
  • model_path (string, optional, default: "./") – Destination for saving the model file

  • model_type (int, optional, default: AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_TINY) – Type of model. Can be set to AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_TINY, AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_BASE, AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_SMALL, AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_MEDIUM, AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_LARGE, AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_LARGE_V3 or AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_LARGE_V3_TURBO.

  • vad_type (int, optional, default: AILIA_SPEECH_VAD_TYPE_SILERO) – Type of VAD. Can be set to None or AILIA_SPEECH_VAD_TYPE_SILERO.

  • vad_version (string, optional, default: "4") – Versions 4, 5, and 6.2 of SileroVAD can be specified.

  • diarization_type (int, optional, default: None) – Type of diarization. Can be set to None or AILIA_SPEECH_DIARIZATION_TYPE_PYANNOTE_AUDIO. By specifying AILIA_SPEECH_DIARIZATION_TYPE_PYANNOTE_AUDIO, speaker diarization can be enabled. The results of the speaker diarization are stored in speaker_id.

  • is_fp16 (bool, optional, default: False) – Whether to use an FP16 model.
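
Putting the constructor and initialize_model together, a minimal setup could look like the sketch below. load_whisper, on_text and whisper_cls are hypothetical names introduced for illustration; every other argument is left at the default shown in the signatures above.

```python
def load_whisper(model_dir="./", on_text=None, whisper_cls=None):
    """Construct a Whisper model and initialize it with default settings.

    `whisper_cls` exists only so the helper can be exercised with a
    stand-in class; when omitted, ailia_speech.Whisper is used.
    """
    if whisper_cls is None:
        # Imported lazily so this helper can be defined even when the
        # package is not installed.
        from ailia_speech import Whisper as whisper_cls
    # on_text receives intermediate result text, as described for `callback`.
    model = whisper_cls(callback=on_text)
    # Model files are saved under model_dir.
    model.initialize_model(model_path=model_dir)
    return model
```

A typical call would be load_whisper("./models/", on_text=print), after which the returned object can be passed to transcribe or transcribe_step.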

class ailia_speech.SenseVoice(env_id=-1, num_thread=0, memory_mode=11, task=0, flags=0, callback=None)

Bases: AiliaSpeechModel

__init__(env_id=-1, num_thread=0, memory_mode=11, task=0, flags=0, callback=None)

Constructor of ailia Speech model instance.

Parameters:
  • env_id (int, optional, default: ENVIRONMENT_AUTO(-1)) – environment id of ailia execution. To retrieve an env_id value, use the ailia.get_environment_count() / ailia.get_environment() pair or ailia.get_gpu_environment_id().

  • num_thread (int, optional, default: MULTITHREAD_AUTO(0)) – number of threads. Valid values: MULTITHREAD_AUTO=0 (the system's logical processor count), or 1 to 32.

  • memory_mode (int, optional, default: 11 (reuse interstage)) – memory management mode of ailia execution. To retrieve a memory_mode value, use ailia.get_memory_mode().

  • task (int, optional, default: AILIA_SPEECH_TASK_TRANSCRIBE) – AILIA_SPEECH_TASK_TRANSCRIBE or AILIA_SPEECH_TASK_TRANSLATE

  • flags (int, optional, default: AILIA_SPEECH_FLAG_NONE) – Reserved

  • callback (func or None, optional, default: None) – Callback for receiving intermediate result text.

    Examples:

    >>> def f_callback(text):
    ...     print(text)

initialize_model(model_path='./', model_type=10, vad_type=0, vad_version='4', diarization_type=None, is_fp16=False)

Initialize and download the model.

Parameters:
  • model_path (string, optional, default: "./") – Destination for saving the model file

  • model_type (int, optional, default: AILIA_SPEECH_MODEL_TYPE_SENSEVOICE_SMALL) – Type of model. Can be set to AILIA_SPEECH_MODEL_TYPE_SENSEVOICE_SMALL.

  • vad_type (int, optional, default: AILIA_SPEECH_VAD_TYPE_SILERO) – Type of VAD. Can be set to None or AILIA_SPEECH_VAD_TYPE_SILERO.

  • vad_version (string, optional, default: "4") – Versions 4, 5, and 6.2 of SileroVAD can be specified.

  • diarization_type (int, optional, default: None) – Type of diarization. Can be set to None or AILIA_SPEECH_DIARIZATION_TYPE_PYANNOTE_AUDIO. By specifying AILIA_SPEECH_DIARIZATION_TYPE_PYANNOTE_AUDIO, speaker diarization can be enabled. The results of the speaker diarization are stored in speaker_id.

  • is_fp16 (bool, optional, default: False) – Whether to use an FP16 model.
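
When diarization_type is set as described above, each yielded dict carries a speaker_id. The sketch below groups recognized text by that field; group_text_by_speaker is a hypothetical helper and works on any iterable of result dicts, such as the output of transcribe.

```python
from collections import defaultdict


def group_text_by_speaker(segments):
    """Collect recognized text per speaker_id, preserving segment order.

    Segments without diarization results end up under the key None.
    """
    grouped = defaultdict(list)
    for segment in segments:
        grouped[segment.get("speaker_id")].append(segment["text"])
    return dict(grouped)
```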