speechbrain.inference.ASR module

Specifies the inference interfaces for Automatic Speech Recognition (ASR) modules.

Authors:
  • Aku Rouhe 2021

  • Peter Plantinga 2021

  • Loren Lugosch 2020

  • Mirco Ravanelli 2020

  • Titouan Parcollet 2021

  • Abdel Heba 2021

  • Andreas Nautsch 2022, 2023

  • Pooneh Mousavi 2023

  • Sylvain de Langen 2023, 2024

  • Adel Moumen 2023, 2024

  • Pradnya Kandarkar 2023

Summary

Classes:

ASRStreamingContext

Streaming metadata, initialized by make_streaming_context() (see there for details on initialization of fields here).

ASRWhisperSegment

A single chunk of audio for Whisper ASR streaming.

EncoderASR

A ready-to-use Encoder ASR model

EncoderDecoderASR

A ready-to-use Encoder-Decoder ASR model

StreamingASR

A ready-to-use, streaming-capable ASR model.

WhisperASR

A ready-to-use Whisper ASR model.

Reference

class speechbrain.inference.ASR.EncoderDecoderASR(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use Encoder-Decoder ASR model

The class can be used either to run only the encoder (encode()) to extract features or to run the entire encoder-decoder model (transcribe()) to transcribe speech. The given YAML must contain the fields specified in the *_NEEDED[] lists.

Parameters:
  • *args (tuple)

  • **kwargs (dict) – Arguments are forwarded to Pretrained parent class.

Example

>>> from speechbrain.inference.ASR import EncoderDecoderASR
>>> tmpdir = getfixture("tmpdir")
>>> asr_model = EncoderDecoderASR.from_hparams(
...     source="speechbrain/asr-crdnn-rnnlm-librispeech",
...     savedir=tmpdir,
... )  
>>> asr_model.transcribe_file("tests/samples/single-mic/example2.flac")  
"MY FATHER HAS REVEALED THE CULPRIT'S NAME"
HPARAMS_NEEDED = ['tokenizer']
MODULES_NEEDED = ['encoder', 'decoder']
transcribe_file(path, **kwargs)[source]

Transcribes the given audiofile into a sequence of words.

Parameters:
  • path (str) – Path to audio file which to transcribe.

  • **kwargs (dict) – Arguments forwarded to load_audio.

Returns:

The audiofile transcription produced by this ASR system.

Return type:

str

encode_batch(wavs, wav_lens)[source]

Encodes the input audio into a sequence of hidden states

The waveforms should already be in the model’s desired format. You can call: normalized = EncoderDecoderASR.normalizer(signal, sample_rate) to get a correctly converted signal in most cases.

Parameters:
  • wavs (torch.Tensor) – Batch of waveforms [batch, time, channels] or [batch, time] depending on the model.

  • wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.

Returns:

The encoded batch

Return type:

torch.Tensor
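
As an illustration (a minimal sketch rather than part of the library's doctests, reusing the asr_model loaded in the example above; the two random waveforms and their lengths are hypothetical), the relative wav_lens for a zero-padded batch can be derived from the original sample counts:

>>> import torch
>>> wav_a, wav_b = torch.randn(16000), torch.randn(8000)  # already at the model's sample rate
>>> wavs = torch.nn.utils.rnn.pad_sequence([wav_a, wav_b], batch_first=True)  # [2, 16000]
>>> wav_lens = torch.tensor([wav_a.shape[0], wav_b.shape[0]]) / wavs.shape[1]  # tensor([1.0, 0.5])
>>> encoded = asr_model.encode_batch(wavs, wav_lens)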

transcribe_batch(wavs, wav_lens)[source]

Transcribes the input audio into a sequence of words

The waveforms should already be in the model’s desired format. You can call: normalized = EncoderDecoderASR.normalizer(signal, sample_rate) to get a correctly converted signal in most cases.

Parameters:
  • wavs (torch.Tensor) – Batch of waveforms [batch, time, channels] or [batch, time] depending on the model.

  • wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.

Returns:

  • list – Each waveform in the batch transcribed.

  • tensor – Each predicted token id.

forward(wavs, wav_lens)[source]

Runs full transcription - note: no gradients through decoding

class speechbrain.inference.ASR.EncoderASR(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use Encoder ASR model

The class can be used either to run only the encoder (encode()) to extract features, or to run the encoder followed by the decoding function (transcribe()) to transcribe speech. The given YAML must contain the fields specified in the *_NEEDED[] lists.

Parameters:
  • *args (tuple)

  • **kwargs (dict) – Arguments are forwarded to Pretrained parent class.

Example

>>> from speechbrain.inference.ASR import EncoderASR
>>> tmpdir = getfixture("tmpdir")
>>> asr_model = EncoderASR.from_hparams(
...     source="speechbrain/asr-wav2vec2-commonvoice-fr",
...     savedir=tmpdir,
... ) 
>>> asr_model.transcribe_file("samples/audio_samples/example_fr.wav") 
HPARAMS_NEEDED = ['tokenizer', 'decoding_function']
MODULES_NEEDED = ['encoder']
set_decoding_function()[source]

Set the decoding function based on the parameters defined in the hyperparameter file.

The decoding function is determined by the decoding_function specified in the hyperparameter file. It can be either a functools.partial object representing a decoding function or an instance of speechbrain.decoders.ctc.CTCBaseSearcher for beam search decoding.

Raises:

ValueError – If the decoding function is neither a functools.partial nor an instance of speechbrain.decoders.ctc.CTCBaseSearcher.

Note:
  • For greedy decoding (functools.partial), the provided decoding_function is assigned directly.

  • For CTCBeamSearcher decoding, an instance of the specified decoding_function is created, and additional parameters are added based on the tokenizer type.
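
As a rough sketch of the greedy case (assuming the ctc_greedy_decode helper from speechbrain.decoders.ctc, as used in standard CTC recipes; the blank index below is a placeholder that must match the model's blank token), the decoding_function is simply a functools.partial, whereas beam search instead names a searcher class such as CTCBeamSearcher that set_decoding_function() instantiates with tokenizer-dependent parameters:

>>> import functools
>>> from speechbrain.decoders.ctc import ctc_greedy_decode
>>> # Greedy case: the partial is assigned directly by set_decoding_function();
>>> # it is later called on the encoder log-probabilities and the relative lengths.
>>> decoding_function = functools.partial(ctc_greedy_decode, blank_id=0)  # blank_id is hypothetical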

transcribe_file(path, **kwargs)[source]

Transcribes the given audiofile into a sequence of words.

Parameters:
  • path (str) – Path to audio file which to transcribe.

  • **kwargs (dict) – Arguments forwarded to load_audio.

Returns:

The audiofile transcription produced by this ASR system.

Return type:

str

encode_batch(wavs, wav_lens)[source]

Encodes the input audio into a sequence of hidden states

The waveforms should already be in the model’s desired format. You can call: normalized = EncoderASR.normalizer(signal, sample_rate) to get a correctly converted signal in most cases.

Parameters:
  • wavs (torch.Tensor) – Batch of waveforms [batch, time, channels] or [batch, time] depending on the model.

  • wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.

Returns:

The encoded batch

Return type:

torch.Tensor

transcribe_batch(wavs, wav_lens)[source]

Transcribes the input audio into a sequence of words

The waveforms should already be in the model’s desired format. You can call: normalized = EncoderASR.normalizer(signal, sample_rate) to get a correctly converted signal in most cases.

Parameters:
  • wavs (torch.Tensor) – Batch of waveforms [batch, time, channels] or [batch, time] depending on the model.

  • wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.

Returns:

  • list – Each waveform in the batch transcribed.

  • tensor – Each predicted token id.
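
The sketch below (illustrative only; the file names are placeholders and asr_model is the EncoderASR instance from the example above) batches two files by loading them with load_audio, zero-padding them with pad_sequence, and passing relative lengths:

>>> import torch
>>> sigs = [asr_model.load_audio(p) for p in ("utt1.wav", "utt2.wav")]  # one waveform per file
>>> lengths = torch.tensor([s.shape[0] for s in sigs])
>>> wavs = torch.nn.utils.rnn.pad_sequence(sigs, batch_first=True)
>>> words, tokens = asr_model.transcribe_batch(wavs, lengths / lengths.max())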

forward(wavs, wav_lens)[source]

Runs the encoder

class speechbrain.inference.ASR.ASRWhisperSegment(start: float, end: float, chunk: Tensor, lang_id: str | None = None, words: str | None = None, tokens: None | List[str] = None, prompt: None | List[str] = None, avg_log_probs: float | None = None, no_speech_prob: float | None = None)[source]

Bases: object

A single chunk of audio for Whisper ASR streaming.

This object is intended to be mutated as streaming progresses and passed across calls to the lower-level APIs such as encode_chunk, decode_chunk, etc.

start

The start time of the audio chunk.

Type:

float

end

The end time of the audio chunk.

Type:

float

chunk

The audio chunk, shape [time, channels].

Type:

torch.Tensor

lang_id

The language identifier associated with the audio chunk.

Type:

str

words

The predicted words for the audio chunk.

Type:

str

tokens

The predicted tokens for the audio chunk.

Type:

List[int]

prompt

The prompt associated with the audio chunk.

Type:

List[str]

avg_log_probs

The average log probability associated with the prediction.

Type:

float

no_speech_prob

The probability of no speech in the audio chunk.

Type:

float

start: float
end: float
chunk: Tensor
lang_id: str | None = None
words: str | None = None
tokens: List[str] | None = None
prompt: List[str] | None = None
avg_log_probs: float | None = None
no_speech_prob: float | None = None
class speechbrain.inference.ASR.WhisperASR(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use Whisper ASR model.

The class can be used to run the entire encoder-decoder Whisper model. The supported tasks are: transcribe, translate, and lang_id. The given YAML must contain the fields specified in the *_NEEDED[] lists.

Parameters:
  • *args (tuple)

  • **kwargs (dict) – Arguments are forwarded to Pretrained parent class.

Example

>>> from speechbrain.inference.ASR import WhisperASR
>>> tmpdir = getfixture("tmpdir")
>>> asr_model = WhisperASR.from_hparams(source="speechbrain/asr-whisper-medium-commonvoice-it", savedir=tmpdir,) 
>>> hyp = asr_model.transcribe_file("speechbrain/asr-whisper-medium-commonvoice-it/example-it.wav")  
>>> hyp  
buongiorno a tutti e benvenuti a bordo
>>> _, probs = asr_model.detect_language_file("speechbrain/asr-whisper-medium-commonvoice-it/example-it.wav")  
>>> print(f"Detected language: {max(probs[0], key=probs[0].get)}")  
Detected language: it
HPARAMS_NEEDED = ['language', 'sample_rate']
MODULES_NEEDED = ['whisper', 'decoder']
TASKS = ['transcribe', 'translate', 'lang_id']
detect_language_file(path: str)[source]

Detects the language of the given audio file. This method only works on input files of 30 seconds or less.

Parameters:

path (str) – Path to audio file which to transcribe.

Returns:

  • language_tokens (torch.Tensor) – The detected language tokens.

  • language_probs (dict) – The probabilities of the detected language tokens.

Raises:

ValueError – If the model doesn’t have language tokens.

detect_language_batch(wav: Tensor)[source]

Detects the language of the given wav Tensor. This method only works on inputs of 30 seconds or less.

Parameters:

wav (torch.tensor) – Batch of waveforms [batch, time, channels].

Returns:

  • language_tokens (torch.Tensor of shape (batch_size,)) – IDs of the most probable language tokens, which appear after the startoftranscript token.

  • language_probs (List[Dict[str, float]]) – list of dictionaries containing the probability distribution over all languages.

Raises:

ValueError – If the model doesn’t have language tokens.

Example

>>> from speechbrain.inference.ASR import WhisperASR
>>> import torchaudio
>>> tmpdir = getfixture("tmpdir")
>>> asr_model = WhisperASR.from_hparams(
...     source="speechbrain/asr-whisper-medium-commonvoice-it",
...     savedir=tmpdir,
... ) 
>>> wav, _ = torchaudio.load("your_audio") 
>>> language_tokens, language_probs = asr_model.detect_language(wav) 
transcribe_file_streaming(path: str, task: str | None = None, initial_prompt: str | None = None, logprob_threshold: float | None = -1.0, no_speech_threshold=0.6, condition_on_previous_text: bool = False, verbose: bool = False, use_torchaudio_streaming: bool = False, chunk_size: int | None = 30, **kwargs)[source]

Transcribes the given audiofile into a sequence of words. This method supports the following tasks: transcribe, translate, and lang_id. It can process an input audio file longer than 30 seconds by splitting it into chunk_size-second segments.

Parameters:
  • path (str) – URI/path to the audio to transcribe. When use_torchaudio_streaming is False, uses SB fetching to allow fetching from HF or a local file. When True, resolves the URI through ffmpeg, as documented in torchaudio.io.StreamReader.

  • task (Optional[str]) – The task to perform. If None, the default task is the one passed in the Whisper model.

  • initial_prompt (Optional[str]) – The initial prompt to condition the model on.

  • logprob_threshold (Optional[float]) – The log probability threshold to continue decoding the current segment.

  • no_speech_threshold (float) – The threshold to skip decoding a segment if its no_speech_prob is higher than this value.

  • condition_on_previous_text (bool) – If True, the model will be conditioned on the last 224 tokens.

  • verbose (bool) – If True, print the transcription of each segment.

  • use_torchaudio_streaming (bool) – Whether the audio file can be loaded in a streaming fashion. If not, transcription is still performed through chunks of audio, but the entire audio file is fetched and loaded at once. This skips the usual fetching method and instead resolves the URI using torchaudio (via ffmpeg).

  • chunk_size (Optional[int]) – The size of the chunks to split the audio into. The default chunk size is 30 seconds which corresponds to the maximal length that the model can process in one go.

  • **kwargs (dict) – Arguments forwarded to load_audio

Yields:

ASRWhisperSegment – A new ASRWhisperSegment instance initialized with the provided parameters.
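
For instance (a sketch reusing the asr_model and audio path from the example above), the generator can be consumed segment by segment to print running timestamps as they are produced:

>>> for seg in asr_model.transcribe_file_streaming(
...     "speechbrain/asr-whisper-medium-commonvoice-it/example-it.wav", task="transcribe"
... ):
...     print(f"[{seg.start:7.2f}s -> {seg.end:7.2f}s] {seg.words}")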

transcribe_file(path: str, task: str | None = None, initial_prompt: str | None = None, logprob_threshold: float | None = -1.0, no_speech_threshold=0.6, condition_on_previous_text: bool = False, verbose: bool = False, use_torchaudio_streaming: bool = False, chunk_size: int | None = 30, **kwargs) List[ASRWhisperSegment][source]

Run the Whisper model using the specified task on the given audio file and return the ASRWhisperSegment objects for each segment.

This method supports the following tasks: transcribe, translate, and lang_id. It can process an input audio file longer than 30 seconds by splitting it into chunk_size-second segments.

Parameters:
  • path (str) – URI/path to the audio to transcribe. When use_torchaudio_streaming is False, uses SB fetching to allow fetching from HF or a local file. When True, resolves the URI through ffmpeg, as documented in torchaudio.io.StreamReader.

  • task (Optional[str]) – The task to perform. If None, the default task is the one passed in the Whisper model. It can be one of the following: transcribe, translate, lang_id.

  • initial_prompt (Optional[str]) – The initial prompt to condition the model on.

  • logprob_threshold (Optional[float]) – The log probability threshold to continue decoding the current segment.

  • no_speech_threshold (float) – The threshold to skip decoding a segment if its no_speech_prob is higher than this value.

  • condition_on_previous_text (bool) – If True, the model will be conditioned on the last 224 tokens.

  • verbose (bool) – If True, print the details of each segment.

  • use_torchaudio_streaming (bool) – Whether the audio file can be loaded in a streaming fashion. If not, transcription is still performed through chunks of audio, but the entire audio file is fetched and loaded at once. This skips the usual fetching method and instead resolves the URI using torchaudio (via ffmpeg).

  • chunk_size (Optional[int]) – The size of the chunks to split the audio into. The default chunk size is 30 seconds which corresponds to the maximal length that the model can process in one go.

  • **kwargs (dict) – Arguments forwarded to load_audio

Returns:

results – A list of ASRWhisperSegment objects, each containing the task result.

Return type:

list
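
A minimal sketch of building a single transcript from the returned list (reusing the asr_model and audio path from the example above):

>>> segments = asr_model.transcribe_file(
...     "speechbrain/asr-whisper-medium-commonvoice-it/example-it.wav", task="transcribe"
... )
>>> transcript = " ".join(seg.words.strip() for seg in segments if seg.words)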

encode_batch(wavs, wav_lens)[source]

Encodes the input audio into a sequence of hidden states

The waveforms should already be in the model’s desired format. You can call: normalized = EncoderDecoderASR.normalizer(signal, sample_rate) to get a correctly converted signal in most cases.

Parameters:
  • wavs (torch.tensor) – Batch of waveforms [batch, time, channels].

  • wav_lens (torch.tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.

Returns:

The encoded batch

Return type:

torch.tensor

transcribe_batch(wavs, wav_lens)[source]

Transcribes the input audio into a sequence of words

The waveforms should already be in the model’s desired format. You can call: normalized = EncoderDecoderASR.normalizer(signal, sample_rate) to get a correctly converted signal in most cases.

Parameters:
  • wavs (torch.tensor) – Batch of waveforms [batch, time, channels].

  • wav_lens (torch.tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.

Returns:

  • list – Each waveform in the batch transcribed.

  • tensor – Each predicted token id.

forward(wavs, wav_lens)[source]

Runs full transcription - note: no gradients through decoding

class speechbrain.inference.ASR.ASRStreamingContext(config: DynChunkTrainConfig, fea_extractor_context: Any, encoder_context: Any, decoder_context: Any, tokenizer_context: List[Any] | None)[source]

Bases: object

Streaming metadata, initialized by make_streaming_context() (see there for details on initialization of fields here).

This object is intended to be mutated: the same object should be passed across calls as streaming progresses (namely when using the lower-level APIs such as encode_chunk()).

Holds some references to opaque streaming contexts, so the context is model-agnostic to an extent.

config: DynChunkTrainConfig

Dynamic chunk training configuration used to initialize the streaming context. Cannot be modified on the fly.

fea_extractor_context: Any

Opaque feature extractor streaming context.

encoder_context: Any

Opaque encoder streaming context.

decoder_context: Any

Opaque decoder streaming context.

tokenizer_context: List[Any] | None

Opaque streaming context for the tokenizer. Initially None. Initialized to a list of tokenizer contexts once batch size can be determined.

class speechbrain.inference.ASR.StreamingASR(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use, streaming-capable ASR model.

Parameters:
  • *args (tuple)

  • **kwargs (dict) – Arguments are forwarded to Pretrained parent class.

Example

>>> from speechbrain.inference.ASR import StreamingASR
>>> from speechbrain.utils.dynamic_chunk_training import DynChunkTrainConfig
>>> tmpdir = getfixture("tmpdir")
>>> asr_model = StreamingASR.from_hparams(source="speechbrain/asr-conformer-streaming-librispeech", savedir=tmpdir,) 
>>> asr_model.transcribe_file("speechbrain/asr-conformer-streaming-librispeech/test-en.wav", DynChunkTrainConfig(24, 8)) 
HPARAMS_NEEDED = ['fea_streaming_extractor', 'make_decoder_streaming_context', 'decoding_function', 'make_tokenizer_streaming_context', 'tokenizer_decode_streaming']
MODULES_NEEDED = ['enc', 'proj_enc']
transcribe_file_streaming(path, dynchunktrain_config: DynChunkTrainConfig, use_torchaudio_streaming: bool = True, **kwargs)[source]

Transcribes the given audio file into a sequence of words in a streaming fashion, meaning that text is yielded from this generator in the form of strings to concatenate.

Parameters:
  • path (str) – URI/path to the audio to transcribe. When use_torchaudio_streaming is False, uses SB fetching to allow fetching from HF or a local file. When True, resolves the URI through ffmpeg, as documented in torchaudio.io.StreamReader.

  • dynchunktrain_config (DynChunkTrainConfig) – Streaming configuration. Sane values and how much time chunks actually represent is model-dependent.

  • use_torchaudio_streaming (bool) – Whether the audio file can be loaded in a streaming fashion. If not, transcription is still performed through chunks of audio, but the entire audio file is fetched and loaded at once. This skips the usual fetching method and instead resolves the URI using torchaudio (via ffmpeg).

  • **kwargs (dict) – Arguments forwarded to load_audio

Yields:

generator of str – An iterator yielding transcribed chunks (strings). There is a yield for every chunk, even if the transcribed string for that chunk is an empty string.
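
For example (a sketch reusing the asr_model, audio path, and DynChunkTrainConfig(24, 8) from the example above), the yielded strings can simply be concatenated as they arrive; some of them may be empty:

>>> config = DynChunkTrainConfig(24, 8)
>>> text = ""
>>> for chunk_text in asr_model.transcribe_file_streaming(
...     "speechbrain/asr-conformer-streaming-librispeech/test-en.wav", config
... ):
...     print(chunk_text, end="")
...     text += chunk_text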

transcribe_file(path, dynchunktrain_config: DynChunkTrainConfig, use_torchaudio_streaming: bool = True)[source]

Transcribes the given audio file into a sequence of words.

Parameters:
  • path (str) – URI/path to the audio to transcribe. When use_torchaudio_streaming is False, uses SB fetching to allow fetching from HF or a local file. When True, resolves the URI through ffmpeg, as documented in torchaudio.io.StreamReader.

  • dynchunktrain_config (DynChunkTrainConfig) – Streaming configuration. Sane values and how much time chunks actually represent is model-dependent.

  • use_torchaudio_streaming (bool) – Whether the audio file can be loaded in a streaming fashion. If not, transcription is still performed through chunks of audio, but the entire audio file is fetched and loaded at once. This skips the usual fetching method and instead resolves the URI using torchaudio (via ffmpeg).

Returns:

The audio file transcription produced by this ASR system.

Return type:

str

make_streaming_context(dynchunktrain_config: DynChunkTrainConfig)[source]

Create a blank streaming context to be passed around for chunk encoding/transcription.

Parameters:

dynchunktrain_config (DynChunkTrainConfig) – Streaming configuration. Sane values and how much time chunks actually represent is model-dependent.

Return type:

ASRStreamingContext

get_chunk_size_frames(dynchunktrain_config: DynChunkTrainConfig) int[source]

Returns the chunk size in actual audio samples, i.e. the exact expected length along the time dimension of an input chunk tensor (as passed to encode_chunk() and similar low-level streaming functions).

Parameters:

dynchunktrain_config (DynChunkTrainConfig) – The streaming configuration to determine the chunk frame count of.

Return type:

int (the chunk size, in audio samples)

encode_chunk(context: ASRStreamingContext, chunk: Tensor, chunk_len: Tensor | None = None)[source]

Encoding of a batch of audio chunks into a batch of encoded sequences. For full speech-to-text offline transcription, use transcribe_batch or transcribe_file. Must be called over a given context in the correct order of chunks over time.

Parameters:
  • context (ASRStreamingContext) – Mutable streaming context object, which must be specified and reused across calls when streaming. You can obtain an initial context by calling asr.make_streaming_context(config).

  • chunk (torch.Tensor) – The tensor for an audio chunk of shape [batch size, time]. The time dimension must strictly match asr.get_chunk_size_frames(config). The waveform is expected to be in the model’s expected format (i.e. the sampling rate must be correct).

  • chunk_len (torch.Tensor, optional) – The relative chunk length tensor of shape [batch size]. This is to be used when the audio in one of the chunks of the batch is ending within this chunk. If unspecified, equivalent to torch.ones((batch_size,)).

Returns:

Encoded output, of a model-dependent shape.

Return type:

torch.Tensor

decode_chunk(context: ASRStreamingContext, x: Tensor) Tuple[List[str], List[List[int]]][source]

Decodes the output of the encoder into tokens and the associated transcription. Must be called over a given context in the correct order of chunks over time.

Parameters:
  • context (ASRStreamingContext) – Mutable streaming context object, which should be the same object that was passed to encode_chunk.

  • x (torch.Tensor) – The output of encode_chunk for a given chunk.

Returns:

  • list of str – Decoded tokens of length batch_size. The decoded strings can be of 0-length.

  • list of list of output token hypotheses – List of length batch_size, each holding a list of tokens of any length >=0.

transcribe_chunk(context: ASRStreamingContext, chunk: Tensor, chunk_len: Tensor | None = None)[source]

Transcription of a batch of audio chunks into transcribed text. Must be called over a given context in the correct order of chunks over time.

Parameters:
  • context (ASRStreamingContext) – Mutable streaming context object, which must be specified and reused across calls when streaming. You can obtain an initial context by calling asr.make_streaming_context(config).

  • chunk (torch.Tensor) – The tensor for an audio chunk of shape [batch size, time]. The time dimension must strictly match asr.get_chunk_size_frames(config). The waveform is expected to be in the model’s expected format (i.e. the sampling rate must be correct).

  • chunk_len (torch.Tensor, optional) – The relative chunk length tensor of shape [batch size]. This is to be used when the audio in one of the chunks of the batch is ending within this chunk. If unspecified, equivalent to torch.ones((batch_size,)).

Returns:

Transcribed string for this chunk, might be of length zero.

Return type:

str
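
To tie the low-level streaming API together, here is a rough sketch (not a library doctest; the synthetic waveform, its 16 kHz sample rate, and the batch size of 1 are assumptions) that feeds a pre-loaded signal chunk by chunk through make_streaming_context(), get_chunk_size_frames(), encode_chunk(), and decode_chunk(), reusing the asr_model from the StreamingASR example above:

>>> import torch
>>> from speechbrain.utils.dynamic_chunk_training import DynChunkTrainConfig
>>> config = DynChunkTrainConfig(24, 8)
>>> context = asr_model.make_streaming_context(config)
>>> chunk_frames = asr_model.get_chunk_size_frames(config)
>>> waveform = torch.randn(4 * 16000)  # hypothetical 4 s of mono audio at the model's rate
>>> pieces = []
>>> for start in range(0, waveform.shape[0], chunk_frames):
...     chunk = waveform[start:start + chunk_frames]
...     valid = chunk.shape[0]
...     if valid < chunk_frames:  # zero-pad the final, shorter chunk
...         chunk = torch.nn.functional.pad(chunk, (0, chunk_frames - valid))
...     chunk_len = torch.tensor([valid / chunk_frames])  # relative length within the chunk
...     enc = asr_model.encode_chunk(context, chunk.unsqueeze(0), chunk_len)
...     words, _ = asr_model.decode_chunk(context, enc)
...     pieces.append(words[0])  # may be empty for some chunks
>>> transcript = "".join(pieces)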