speechbrain.nnet.losses module

Losses for training neural networks.

Authors
  • Mirco Ravanelli 2020

  • Samuele Cornell 2020

  • Hwidong Na 2020

  • Yan Gao 2020

  • Titouan Parcollet 2020

Summary

Classes:

AdditiveAngularMargin

An implementation of Additive Angular Margin (AAM) proposed in the following paper: “Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition” (https://arxiv.org/abs/1906.07317)

AngularMargin

An implementation of Angular Margin (AM) proposed in the following paper: “Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition” (https://arxiv.org/abs/1906.07317)

AutoencoderLoss

An implementation of a standard (non-variational) autoencoder loss

AutoencoderLossDetails

A namedtuple with the details of the autoencoder loss (loss, rec_loss).

ContrastiveLoss

Contrastive loss as used in wav2vec2.

Laplacian

Computes the Laplacian for image-like data

LaplacianVarianceLoss

The Laplacian variance loss - used to penalize blurriness in image-like data, such as spectrograms.

LogSoftmaxWrapper

A wrapper that applies log-softmax to the output of a wrapped function (e.g., AngularMargin) before computing the loss.

PitWrapper

Permutation Invariant Wrapper to allow Permutation Invariant Training (PIT) with existing losses.

VariationalAutoencoderLoss

The Variational Autoencoder loss, with support for length masking

VariationalAutoencoderLossDetails

A namedtuple with the details of the VAE loss (loss, rec_loss, dist_loss, weighted_dist_loss).

Functions:

bce_loss

Computes binary cross-entropy (BCE) loss.

cal_si_snr

Calculate SI-SNR.

cal_snr

Calculate binaural channel SNR.

ce_kd

Simple version of distillation for cross-entropy loss.

classification_error

Computes the classification error at frame or batch level.

compute_length_mask

Computes a length mask for the specified data shape

compute_masked_loss

Compute the true average loss of a set of waveforms of unequal length.

ctc_loss

CTC loss.

ctc_loss_kd

Knowledge distillation for CTC loss.

distance_diff_loss

A loss function that can be used in cases where a model outputs an arbitrary probability distribution for a discrete variable on an interval scale, such as the length of a sequence, and the ground truth is the precise value of the variable from a data sample.

get_mask

Computes a binary mask based on the source lengths.

get_si_snr_with_pitwrapper

This function wraps si_snr calculation with the speechbrain pit-wrapper.

get_snr_with_pitwrapper

This function wraps snr calculation with the speechbrain pit-wrapper.

kldiv_loss

Computes the KL-divergence error at the batch level.

l1_loss

Compute the true l1 loss, accounting for length differences.

mse_loss

Compute the true mean squared error, accounting for length differences.

nll_loss

Computes negative log likelihood loss.

nll_loss_kd

Knowledge distillation for negative log-likelihood loss.

reduce_loss

Performs the specified reduction of the raw loss value

transducer_loss

Transducer loss, see speechbrain/nnet/loss/transducer_loss.py.

truncate

Ensure that predictions and targets are the same length.

Reference

speechbrain.nnet.losses.transducer_loss(logits, targets, input_lens, target_lens, blank_index, reduction='mean', use_torchaudio=True)[source]

Transducer loss, see speechbrain/nnet/loss/transducer_loss.py.

Parameters:
  • logits (torch.Tensor) – Predicted tensor, of shape [batch, maxT, maxU, num_labels].

  • targets (torch.Tensor) – Target tensor, without any blanks, of shape [batch, target_len].

  • input_lens (torch.Tensor) – Length of each utterance.

  • target_lens (torch.Tensor) – Length of each target sequence.

  • blank_index (int) – The location of the blank symbol among the label indices.

  • reduction (str) – Specifies the reduction to apply to the output: ‘mean’ | ‘batchmean’ | ‘sum’.

  • use_torchaudio (bool) – If True, use the Transducer loss implementation from torchaudio; otherwise, use the SpeechBrain Numba implementation.

Return type:

The computed transducer loss.
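
A minimal usage sketch, assuming torchaudio is installed, that input_lens and target_lens are relative lengths (as in the other losses in this module), and that maxU equals the target length plus one:

>>> logits = torch.randn(2, 10, 5, 6).log_softmax(dim=-1).requires_grad_()
>>> targets = torch.randint(1, 6, (2, 4)).int()  # no blanks (blank_index=0)
>>> input_lens = torch.tensor([1.0, 0.8])
>>> target_lens = torch.tensor([1.0, 0.75])
>>> loss = transducer_loss(
...     logits, targets, input_lens, target_lens, blank_index=0
... )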

class speechbrain.nnet.losses.PitWrapper(base_loss)[source]

Bases: Module

Permutation Invariant Wrapper to allow Permutation Invariant Training (PIT) with existing losses.

Permutation invariance is calculated over the sources/classes axis which is assumed to be the rightmost dimension: predictions and targets tensors are assumed to have shape [batch, …, channels, sources].

Parameters:

base_loss (function) – Base loss function, e.g. torch.nn.MSELoss. It is assumed to take two arguments (predictions and targets) and to perform no reduction (if a PyTorch loss is used, the user must specify reduction=”none”).

Example

>>> pit_mse = PitWrapper(nn.MSELoss(reduction="none"))
>>> targets = torch.rand((2, 32, 4))
>>> p = (3, 0, 2, 1)
>>> predictions = targets[..., p]
>>> loss, opt_p = pit_mse(predictions, targets)
>>> loss
tensor([0., 0.])
reorder_tensor(tensor, p)[source]
Parameters:
  • tensor (torch.Tensor) – torch.Tensor to reorder given the optimal permutation, of shape [batch, …, sources].

  • p (list of tuples) – List of optimal permutations, e.g. for batch=2 and n_sources=3: [(0, 1, 2), (0, 2, 1)].

Returns:

reordered – Reordered tensor given permutation p.

Return type:

torch.Tensor
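
Continuing the class example above, the returned permutations can be used to align the predictions with the targets (a small sketch; since the loss above is zero, the reordered predictions should match the targets exactly):

>>> pit_mse = PitWrapper(nn.MSELoss(reduction="none"))
>>> targets = torch.rand((2, 32, 4))
>>> predictions = targets[..., (3, 0, 2, 1)]
>>> loss, opt_p = pit_mse(predictions, targets)
>>> reordered = pit_mse.reorder_tensor(predictions, opt_p)
>>> bool(torch.allclose(reordered, targets))
True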

forward(preds, targets)[source]
Parameters:
  • preds (torch.Tensor) – Network predictions tensor, of shape [batch, channels, …, sources].

  • targets (torch.Tensor) – Target tensor, of shape [batch, channels, …, sources].

Returns:

  • loss (torch.Tensor) – Permutation invariant loss for current examples, tensor of shape [batch]

  • perms (list) – List of indexes for the optimal permutation of the inputs over sources, e.g. [(0, 1, 2), (2, 1, 0)] for three sources and two examples per batch.

speechbrain.nnet.losses.ctc_loss(log_probs, targets, input_lens, target_lens, blank_index, reduction='mean')[source]

CTC loss.

Parameters:
  • log_probs (torch.Tensor) – Predicted tensor, of shape [batch, time, chars].

  • targets (torch.Tensor) – Target tensor, without any blanks, of shape [batch, target_len]

  • input_lens (torch.Tensor) – Length of each utterance.

  • target_lens (torch.Tensor) – Length of each target sequence.

  • blank_index (int) – The location of the blank symbol among the character indexes.

  • reduction (str) – What reduction to apply to the output. ‘mean’, ‘sum’, ‘batch’, ‘batchmean’, ‘none’. See pytorch for ‘mean’, ‘sum’, ‘none’. The ‘batch’ option returns one loss per item in the batch, ‘batchmean’ returns sum / batch size.

Return type:

The computed CTC loss.
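
A minimal usage sketch, assuming (consistently with the other losses in this module) that input_lens and target_lens are relative lengths:

>>> log_probs = torch.randn(2, 16, 5).log_softmax(dim=-1)
>>> targets = torch.randint(1, 5, (2, 6))  # no blanks (blank_index=0)
>>> input_lens = torch.tensor([1.0, 0.9])
>>> target_lens = torch.tensor([1.0, 0.8])
>>> loss = ctc_loss(log_probs, targets, input_lens, target_lens, blank_index=0)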

speechbrain.nnet.losses.l1_loss(predictions, targets, length=None, allowed_len_diff=3, reduction='mean')[source]

Compute the true l1 loss, accounting for length differences.

Parameters:
  • predictions (torch.Tensor) – Predicted tensor, of shape [batch, time, *].

  • targets (torch.Tensor) – Target tensor with the same size as predicted tensor.

  • length (torch.Tensor) – Length of each utterance for computing true error with a mask.

  • allowed_len_diff (int) – Length difference that will be tolerated before raising an exception.

  • reduction (str) – Options are ‘mean’, ‘batch’, ‘batchmean’, ‘sum’. See pytorch for ‘mean’, ‘sum’. The ‘batch’ option returns one loss per item in the batch, ‘batchmean’ returns sum / batch size.

Return type:

The computed L1 loss.

Example

>>> probs = torch.tensor([[0.9, 0.1, 0.1, 0.9]])
>>> l1_loss(probs, torch.tensor([[1., 0., 0., 1.]]))
tensor(0.1000)
speechbrain.nnet.losses.mse_loss(predictions, targets, length=None, allowed_len_diff=3, reduction='mean')[source]

Compute the true mean squared error, accounting for length differences.

Parameters:
  • predictions (torch.Tensor) – Predicted tensor, of shape [batch, time, *].

  • targets (torch.Tensor) – Target tensor with the same size as predicted tensor.

  • length (torch.Tensor) – Length of each utterance for computing true error with a mask.

  • allowed_len_diff (int) – Length difference that will be tolerated before raising an exception.

  • reduction (str) – Options are ‘mean’, ‘batch’, ‘batchmean’, ‘sum’. See pytorch for ‘mean’, ‘sum’. The ‘batch’ option returns one loss per item in the batch, ‘batchmean’ returns sum / batch size.

Return type:

The computed MSE loss.

Example

>>> probs = torch.tensor([[0.9, 0.1, 0.1, 0.9]])
>>> mse_loss(probs, torch.tensor([[1., 0., 0., 1.]]))
tensor(0.0100)
speechbrain.nnet.losses.classification_error(probabilities, targets, length=None, allowed_len_diff=3, reduction='mean')[source]

Computes the classification error at frame or batch level.

Parameters:
  • probabilities (torch.Tensor) – The posterior probabilities of shape [batch, prob] or [batch, frames, prob]

  • targets (torch.Tensor) – The targets, of shape [batch] or [batch, frames]

  • length (torch.Tensor) – Length of each utterance, if frame-level loss is desired.

  • allowed_len_diff (int) – Length difference that will be tolerated before raising an exception.

  • reduction (str) – Options are ‘mean’, ‘batch’, ‘batchmean’, ‘sum’. See pytorch for ‘mean’, ‘sum’. The ‘batch’ option returns one loss per item in the batch, ‘batchmean’ returns sum / batch size.

Return type:

The computed classification error.

Example

>>> probs = torch.tensor([[[0.9, 0.1], [0.1, 0.9]]])
>>> classification_error(probs, torch.tensor([1, 1]))
tensor(0.5000)
speechbrain.nnet.losses.nll_loss(log_probabilities, targets, length=None, label_smoothing=0.0, allowed_len_diff=3, weight=None, reduction='mean')[source]

Computes negative log likelihood loss.

Parameters:
  • log_probabilities (torch.Tensor) – The probabilities after log has been applied. Format is [batch, log_p] or [batch, frames, log_p].

  • targets (torch.Tensor) – The targets, of shape [batch] or [batch, frames].

  • length (torch.Tensor) – Length of each utterance, if frame-level loss is desired.

  • label_smoothing (float) – The amount of smoothing to apply to labels (default 0.0, no smoothing)

  • allowed_len_diff (int) – Length difference that will be tolerated before raising an exception.

  • weight (torch.Tensor) – A manual rescaling weight given to each class. If given, has to be a Tensor of size C.

  • reduction (str) – Options are ‘mean’, ‘batch’, ‘batchmean’, ‘sum’. See pytorch for ‘mean’, ‘sum’. The ‘batch’ option returns one loss per item in the batch, ‘batchmean’ returns sum / batch size.

Return type:

The computed NLL loss.

Example

>>> probs = torch.tensor([[0.9, 0.1], [0.1, 0.9]])
>>> nll_loss(torch.log(probs), torch.tensor([1, 1]))
tensor(1.2040)
speechbrain.nnet.losses.bce_loss(inputs, targets, length=None, weight=None, pos_weight=None, reduction='mean', allowed_len_diff=3, label_smoothing=0.0)[source]

Computes binary cross-entropy (BCE) loss. It also applies the sigmoid function directly (this improves the numerical stability).

Parameters:
  • inputs (torch.Tensor) – The output before applying the final sigmoid. Format is [batch[, 1]?] or [batch, frames[, 1]?] (works with or without a trailing singleton dimension).

  • targets (torch.Tensor) – The targets, of shape [batch] or [batch, frames].

  • length (torch.Tensor) – Length of each utterance, if frame-level loss is desired.

  • weight (torch.Tensor) – A manual rescaling weight; if provided, it is repeated to match the input tensor shape.

  • pos_weight (torch.Tensor) – A weight of positive examples. Must be a vector with length equal to the number of classes.

  • reduction (str) – Options are ‘mean’, ‘batch’, ‘batchmean’, ‘sum’. See pytorch for ‘mean’, ‘sum’. The ‘batch’ option returns one loss per item in the batch, ‘batchmean’ returns sum / batch size.

  • allowed_len_diff (int) – Length difference that will be tolerated before raising an exception.

  • label_smoothing (float) – The amount of smoothing to apply to labels (default 0.0, no smoothing)

Return type:

The computed BCE loss.

Example

>>> inputs = torch.tensor([10.0, -6.0])
>>> targets = torch.tensor([1, 0])
>>> bce_loss(inputs, targets)
tensor(0.0013)
speechbrain.nnet.losses.kldiv_loss(log_probabilities, targets, length=None, label_smoothing=0.0, allowed_len_diff=3, pad_idx=0, reduction='mean')[source]

Computes the KL-divergence error at the batch level. This loss applies label smoothing directly to the targets.

Parameters:
  • log_probabilities (torch.Tensor) – The log posterior probabilities, of shape [batch, prob] or [batch, frames, prob].

  • targets (torch.Tensor) – The targets, of shape [batch] or [batch, frames].

  • length (torch.Tensor) – Length of each utterance, if frame-level loss is desired.

  • label_smoothing (float) – The amount of smoothing to apply to labels (default 0.0, no smoothing)

  • allowed_len_diff (int) – Length difference that will be tolerated before raising an exception.

  • pad_idx (int) – Entries of this value are considered padding.

  • reduction (str) – Options are ‘mean’, ‘batch’, ‘batchmean’, ‘sum’. See pytorch for ‘mean’, ‘sum’. The ‘batch’ option returns one loss per item in the batch, ‘batchmean’ returns sum / batch size.

Return type:

The computed kldiv loss.

Example

>>> probs = torch.tensor([[0.9, 0.1], [0.1, 0.9]])
>>> kldiv_loss(torch.log(probs), torch.tensor([1, 1]))
tensor(1.2040)
speechbrain.nnet.losses.distance_diff_loss(predictions, targets, length=None, beta=0.25, max_weight=100.0, reduction='mean')[source]

A loss function that can be used in cases where a model outputs an arbitrary probability distribution for a discrete variable on an interval scale, such as the length of a sequence, and the ground truth is the precise value of the variable from a data sample.

The loss is defined as loss_i = p_i * exp(beta * |i - y|) - 1.

The loss can also be used where the outputs are not probabilities, as long as high values close to the ground-truth position and low values away from it are desired.

Parameters:
  • predictions (torch.Tensor) – a (batch x max_len) tensor in which each element is a probability, weight or some other value at that position

  • targets (torch.Tensor) – a 1-D tensor in which each element is the ground truth

  • length (torch.Tensor) – lengths (for masking in padded batches)

  • beta (float) – a hyperparameter controlling the penalties; with a higher beta, penalties increase faster

  • max_weight (float) – the maximum distance weight (for numerical stability in long sequences)

  • reduction (str) – Options are ‘mean’, ‘batch’, ‘batchmean’, ‘sum’. See pytorch for ‘mean’, ‘sum’. The ‘batch’ option returns one loss per item in the batch, ‘batchmean’ returns sum / batch size

Return type:

The masked loss.

Example

>>> predictions = torch.tensor(
...    [[0.25, 0.5, 0.25, 0.0],
...     [0.05, 0.05, 0.9, 0.0],
...     [8.0, 0.10, 0.05, 0.05]]
... )
>>> targets = torch.tensor([2., 3., 1.])
>>> length = torch.tensor([.75, .75, 1.])
>>> loss = distance_diff_loss(predictions, targets, length)
>>> loss
tensor(0.2967)
speechbrain.nnet.losses.truncate(predictions, targets, allowed_len_diff=3)[source]

Ensure that predictions and targets are the same length.

Parameters:
  • predictions (torch.Tensor) – First tensor for checking length.

  • targets (torch.Tensor) – Second tensor for checking length.

  • allowed_len_diff (int) – Length difference that will be tolerated before raising an exception.

Returns:

  • predictions (torch.Tensor)

  • targets (torch.Tensor) – Same as the inputs, but truncated to a common length.
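
A small sketch of the behavior (assuming, per the parameter description, that differences smaller than allowed_len_diff are truncated to the shorter length rather than raising an exception):

>>> predictions = torch.randn(2, 12, 5)
>>> targets = torch.randn(2, 10, 5)
>>> predictions, targets = truncate(predictions, targets, allowed_len_diff=3)
>>> predictions.shape
torch.Size([2, 10, 5])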

speechbrain.nnet.losses.compute_masked_loss(loss_fn, predictions, targets, length=None, label_smoothing=0.0, mask_shape='targets', reduction='mean')[source]

Compute the true average loss of a set of waveforms of unequal length.

Parameters:
  • loss_fn (function) – A function for computing the loss taking just predictions and targets. Should return all the losses, not a reduction (e.g. reduction=”none”).

  • predictions (torch.Tensor) – First argument to loss function.

  • targets (torch.Tensor) – Second argument to loss function.

  • length (torch.Tensor) – Length of each utterance to compute mask. If None, global average is computed and returned.

  • label_smoothing (float) – The proportion of label smoothing. Should only be used for NLL loss. Ref: Regularizing Neural Networks by Penalizing Confident Output Distributions. https://arxiv.org/abs/1701.06548

  • mask_shape (str) –

    The shape of the mask. The default is “targets”, which causes the mask to have the same shape as the targets.

    Other options include “predictions” and “loss”, which use the shape of the predictions and of the unreduced loss, respectively. These are useful for loss functions whose output does not match the shape of the targets.

  • reduction (str) – One of ‘mean’, ‘batch’, ‘batchmean’, ‘none’: ‘mean’ returns a single value, ‘batch’ returns one value per item in the batch, ‘batchmean’ is sum / batch_size, and ‘none’ returns all values.

Return type:

The masked loss.
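
A minimal sketch with an unreduced L1 loss (illustrative; length holds relative lengths, as elsewhere in this module):

>>> import torch.nn.functional as F
>>> predictions = torch.randn(2, 10)
>>> targets = torch.randn(2, 10)
>>> loss = compute_masked_loss(
...     lambda p, t: F.l1_loss(p, t, reduction="none"),
...     predictions,
...     targets,
...     length=torch.tensor([1.0, 0.5]),
... )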

speechbrain.nnet.losses.compute_length_mask(data, length=None, len_dim=1)[source]

Computes a length mask for the specified data shape

Parameters:
  • data (torch.Tensor) – the data shape

  • length (torch.Tensor) – the length of the corresponding data samples

  • len_dim (int) – the length dimension (defaults to 1)

Returns:

mask – the mask

Return type:

torch.Tensor

Example

>>> data = torch.arange(5)[None, :, None].repeat(3, 1, 2)
>>> data += torch.arange(1, 4)[:, None, None]
>>> data *= torch.arange(1, 3)[None, None, :]
>>> data
tensor([[[ 1,  2],
         [ 2,  4],
         [ 3,  6],
         [ 4,  8],
         [ 5, 10]],

        [[ 2,  4],
         [ 3,  6],
         [ 4,  8],
         [ 5, 10],
         [ 6, 12]],

        [[ 3,  6],
         [ 4,  8],
         [ 5, 10],
         [ 6, 12],
         [ 7, 14]]])
>>> compute_length_mask(data, torch.tensor([1., .4, .8]))
tensor([[[1, 1],
         [1, 1],
         [1, 1],
         [1, 1],
         [1, 1]],

        [[1, 1],
         [1, 1],
         [0, 0],
         [0, 0],
         [0, 0]],

        [[1, 1],
         [1, 1],
         [1, 1],
         [1, 1],
         [0, 0]]])
>>> compute_length_mask(data, torch.tensor([.5, 1., .5]), len_dim=2)
tensor([[[1, 0],
         [1, 0],
         [1, 0],
         [1, 0],
         [1, 0]],

        [[1, 1],
         [1, 1],
         [1, 1],
         [1, 1],
         [1, 1]],

        [[1, 0],
         [1, 0],
         [1, 0],
         [1, 0],
         [1, 0]]])
speechbrain.nnet.losses.reduce_loss(loss, mask, reduction='mean', label_smoothing=0.0, predictions=None, targets=None)[source]

Performs the specified reduction of the raw loss value

Parameters:
  • loss (torch.Tensor) – The raw, unreduced loss values (as returned by a loss function called with reduction=”none”) to which the mask and reduction are applied.

  • mask (torch.Tensor) – Mask to apply before computing loss.

  • reduction (str) – One of ‘mean’, ‘batch’, ‘batchmean’, ‘none’: ‘mean’ returns a single value, ‘batch’ returns one value per item in the batch, ‘batchmean’ is sum / batch_size, and ‘none’ returns all values.

  • label_smoothing (float) – The proportion of label smoothing. Should only be used for NLL loss. Ref: Regularizing Neural Networks by Penalizing Confident Output Distributions. https://arxiv.org/abs/1701.06548

  • predictions (torch.Tensor) – First argument to loss function. Required only if label smoothing is used.

  • targets (torch.Tensor) – Second argument to loss function. Required only if label smoothing is used.

Return type:

Reduced loss.
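
A sketch of how reduce_loss can pair with compute_length_mask (illustrative):

>>> predictions = torch.randn(2, 10)
>>> targets = torch.randn(2, 10)
>>> raw_loss = torch.nn.functional.mse_loss(predictions, targets, reduction="none")
>>> mask = compute_length_mask(targets, torch.tensor([1.0, 0.5]))
>>> loss = reduce_loss(raw_loss, mask, reduction="batch")  # one value per batch item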

speechbrain.nnet.losses.get_si_snr_with_pitwrapper(source, estimate_source)[source]

This function wraps si_snr calculation with the speechbrain pit-wrapper.

Parameters:
  • source (torch.Tensor) – Shape [B, T, C], where B is the batch size, T is the length of the sources, and C is the number of sources. The ordering is chosen so that this loss is compatible with the PitWrapper class.

  • estimate_source (torch.Tensor) – The estimated source, of shape [B, T, C]

Returns:

loss – The computed SI-SNR

Return type:

torch.Tensor

Example

>>> x = torch.arange(600).reshape(3, 100, 2)
>>> xhat = x[:, :, (1, 0)]
>>> si_snr = -get_si_snr_with_pitwrapper(x, xhat)
>>> print(si_snr)
tensor([135.2284, 135.2284, 135.2284])
speechbrain.nnet.losses.get_snr_with_pitwrapper(source, estimate_source)[source]

This function wraps snr calculation with the speechbrain pit-wrapper.

Parameters:
  • source (torch.Tensor) – Shape [B, T, E, C], where B is the batch size, T is the length of the sources, E is the number of binaural channels, and C is the number of sources. The ordering is chosen so that this loss is compatible with the PitWrapper class.

  • estimate_source (torch.Tensor) – The estimated source, of shape [B, T, E, C]

Returns:

loss – The computed SNR

Return type:

torch.Tensor
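
A shape sketch mirroring the documented [B, T, E, C] layout (illustrative):

>>> source = torch.randn(2, 100, 2, 3)
>>> estimate_source = torch.randn(2, 100, 2, 3)
>>> loss = get_snr_with_pitwrapper(source, estimate_source)  # per-utterance loss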

speechbrain.nnet.losses.cal_si_snr(source, estimate_source)[source]

Calculate SI-SNR.

Parameters:
  • source (torch.Tensor) – Shape [T, B, C], where T is the length of the sources, B is the batch size, and C is the number of sources. The ordering is chosen so that this loss is compatible with the PitWrapper class.

  • estimate_source (torch.Tensor) – The estimated source, of shape [T, B, C]

Returns:

The calculated SI-SNR.

Example

>>> x = torch.Tensor([[1, 0], [123, 45], [34, 5], [2312, 421]])
>>> xhat = x[:, (1, 0)]
>>> x = x.unsqueeze(-1).repeat(1, 1, 2)
>>> xhat = xhat.unsqueeze(1).repeat(1, 2, 1)
>>> si_snr = -cal_si_snr(x, xhat)
>>> print(si_snr)
tensor([[[ 25.2142, 144.1789],
         [130.9283,  25.2142]]])

speechbrain.nnet.losses.cal_snr(source, estimate_source)[source]

Calculate binaural channel SNR.

Parameters:
  • source (torch.Tensor) – Shape [T, E, B, C], where T is the length of the sources, E is the number of binaural channels, B is the batch size, and C is the number of sources. The ordering is chosen so that this loss is compatible with the PitWrapper class.

  • estimate_source (torch.Tensor) – The estimated source, of shape [T, E, B, C]

Return type:

Binaural channel SNR
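
A shape sketch (illustrative; note that the [T, E, B, C] layout differs from the [B, T, E, C] layout of the wrapper above):

>>> source = torch.randn(100, 2, 4, 3)
>>> estimate_source = torch.randn(100, 2, 4, 3)
>>> snr = cal_snr(source, estimate_source)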

speechbrain.nnet.losses.get_mask(source, source_lengths)[source]
Parameters:
  • source (torch.Tensor) – Shape [T, B, C].

  • source_lengths (torch.Tensor) – Shape [B].

Returns:

mask – Shape [T, B, 1]

Return type:

torch.Tensor

Example

>>> source = torch.randn(4, 3, 2)
>>> source_lengths = torch.Tensor([2, 1, 4]).int()
>>> mask = get_mask(source, source_lengths)
>>> print(mask)
tensor([[[1.],
         [1.],
         [1.]],

        [[1.],
         [0.],
         [1.]],

        [[0.],
         [0.],
         [1.]],

        [[0.],
         [0.],
         [1.]]])
class speechbrain.nnet.losses.AngularMargin(margin=0.0, scale=1.0)[source]

Bases: Module

An implementation of Angular Margin (AM) proposed in the following paper: “Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition” (https://arxiv.org/abs/1906.07317)

Parameters:
  • margin (float) – The margin for cosine similarity

  • scale (float) – The scale for cosine similarity

Example

>>> pred = AngularMargin()
>>> outputs = torch.tensor([ [1., -1.], [-1., 1.], [0.9, 0.1], [0.1, 0.9] ])
>>> targets = torch.tensor([ [1., 0.], [0., 1.], [ 1., 0.], [0.,  1.] ])
>>> predictions = pred(outputs, targets)
>>> predictions[:,0] > predictions[:,1]
tensor([ True, False,  True, False])
forward(outputs, targets)[source]

Compute AM between two tensors

Parameters:
  • outputs (torch.Tensor) – The outputs of shape [N, C]; cosine similarities are expected.

  • targets (torch.Tensor) – The targets of shape [N, C], to which the margin is applied.

Returns:

predictions

Return type:

torch.Tensor

class speechbrain.nnet.losses.AdditiveAngularMargin(margin=0.0, scale=1.0, easy_margin=False)[source]

Bases: AngularMargin

An implementation of Additive Angular Margin (AAM) proposed in the following paper: “Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition” (https://arxiv.org/abs/1906.07317)

Parameters:
  • margin (float) – The margin for cosine similarity.

  • scale (float) – The scale for cosine similarity.

  • easy_margin (bool)

Example

>>> outputs = torch.tensor([ [1., -1.], [-1., 1.], [0.9, 0.1], [0.1, 0.9] ])
>>> targets = torch.tensor([ [1., 0.], [0., 1.], [ 1., 0.], [0.,  1.] ])
>>> pred = AdditiveAngularMargin()
>>> predictions = pred(outputs, targets)
>>> predictions[:,0] > predictions[:,1]
tensor([ True, False,  True, False])
forward(outputs, targets)[source]

Compute AAM between two tensors

Parameters:
  • outputs (torch.Tensor) – The outputs of shape [N, C]; cosine similarities are expected.

  • targets (torch.Tensor) – The targets of shape [N, C], to which the margin is applied.

Returns:

predictions

Return type:

torch.Tensor

class speechbrain.nnet.losses.LogSoftmaxWrapper(loss_fn)[source]

Bases: Module

Parameters:

loss_fn (Callable) – The function to wrap (e.g., AngularMargin); log-softmax is applied to its output before the loss is computed.

Example

>>> outputs = torch.tensor([ [1., -1.], [-1., 1.], [0.9, 0.1], [0.1, 0.9] ])
>>> outputs = outputs.unsqueeze(1)
>>> targets = torch.tensor([ [0], [1], [0], [1] ])
>>> log_prob = LogSoftmaxWrapper(nn.Identity())
>>> loss = log_prob(outputs, targets)
>>> 0 <= loss < 1
tensor(True)
>>> log_prob = LogSoftmaxWrapper(AngularMargin(margin=0.2, scale=32))
>>> loss = log_prob(outputs, targets)
>>> 0 <= loss < 1
tensor(True)
>>> outputs = torch.tensor([ [1., -1.], [-1., 1.], [0.9, 0.1], [0.1, 0.9] ])
>>> log_prob = LogSoftmaxWrapper(AdditiveAngularMargin(margin=0.3, scale=32))
>>> loss = log_prob(outputs, targets)
>>> 0 <= loss < 1
tensor(True)
forward(outputs, targets, length=None)[source]
Parameters:
  • outputs (torch.Tensor) – Network output tensor, of shape [batch, 1, outdim].

  • targets (torch.Tensor) – Target tensor, of shape [batch, 1].

  • length (torch.Tensor) – The lengths of the corresponding inputs.

Returns:

loss – Loss for current examples.

Return type:

torch.Tensor

speechbrain.nnet.losses.ctc_loss_kd(log_probs, targets, input_lens, blank_index, device)[source]

Knowledge distillation for CTC loss.

Reference

Distilling Knowledge from Ensembles of Acoustic Models for Joint CTC-Attention End-to-End Speech Recognition. https://arxiv.org/abs/2005.09310

Parameters:
  • log_probs (torch.Tensor) – Predicted tensor from the student model, of shape [batch, time, chars].

  • targets (torch.Tensor) – Predicted tensor from a single teacher model, of shape [batch, time, chars].

  • input_lens (torch.Tensor) – Length of each utterance.

  • blank_index (int) – The location of the blank symbol among the character indexes.

  • device (str) – Device for computing.

Return type:

The computed CTC loss.
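
A hypothetical call sketch based only on the documented signature (the teacher's per-frame probabilities play the role of targets; relative input lengths are assumed, as elsewhere in this module):

>>> student_log_probs = torch.randn(2, 16, 5).log_softmax(dim=-1)
>>> teacher_probs = torch.randn(2, 16, 5).softmax(dim=-1)
>>> input_lens = torch.tensor([1.0, 0.9])
>>> loss = ctc_loss_kd(
...     student_log_probs, teacher_probs, input_lens, blank_index=0, device="cpu"
... )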

speechbrain.nnet.losses.ce_kd(inp, target)[source]

Simple version of distillation for cross-entropy loss.

Parameters:
  • inp (torch.Tensor) – The probabilities from student model, of shape [batch_size * length, feature]

  • target (torch.Tensor) – The probabilities from teacher model, of shape [batch_size * length, feature]

Return type:

The distilled outputs.
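
A minimal sketch following the documented [batch_size * length, feature] layout (illustrative; passing log-probabilities for the student, as in a standard distillation cross-entropy, is an assumption here):

>>> inp = torch.log_softmax(torch.randn(8, 10), dim=-1)
>>> target = torch.softmax(torch.randn(8, 10), dim=-1)
>>> loss = ce_kd(inp, target)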

speechbrain.nnet.losses.nll_loss_kd(probabilities, targets, rel_lab_lengths)[source]

Knowledge distillation for negative log-likelihood loss.

Reference

Distilling Knowledge from Ensembles of Acoustic Models for Joint CTC-Attention End-to-End Speech Recognition. https://arxiv.org/abs/2005.09310

Parameters:
  • probabilities (torch.Tensor) – The predicted probabilities from the student model. Format is [batch, frames, p].

  • targets (torch.Tensor) – The target probabilities from the teacher model. Format is [batch, frames, p].

  • rel_lab_lengths (torch.Tensor) – Length of each utterance, if the frame-level loss is desired.

Return type:

The computed NLL KD loss.

Example

>>> probabilities = torch.tensor([[[0.8, 0.2], [0.2, 0.8]]])
>>> targets = torch.tensor([[[0.9, 0.1], [0.1, 0.9]]])
>>> rel_lab_lengths = torch.tensor([1.])
>>> nll_loss_kd(probabilities, targets, rel_lab_lengths)
tensor(-0.7400)
class speechbrain.nnet.losses.ContrastiveLoss(logit_temp)[source]

Bases: Module

Contrastive loss as used in wav2vec2.

Reference

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations https://arxiv.org/abs/2006.11477

Parameters:

logit_temp (float) – A temperature by which to divide the logits.

forward(x, y, negs)[source]

Compute contrastive loss.

Parameters:
  • x (torch.Tensor) – Encoded embeddings with shape (B, T, C).

  • y (torch.Tensor) – Feature extractor target embeddings with shape (B, T, C).

  • negs (torch.Tensor) – Negative embeddings from feature extractor with shape (N, B, T, C) where N is number of negatives. Can be obtained with our sample_negatives function (check in lobes/wav2vec2).

Returns:

  • loss (torch.Tensor) – The computed loss

  • accuracy (torch.Tensor) – The computed accuracy
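
A shape sketch following the documented (B, T, C) and (N, B, T, C) layouts (illustrative):

>>> loss_fn = ContrastiveLoss(logit_temp=0.1)
>>> x = torch.randn(4, 50, 64)
>>> y = torch.randn(4, 50, 64)
>>> negs = torch.randn(10, 4, 50, 64)
>>> loss, accuracy = loss_fn(x, y, negs)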

class speechbrain.nnet.losses.VariationalAutoencoderLoss(rec_loss=None, len_dim=1, dist_loss_weight=0.001)[source]

Bases: Module

The Variational Autoencoder loss, with support for length masking

From Autoencoding Variational Bayes: https://arxiv.org/pdf/1312.6114.pdf

Parameters:
  • rec_loss (callable) – a function or module to compute the reconstruction loss

  • len_dim (int) – the dimension to be used for the length, if encoding sequences of variable length

  • dist_loss_weight (float) – the relative weight of the distribution loss (K-L divergence)

Example

>>> from speechbrain.nnet.autoencoders import VariationalAutoencoderOutput
>>> vae_loss = VariationalAutoencoderLoss(dist_loss_weight=0.5)
>>> predictions = VariationalAutoencoderOutput(
...     rec=torch.tensor(
...         [[0.8, 1.0],
...          [1.2, 0.6],
...          [0.4, 1.4]]
...         ),
...     mean=torch.tensor(
...         [[0.5, 1.0],
...          [1.5, 1.0],
...          [1.0, 1.4]],
...         ),
...     log_var=torch.tensor(
...         [[0.0, -0.2],
...          [2.0, -2.0],
...          [0.2,  0.4]],
...         ),
...     latent=torch.randn(3, 1),
...     latent_sample=torch.randn(3, 1),
...     latent_length=torch.tensor([1., 1., 1.]),
... )
>>> targets = torch.tensor(
...     [[0.9, 1.1],
...      [1.4, 0.6],
...      [0.2, 1.4]]
... )
>>> loss = vae_loss(predictions, targets)
>>> loss
tensor(1.1264)
>>> details = vae_loss.details(predictions, targets)
>>> details  
VariationalAutoencoderLossDetails(loss=tensor(1.1264),
                                  rec_loss=tensor(0.0333),
                                  dist_loss=tensor(2.1861),
                                  weighted_dist_loss=tensor(1.0930))
forward(predictions, targets, length=None, reduction='batchmean')[source]

Computes the forward pass

Parameters:
  • predictions (VariationalAutoencoderOutput) – The output of the variational autoencoder (see the class example).

  • targets (torch.Tensor) – The reconstruction targets.

  • length (torch.Tensor) – The lengths of the corresponding inputs (optional).

  • reduction (str) – The reduction to apply (default “batchmean”).

Returns:

loss – the VAE loss (reconstruction + K-L divergence)

Return type:

torch.Tensor

details(predictions, targets, length=None, reduction='batchmean')[source]

Gets detailed information about the loss (useful for plotting, logs, etc.)

Parameters:
  • predictions (VariationalAutoencoderOutput) – The output of the variational autoencoder (see the class example).

  • targets (torch.Tensor) – The reconstruction targets.

  • length (torch.Tensor) – The lengths of the corresponding inputs (optional).

  • reduction (str) – The reduction to apply (default “batchmean”).

Returns:

details – a namedtuple with the following fields:

  • loss (torch.Tensor) – the combined loss

  • rec_loss (torch.Tensor) – the reconstruction loss

  • dist_loss (torch.Tensor) – the distribution loss (K-L divergence), raw value

  • weighted_dist_loss (torch.Tensor) – the weighted value of the distribution loss, as used in the combined loss

Return type:

VariationalAutoencoderLossDetails

class speechbrain.nnet.losses.AutoencoderLoss(rec_loss=None, len_dim=1)[source]

Bases: Module

An implementation of a standard (non-variational) autoencoder loss

Parameters:
  • rec_loss (callable) – the callable to compute the reconstruction loss

  • len_dim (int) – the dimension index to be used for length

Example

>>> from speechbrain.nnet.autoencoders import AutoencoderOutput
>>> ae_loss = AutoencoderLoss()
>>> rec = torch.tensor(
...   [[0.8, 1.0],
...    [1.2, 0.6],
...    [0.4, 1.4]]
... )
>>> predictions = AutoencoderOutput(
...     rec=rec,
...     latent=torch.randn(3, 1),
...     latent_length=torch.tensor([1., 1.])
... )
>>> targets = torch.tensor(
...     [[0.9, 1.1],
...      [1.4, 0.6],
...      [0.2, 1.4]]
... )
>>> ae_loss(predictions, targets)
tensor(0.0333)
>>> ae_loss.details(predictions, targets)
AutoencoderLossDetails(loss=tensor(0.0333), rec_loss=tensor(0.0333))
forward(predictions, targets, length=None, reduction='batchmean')[source]

Computes the autoencoder loss

Parameters:
  • predictions (AutoencoderOutput) – The output of the autoencoder (see the class example).

  • targets (torch.Tensor) – The reconstruction targets.

  • length (torch.Tensor) – The lengths of the corresponding inputs (optional).

  • reduction (str) – The reduction to apply (default “batchmean”).

Return type:

The computed loss.

details(predictions, targets, length=None, reduction='batchmean')[source]

Gets detailed information about the loss (useful for plotting, logs, etc.)

This is provided mainly to make the loss interchangeable with more complex autoencoder losses, such as the VAE loss.

Parameters:
  • predictions (AutoencoderOutput) – The output of the autoencoder (see the class example).

  • targets (torch.Tensor) – The reconstruction targets.

  • length (torch.Tensor) – The lengths of the corresponding inputs (optional).

  • reduction (str) – The reduction to apply (default “batchmean”).

Returns:

details – a namedtuple with the following fields:

  • loss (torch.Tensor) – the combined loss

  • rec_loss (torch.Tensor) – the reconstruction loss

Return type:

AutoencoderLossDetails

class speechbrain.nnet.losses.VariationalAutoencoderLossDetails(loss, rec_loss, dist_loss, weighted_dist_loss)

Bases: tuple

dist_loss

Alias for field number 2

loss

Alias for field number 0

rec_loss

Alias for field number 1

weighted_dist_loss

Alias for field number 3

class speechbrain.nnet.losses.AutoencoderLossDetails(loss, rec_loss)

Bases: tuple

loss

Alias for field number 0

rec_loss

Alias for field number 1

class speechbrain.nnet.losses.Laplacian(kernel_size, dtype=torch.float32)[source]

Bases: Module

Computes the Laplacian for image-like data

Parameters:
  • kernel_size (int) – the size of the Laplacian kernel

  • dtype (torch.dtype) – the data type (optional)

Example

>>> lap = Laplacian(3)
>>> lap.get_kernel()
tensor([[[[-1., -1., -1.],
          [-1.,  8., -1.],
          [-1., -1., -1.]]]])
>>> data = torch.eye(6) + torch.eye(6).flip(0)
>>> data
tensor([[1., 0., 0., 0., 0., 1.],
        [0., 1., 0., 0., 1., 0.],
        [0., 0., 1., 1., 0., 0.],
        [0., 0., 1., 1., 0., 0.],
        [0., 1., 0., 0., 1., 0.],
        [1., 0., 0., 0., 0., 1.]])
>>> lap(data.unsqueeze(0))
tensor([[[ 6., -3., -3.,  6.],
         [-3.,  4.,  4., -3.],
         [-3.,  4.,  4., -3.],
         [ 6., -3., -3.,  6.]]])
get_kernel()[source]

Computes the Laplacian kernel

forward(data)[source]

Computes the Laplacian of image-like data

Parameters:

data (torch.Tensor) – a (B x C x W x H) or (B x C x H x W) tensor with image-like data

Return type:

The transformed outputs.

class speechbrain.nnet.losses.LaplacianVarianceLoss(kernel_size=3, len_dim=1)[source]

Bases: Module

The Laplacian variance loss - used to penalize blurriness in image-like data, such as spectrograms.

The loss value will be the negative variance because the higher the variance, the sharper the image.

Parameters:
  • kernel_size (int) – the Laplacian kernel size

  • len_dim (int) – the dimension to be used as the length

Example

>>> lap_loss = LaplacianVarianceLoss(3)
>>> data = torch.ones(6, 6).unsqueeze(0)
>>> data
tensor([[[1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.]]])
>>> lap_loss(data)
tensor(-0.)
>>> data = (
...     torch.eye(6) + torch.eye(6).flip(0)
... ).unsqueeze(0)
>>> data
tensor([[[1., 0., 0., 0., 0., 1.],
         [0., 1., 0., 0., 1., 0.],
         [0., 0., 1., 1., 0., 0.],
         [0., 0., 1., 1., 0., 0.],
         [0., 1., 0., 0., 1., 0.],
         [1., 0., 0., 0., 0., 1.]]])
>>> lap_loss(data)
tensor(-17.6000)
forward(predictions, length=None, reduction=None)[source]

Computes the Laplacian loss

Parameters:
  • predictions (torch.Tensor) – a (B x C x W x H) or (B x C x H x W) tensor

  • length (torch.Tensor) – The length of the corresponding inputs.

  • reduction (str) – “batch” or None

Returns:

loss – the loss value

Return type:

torch.Tensor