speechbrain.lobes.models.FastSpeech2 module

Neural network modules for the FastSpeech 2: Fast and High-Quality End-to-End Text to Speech synthesis model.

Authors
* Sathvik Udupa 2022
* Pradnya Kandarkar 2023
* Yingzhi Wang 2023

Summary

Classes:

AlignmentNetwork

Learns the alignment between the input text and the spectrogram with Gaussian Attention.

BinaryAlignmentLoss

Binary loss that forces soft alignments to match the hard alignments as explained in https://arxiv.org/pdf/2108.10447.pdf.

DurationPredictor

Duration predictor layer

EncoderPreNet

Embedding layer for tokens

FastSpeech2

The FastSpeech2 text-to-speech model.

FastSpeech2WithAlignment

The FastSpeech2 text-to-speech model with internal alignment.

ForwardSumLoss

CTC alignment loss

Loss

Loss Computation

LossWithAlignment

Loss computation including internal aligner

PostNet

FastSpeech2 Conv Postnet

SPNPredictor

The silent phoneme predictor.

SSIMLoss

SSIM loss, computed as (1 - SSIM). SSIM is explained here: https://en.wikipedia.org/wiki/Structural_similarity

TextMelCollate

Zero-pads model inputs and targets based on number of frames per step

TextMelCollateWithAlignment

Zero-pads model inputs and targets based on number of frames per step

Functions:

average_over_durations

Average values over durations.

dynamic_range_compression

Dynamic range compression for audio signals

maximum_path_numpy

Monotonic alignment search algorithm; the numpy implementation is faster than the torch one.

mel_spectogram

calculates MelSpectrogram for a raw audio signal

upsample

upsample encoder output according to durations

Reference

class speechbrain.lobes.models.FastSpeech2.EncoderPreNet(n_vocab, blank_id, out_channels=512)[source]

Bases: Module

Embedding layer for tokens

Parameters:
  • n_vocab (int) – size of the dictionary of embeddings

  • blank_id (int) – padding index

  • out_channels (int) – the size of each embedding vector

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import EncoderPreNet
>>> encoder_prenet_layer = EncoderPreNet(n_vocab=40, blank_id=0, out_channels=384)
>>> x = torch.rand(3, 5)
>>> y = encoder_prenet_layer(x)
>>> y.shape
torch.Size([3, 5, 384])
forward(x)[source]

Computes the forward pass

Parameters:

x (torch.Tensor) – a (batch, tokens) input tensor

Returns:

output – the embedding layer output

Return type:

torch.Tensor

class speechbrain.lobes.models.FastSpeech2.PostNet(n_mel_channels=80, postnet_embedding_dim=512, postnet_kernel_size=5, postnet_n_convolutions=5, postnet_dropout=0.5)[source]

Bases: Module

FastSpeech2 Conv Postnet

Parameters:
  • n_mel_channels (int) – input feature dimension for convolution layers

  • postnet_embedding_dim (int) – output feature dimension for convolution layers

  • postnet_kernel_size (int) – postnet convolution kernel size

  • postnet_n_convolutions (int) – number of convolution layers

  • postnet_dropout (float) – dropout probability for postnet

forward(x)[source]

Computes the forward pass

Parameters:

x (torch.Tensor) – a (batch, time_steps, features) input tensor

Returns:

output – the spectrogram predicted

Return type:

torch.Tensor
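
A minimal usage sketch (not part of the original docstring; the default constructor arguments and an illustrative input shape are assumed):

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import PostNet
>>> postnet = PostNet()  # defaults: 80 mel channels, 5 conv layers
>>> x = torch.rand(3, 100, 80)  # (batch, time_steps, n_mel_channels)
>>> out = postnet(x)  # predicted spectrogram refinement (assumed same shape as input)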

class speechbrain.lobes.models.FastSpeech2.DurationPredictor(in_channels, out_channels, kernel_size, dropout=0.0, n_units=1)[source]

Bases: Module

Duration predictor layer

Parameters:
  • in_channels (int) – input feature dimension for convolution layers

  • out_channels (int) – output feature dimension for convolution layers

  • kernel_size (int) – duration predictor convolution kernel size

  • dropout (float) – dropout probability, 0 by default

  • n_units (int) – number of units for the output layer, 1 by default

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import DurationPredictor
>>> duration_predictor_layer = DurationPredictor(in_channels=384, out_channels=384, kernel_size=3)
>>> x = torch.randn(3, 400, 384)
>>> mask = torch.ones(3, 400, 384)
>>> y = duration_predictor_layer(x, mask)
>>> y.shape
torch.Size([3, 400, 1])
forward(x, x_mask)[source]

Computes the forward pass

Parameters:
  • x (torch.Tensor) – a (batch, time_steps, features) input tensor

  • x_mask (torch.Tensor) – the mask applied to the input

Returns:

output – the duration predictor outputs

Return type:

torch.Tensor

class speechbrain.lobes.models.FastSpeech2.SPNPredictor(enc_num_layers, enc_num_head, enc_d_model, enc_ffn_dim, enc_k_dim, enc_v_dim, enc_dropout, normalize_before, ffn_type, ffn_cnn_kernel_size_list, n_char, padding_idx)[source]

Bases: Module

This module is the silent phoneme predictor. It receives phoneme sequences without any silent phoneme tokens as input and predicts whether a silent phoneme should be inserted after a given position. This avoids an overly fast pace at inference time, which would otherwise result from the absence of silent phoneme tokens in the input sequence.

Parameters:
  • enc_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in encoder

  • enc_num_head (int) – number of multi-head-attention (MHA) heads in encoder transformer layers

  • enc_d_model (int) – the number of expected features in the encoder

  • enc_ffn_dim (int) – the dimension of the feedforward network model

  • enc_k_dim (int) – the dimension of the key

  • enc_v_dim (int) – the dimension of the value

  • enc_dropout (float) – Dropout for the encoder

  • normalize_before (bool) – whether normalization should be applied before or after MHA or FFN in Transformer layers.

  • ffn_type (str) – whether to use convolutional layers ('1dcnn') instead of a feed-forward network inside the transformer layers

  • ffn_cnn_kernel_size_list (list of int) – conv kernel size of 2 1d-convs if ffn_type is 1dcnn

  • n_char (int) – the number of symbols for the token embedding

  • padding_idx (int) – the index for padding
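
A minimal usage sketch (not part of the original docstring; the hyperparameters mirror the FastSpeech2 example below and are illustrative only):

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import SPNPredictor
>>> spn_predictor = SPNPredictor(
...     enc_num_layers=6,
...     enc_num_head=2,
...     enc_d_model=384,
...     enc_ffn_dim=1536,
...     enc_k_dim=384,
...     enc_v_dim=384,
...     enc_dropout=0.1,
...     normalize_before=False,
...     ffn_type='1dcnn',
...     ffn_cnn_kernel_size_list=[9, 1],
...     n_char=40,
...     padding_idx=0)
>>> tokens = torch.tensor([[13, 12, 31, 14, 19]])       # phoneme ids, no silent phonemes
>>> last_phonemes = torch.tensor([[0, 0, 1, 0, 1]])     # 1 marks the last phoneme of a word
>>> spn_decision = spn_predictor(tokens, last_phonemes)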

forward(tokens, last_phonemes)[source]

forward pass for the module

Parameters:
  • tokens (torch.Tensor) – input tokens without silent phonemes

  • last_phonemes (torch.Tensor) – indicates if a phoneme at an index is the last phoneme of a word or not

Returns:

spn_decision – indicates if a silent phoneme should be inserted after a phoneme

Return type:

torch.Tensor

infer(tokens, last_phonemes)[source]

inference function

Parameters:
  • tokens (torch.Tensor) – input tokens without silent phonemes

  • last_phonemes (torch.Tensor) – indicates if a phoneme at an index is the last phoneme of a word or not

Returns:

spn_decision – indicates if a silent phoneme should be inserted after a phoneme

Return type:

torch.Tensor

class speechbrain.lobes.models.FastSpeech2.FastSpeech2(enc_num_layers, enc_num_head, enc_d_model, enc_ffn_dim, enc_k_dim, enc_v_dim, enc_dropout, dec_num_layers, dec_num_head, dec_d_model, dec_ffn_dim, dec_k_dim, dec_v_dim, dec_dropout, normalize_before, ffn_type, ffn_cnn_kernel_size_list, n_char, n_mels, postnet_embedding_dim, postnet_kernel_size, postnet_n_convolutions, postnet_dropout, padding_idx, dur_pred_kernel_size, pitch_pred_kernel_size, energy_pred_kernel_size, variance_predictor_dropout)[source]

Bases: Module

The FastSpeech2 text-to-speech model. This class is the main entry point for the model, which is responsible for instantiating all submodules, which, in turn, manage the individual neural network layers.

Simplified STRUCTURE: input -> token embedding -> encoder -> duration/pitch/energy predictor -> duration upsampler -> decoder -> output

During training, teacher forcing is used (ground-truth durations are used for upsampling).

Parameters:
  • enc_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in encoder

  • enc_num_head (int) – number of multi-head-attention (MHA) heads in encoder transformer layers

  • enc_d_model (int) – the number of expected features in the encoder

  • enc_ffn_dim (int) – the dimension of the feedforward network model

  • enc_k_dim (int) – the dimension of the key

  • enc_v_dim (int) – the dimension of the value

  • enc_dropout (float) – Dropout for the encoder

  • dec_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in decoder

  • dec_num_head (int) – number of multi-head-attention (MHA) heads in decoder transformer layers

  • dec_d_model (int) – the number of expected features in the decoder

  • dec_ffn_dim (int) – the dimension of the feedforward network model

  • dec_k_dim (int) – the dimension of the key

  • dec_v_dim (int) – the dimension of the value

  • dec_dropout (float) – dropout for the decoder

  • normalize_before (bool) – whether normalization should be applied before or after MHA or FFN in Transformer layers.

  • ffn_type (str) – whether to use convolutional layers ('1dcnn') instead of a feed-forward network inside the transformer layers

  • ffn_cnn_kernel_size_list (list of int) – conv kernel size of 2 1d-convs if ffn_type is 1dcnn

  • n_char (int) – the number of symbols for the token embedding

  • n_mels (int) – number of bins in mel spectrogram

  • postnet_embedding_dim (int) – output feature dimension for convolution layers

  • postnet_kernel_size (int) – postnet convolution kernel size

  • postnet_n_convolutions (int) – number of convolution layers

  • postnet_dropout (float) – dropout probability for postnet

  • padding_idx (int) – the index for padding

  • dur_pred_kernel_size (int) – the convolution kernel size in duration predictor

  • pitch_pred_kernel_size (int) – kernel size for pitch prediction.

  • energy_pred_kernel_size (int) – kernel size for energy prediction.

  • variance_predictor_dropout (float) – dropout probability for variance predictor (duration/pitch/energy)

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import FastSpeech2
>>> model = FastSpeech2(
...    enc_num_layers=6,
...    enc_num_head=2,
...    enc_d_model=384,
...    enc_ffn_dim=1536,
...    enc_k_dim=384,
...    enc_v_dim=384,
...    enc_dropout=0.1,
...    dec_num_layers=6,
...    dec_num_head=2,
...    dec_d_model=384,
...    dec_ffn_dim=1536,
...    dec_k_dim=384,
...    dec_v_dim=384,
...    dec_dropout=0.1,
...    normalize_before=False,
...    ffn_type='1dcnn',
...    ffn_cnn_kernel_size_list=[9, 1],
...    n_char=40,
...    n_mels=80,
...    postnet_embedding_dim=512,
...    postnet_kernel_size=5,
...    postnet_n_convolutions=5,
...    postnet_dropout=0.5,
...    padding_idx=0,
...    dur_pred_kernel_size=3,
...    pitch_pred_kernel_size=3,
...    energy_pred_kernel_size=3,
...    variance_predictor_dropout=0.5)
>>> inputs = torch.tensor([
...     [13, 12, 31, 14, 19],
...     [31, 16, 30, 31, 0],
... ])
>>> input_lengths = torch.tensor([5, 4])
>>> durations = torch.tensor([
...     [2, 4, 1, 5, 3],
...     [1, 2, 4, 3, 0],
... ])
>>> mel_post, postnet_output, predict_durations, predict_pitch, avg_pitch, predict_energy, avg_energy, mel_lens = model(inputs, durations=durations)
>>> mel_post.shape, predict_durations.shape
(torch.Size([2, 15, 80]), torch.Size([2, 5]))
>>> predict_pitch.shape, predict_energy.shape
(torch.Size([2, 5, 1]), torch.Size([2, 5, 1]))
forward(tokens, durations=None, pitch=None, energy=None, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

forward pass for training and inference

Parameters:
  • tokens (torch.Tensor) – batch of input tokens

  • durations (torch.Tensor) – batch of durations for each token. If it is None, the model will infer on predicted durations

  • pitch (torch.Tensor) – batch of pitch for each frame. If it is None, the model will infer on predicted pitches

  • energy (torch.Tensor) – batch of energy for each frame. If it is None, the model will infer on predicted energies

  • pace (float) – scaling factor for durations

  • pitch_rate (float) – scaling factor for pitches

  • energy_rate (float) – scaling factor for energies

Returns:

  • mel_post (torch.Tensor) – mel outputs from the decoder

  • postnet_output (torch.Tensor) – mel outputs from the postnet

  • predict_durations (torch.Tensor) – predicted durations of each token

  • predict_pitch (torch.Tensor) – predicted pitches of each token

  • avg_pitch (torch.Tensor) – target pitches for each token if the input pitch is not None; None otherwise

  • predict_energy (torch.Tensor) – predicted energies of each token

  • avg_energy (torch.Tensor) – target energies for each token if the input energy is not None; None otherwise

  • mel_length (torch.Tensor) – predicted lengths of the mel spectrograms

speechbrain.lobes.models.FastSpeech2.average_over_durations(values, durs)[source]

Average values over durations.

Parameters:
  • values (torch.Tensor) – values to average, shape: [B, 1, T_de]

  • durs (torch.Tensor) – durations of each token, shape: [B, T_en]

Returns:

avg – shape: [B, 1, T_en]

Return type:

torch.Tensor
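
A small illustrative sketch of averaging frame-level values (e.g., pitch) over token durations (the shapes follow the [B, 1, T_de] / [B, T_en] convention implied by the return shape above):

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import average_over_durations
>>> pitch = torch.rand(2, 1, 15)  # frame-level values [B, 1, T_de]
>>> durs = torch.tensor([
...     [2, 4, 1, 5, 3],
...     [1, 2, 4, 3, 0],
... ])  # frames per token [B, T_en]
>>> avg_pitch = average_over_durations(pitch, durs)  # token-level averages
>>> avg_pitch.shape
torch.Size([2, 1, 5])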

speechbrain.lobes.models.FastSpeech2.upsample(feats, durs, pace=1.0, padding_value=0.0)[source]

upsample encoder output according to durations

Parameters:
  • feats (torch.Tensor) – batch of encoder output features (one vector per token)

  • durs (torch.Tensor) – durations to be used to upsample

  • pace (float) – scaling factor for durations

  • padding_value (float) – value used to pad the upsampled features, 0.0 by default

Returns:

  • upsampled_feats (torch.Tensor) – encoder outputs repeated according to the (pace-scaled) durations

  • mel_lens (torch.Tensor) – lengths of the upsampled sequences
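
Conceptually, upsampling repeats each encoder frame according to its duration. A pure-torch sketch of the idea for a single unbatched sequence (this is not the library call itself, which also handles batching, pace scaling, and padding):

>>> import torch
>>> feats = torch.rand(5, 384)            # encoder outputs, one vector per token
>>> durs = torch.tensor([2, 4, 1, 5, 3])  # mel frames per token
>>> upsampled = torch.repeat_interleave(feats, durs, dim=0)
>>> upsampled.shape
torch.Size([15, 384])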

class speechbrain.lobes.models.FastSpeech2.TextMelCollate[source]

Bases: object

Zero-pads model inputs and targets based on number of frames per step

__call__(batch)[source]

Collates a training batch from normalized text and mel-spectrogram

Parameters:

batch (list) – [text_normalized, mel_normalized]

Returns:

  • text_padded (torch.Tensor)

  • dur_padded (torch.Tensor)

  • input_lengths (torch.Tensor)

  • mel_padded (torch.Tensor)

  • pitch_padded (torch.Tensor)

  • energy_padded (torch.Tensor)

  • output_lengths (torch.Tensor)

  • len_x (torch.Tensor)

  • labels (torch.Tensor)

  • wavs (torch.Tensor)

  • no_spn_seq_padded (torch.Tensor)

  • spn_labels_padded (torch.Tensor)

  • last_phonemes_padded (torch.Tensor)

class speechbrain.lobes.models.FastSpeech2.Loss(log_scale_durations, ssim_loss_weight, duration_loss_weight, pitch_loss_weight, energy_loss_weight, mel_loss_weight, postnet_mel_loss_weight, spn_loss_weight=1.0, spn_loss_max_epochs=8)[source]

Bases: Module

Loss Computation

Parameters:
  • log_scale_durations (bool) – applies logarithm to target durations

  • ssim_loss_weight (float) – weight for ssim loss

  • duration_loss_weight (float) – weight for the duration loss

  • pitch_loss_weight (float) – weight for the pitch loss

  • energy_loss_weight (float) – weight for the energy loss

  • mel_loss_weight (float) – weight for the mel loss

  • postnet_mel_loss_weight (float) – weight for the postnet mel loss

  • spn_loss_weight (float) – weight for spn loss

  • spn_loss_max_epochs (int) – maximum number of epochs during which the spn loss is applied

forward(predictions, targets, current_epoch)[source]

Computes the value of the loss function and updates stats

Parameters:
  • predictions (tuple) – model predictions

  • targets (tuple) – ground truth data

  • current_epoch (int) – The count of the current epoch.

Returns:

loss – the loss value

Return type:

torch.Tensor

speechbrain.lobes.models.FastSpeech2.mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, min_max_energy_norm, norm, mel_scale, compression, audio)[source]

calculates MelSpectrogram for a raw audio signal

Parameters:
  • sample_rate (int) – Sample rate of audio signal.

  • hop_length (int) – Length of hop between STFT windows.

  • win_length (int) – Window size.

  • n_fft (int) – Size of FFT.

  • n_mels (int) – Number of mel filterbanks.

  • f_min (float) – Minimum frequency.

  • f_max (float) – Maximum frequency.

  • power (float) – Exponent for the magnitude spectrogram.

  • normalized (bool) – Whether to normalize by magnitude after stft.

  • min_max_energy_norm (bool) – Whether to normalize by min-max

  • norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band

  • mel_scale (str) – Scale to use: “htk” or “slaney”.

  • compression (bool) – whether to do dynamic range compression

  • audio (torch.Tensor) – input audio signal

Returns:

  • mel (torch.Tensor)

  • rmse (torch.Tensor)
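
A hedged usage sketch (the parameter values are illustrative, and the two return values follow the list above):

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import mel_spectogram
>>> audio = torch.rand(1, 16000)  # one second of audio at 16 kHz
>>> mel, rmse = mel_spectogram(
...     sample_rate=16000, hop_length=256, win_length=1024, n_fft=1024,
...     n_mels=80, f_min=0.0, f_max=8000.0, power=1.0, normalized=False,
...     min_max_energy_norm=True, norm="slaney", mel_scale="slaney",
...     compression=True, audio=audio)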

speechbrain.lobes.models.FastSpeech2.dynamic_range_compression(x, C=1, clip_val=1e-05)[source]

Dynamic range compression for audio signals
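
The usual definition in TTS pipelines, which this helper is assumed to follow, is log(clamp(x, min=clip_val) * C). A usage sketch:

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import dynamic_range_compression
>>> mel = torch.rand(2, 80, 100)  # linear-scale mel spectrogram
>>> compressed = dynamic_range_compression(mel)  # log-compressed values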

class speechbrain.lobes.models.FastSpeech2.SSIMLoss[source]

Bases: Module

SSIM loss, computed as (1 - SSIM). SSIM is explained here: https://en.wikipedia.org/wiki/Structural_similarity

sequence_mask(sequence_length, max_len=None)[source]

Create a sequence mask for filtering padding in a sequence tensor.

Parameters:
  • sequence_length (torch.Tensor) – Sequence lengths.

  • max_len (int) – Maximum sequence length. Defaults to None.

Returns:

mask – sequence mask of shape [B, T_max]

Return type:

torch.Tensor
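
A small sketch of the expected behavior (the [B, T_max] mask shape is from the docstring; the exact values are an assumption):

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import SSIMLoss
>>> loss_func = SSIMLoss()
>>> mask = loss_func.sequence_mask(torch.tensor([3, 5]))  # valid positions per sequence
>>> mask.shape
torch.Size([2, 5])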

sample_wise_min_max(x: Tensor, mask: Tensor)[source]

Min-max normalizes a tensor over its first (batch) dimension.

Parameters:
  • x (torch.Tensor) – tensor to normalize

  • mask (torch.Tensor) – mask of valid positions

Return type:

Normalized tensor

forward(y_hat, y, length)[source]
Parameters:
  • y_hat (torch.Tensor) – model prediction

  • y (torch.Tensor) – ground truth

  • length (torch.Tensor) – lengths of the sequences in the batch

Returns:

loss

Return type:

Average loss value in range [0, 1] masked by the length.
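
A hedged usage sketch (the (batch, time, mel) input layout is assumed from the surrounding entries):

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import SSIMLoss
>>> loss_func = SSIMLoss()
>>> y_hat = torch.rand(2, 100, 80)  # predicted mel spectrograms
>>> y = torch.rand(2, 100, 80)      # target mel spectrograms
>>> length = torch.tensor([100, 100])
>>> loss = loss_func(y_hat, y, length)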

class speechbrain.lobes.models.FastSpeech2.TextMelCollateWithAlignment[source]

Bases: object

Zero-pads model inputs and targets based on number of frames per step.

result: tuple – a tuple of tensors to be used as inputs/targets: (text_padded, dur_padded, input_lengths, mel_padded, output_lengths, len_x, labels, wavs)

__call__(batch)[source]

Collates a training batch from normalized text and mel-spectrogram

Parameters:

batch (list) – [text_normalized, mel_normalized]

Returns:

  • phoneme_padded (torch.Tensor)

  • input_lengths (torch.Tensor)

  • mel_padded (torch.Tensor)

  • pitch_padded (torch.Tensor)

  • energy_padded (torch.Tensor)

  • output_lengths (torch.Tensor)

  • labels (torch.Tensor)

  • wavs (torch.Tensor)

speechbrain.lobes.models.FastSpeech2.maximum_path_numpy(value, mask)[source]

Monotonic alignment search algorithm; the numpy implementation is faster than the torch one.

Parameters:
  • value (torch.Tensor) – input alignment values [b, t_x, t_y]

  • mask (torch.Tensor) – input alignment mask [b, t_x, t_y]

Returns:

path

Return type:

torch.Tensor

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import maximum_path_numpy
>>> alignment = torch.rand(2, 5, 100)
>>> mask = torch.ones(2, 5, 100)
>>> hard_alignments = maximum_path_numpy(alignment, mask)
class speechbrain.lobes.models.FastSpeech2.AlignmentNetwork(in_query_channels=80, in_key_channels=512, attn_channels=80, temperature=0.0005)[source]

Bases: Module

Learns the alignment between the input text and the spectrogram with Gaussian Attention.

query -> conv1d -> relu -> conv1d -> relu -> conv1d -> L2_dist -> softmax -> alignment
key   -> conv1d -> relu -> conv1d ------------------------^

Parameters:
  • in_query_channels (int) – Number of channels in the query network. Defaults to 80.

  • in_key_channels (int) – Number of channels in the key network. Defaults to 512.

  • attn_channels (int) – Number of inner channels in the attention layers. Defaults to 80.

  • temperature (float) – Temperature for the softmax. Defaults to 0.0005.

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import AlignmentNetwork
>>> aligner = AlignmentNetwork(
...     in_query_channels=80,
...     in_key_channels=512,
...     attn_channels=80,
...     temperature=0.0005,
... )
>>> phoneme_feats = torch.rand(2, 512, 20)
>>> mels = torch.rand(2, 80, 100)
>>> alignment_soft, alignment_logprob = aligner(mels, phoneme_feats, None, None)
>>> alignment_soft.shape, alignment_logprob.shape
(torch.Size([2, 1, 100, 20]), torch.Size([2, 1, 100, 20]))
forward(queries, keys, mask, attn_prior)[source]

Forward pass of the aligner encoder.

Parameters:
  • queries (torch.Tensor) – the query tensor [B, C, T_de]

  • keys (torch.Tensor) – the key tensor [B, C_emb, T_en]

  • mask (torch.Tensor) – the query mask [B, T_de]

  • attn_prior (torch.Tensor) – the prior attention tensor [B, 1, T_en, T_de]

Returns:

  • attn (torch.Tensor) – soft attention [B, 1, T_en, T_de]

  • attn_logp (torch.Tensor) – log probabilities [B, 1, T_en, T_de]

class speechbrain.lobes.models.FastSpeech2.FastSpeech2WithAlignment(enc_num_layers, enc_num_head, enc_d_model, enc_ffn_dim, enc_k_dim, enc_v_dim, enc_dropout, in_query_channels, in_key_channels, attn_channels, temperature, dec_num_layers, dec_num_head, dec_d_model, dec_ffn_dim, dec_k_dim, dec_v_dim, dec_dropout, normalize_before, ffn_type, ffn_cnn_kernel_size_list, n_char, n_mels, postnet_embedding_dim, postnet_kernel_size, postnet_n_convolutions, postnet_dropout, padding_idx, dur_pred_kernel_size, pitch_pred_kernel_size, energy_pred_kernel_size, variance_predictor_dropout)[source]

Bases: Module

The FastSpeech2 text-to-speech model with internal alignment. This class is the main entry point for the model, which is responsible for instantiating all submodules, which, in turn, manage the individual neural network layers. Certain parts are adapted from the following implementation: https://github.com/coqui-ai/TTS/blob/dev/TTS/tts/models/forward_tts.py

Simplified STRUCTURE: input -> token embedding -> encoder -> aligner -> duration/pitch/energy -> upsampler -> decoder -> output

Parameters:
  • enc_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in encoder

  • enc_num_head (int) – number of multi-head-attention (MHA) heads in encoder transformer layers

  • enc_d_model (int) – the number of expected features in the encoder

  • enc_ffn_dim (int) – the dimension of the feedforward network model

  • enc_k_dim (int) – the dimension of the key

  • enc_v_dim (int) – the dimension of the value

  • enc_dropout (float) – Dropout for the encoder

  • in_query_channels (int) – Number of channels in the query network.

  • in_key_channels (int) – Number of channels in the key network.

  • attn_channels (int) – Number of inner channels in the attention layers.

  • temperature (float) – Temperature for the softmax.

  • dec_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in decoder

  • dec_num_head (int) – number of multi-head-attention (MHA) heads in decoder transformer layers

  • dec_d_model (int) – the number of expected features in the decoder

  • dec_ffn_dim (int) – the dimension of the feedforward network model

  • dec_k_dim (int) – the dimension of the key

  • dec_v_dim (int) – the dimension of the value

  • dec_dropout (float) – dropout for the decoder

  • normalize_before (bool) – whether normalization should be applied before or after MHA or FFN in Transformer layers.

  • ffn_type (str) – whether to use convolutional layers ('1dcnn') instead of a feed-forward network inside the transformer layers

  • ffn_cnn_kernel_size_list (list of int) – conv kernel size of 2 1d-convs if ffn_type is 1dcnn

  • n_char (int) – the number of symbols for the token embedding

  • n_mels (int) – number of bins in mel spectrogram

  • postnet_embedding_dim (int) – output feature dimension for convolution layers

  • postnet_kernel_size (int) – postnet convolution kernel size

  • postnet_n_convolutions (int) – number of convolution layers

  • postnet_dropout (float) – dropout probability for postnet

  • padding_idx (int) – the index for padding

  • dur_pred_kernel_size (int) – the convolution kernel size in duration predictor

  • pitch_pred_kernel_size (int) – kernel size for pitch prediction.

  • energy_pred_kernel_size (int) – kernel size for energy prediction.

  • variance_predictor_dropout (float) – dropout probability for variance predictor (duration/pitch/energy)

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import FastSpeech2WithAlignment
>>> model = FastSpeech2WithAlignment(
...    enc_num_layers=6,
...    enc_num_head=2,
...    enc_d_model=384,
...    enc_ffn_dim=1536,
...    enc_k_dim=384,
...    enc_v_dim=384,
...    enc_dropout=0.1,
...    in_query_channels=80,
...    in_key_channels=384,
...    attn_channels=80,
...    temperature=0.0005,
...    dec_num_layers=6,
...    dec_num_head=2,
...    dec_d_model=384,
...    dec_ffn_dim=1536,
...    dec_k_dim=384,
...    dec_v_dim=384,
...    dec_dropout=0.1,
...    normalize_before=False,
...    ffn_type='1dcnn',
...    ffn_cnn_kernel_size_list=[9, 1],
...    n_char=40,
...    n_mels=80,
...    postnet_embedding_dim=512,
...    postnet_kernel_size=5,
...    postnet_n_convolutions=5,
...    postnet_dropout=0.5,
...    padding_idx=0,
...    dur_pred_kernel_size=3,
...    pitch_pred_kernel_size=3,
...    energy_pred_kernel_size=3,
...    variance_predictor_dropout=0.5)
>>> inputs = torch.tensor([
...     [13, 12, 31, 14, 19],
...     [31, 16, 30, 31, 0],
... ])
>>> mels = torch.rand(2, 100, 80)
>>> mel_post, postnet_output, durations, predict_pitch, avg_pitch, predict_energy, avg_energy, mel_lens, alignment_durations, alignment_soft, alignment_logprob, alignment_mas = model(inputs, mels)
>>> mel_post.shape, durations.shape
(torch.Size([2, 100, 80]), torch.Size([2, 5]))
>>> predict_pitch.shape, predict_energy.shape
(torch.Size([2, 5, 1]), torch.Size([2, 5, 1]))
>>> alignment_soft.shape, alignment_mas.shape
(torch.Size([2, 100, 5]), torch.Size([2, 100, 5]))
forward(tokens, mel_spectograms=None, pitch=None, energy=None, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

forward pass for training and inference

Parameters:
  • tokens (torch.Tensor) – batch of input tokens

  • mel_spectograms (torch.Tensor) – batch of mel_spectograms (used only for training)

  • pitch (torch.Tensor) – batch of pitch for each frame. If it is None, the model will infer on predicted pitches

  • energy (torch.Tensor) – batch of energy for each frame. If it is None, the model will infer on predicted energies

  • pace (float) – scaling factor for durations

  • pitch_rate (float) – scaling factor for pitches

  • energy_rate (float) – scaling factor for energies

Returns:

  • mel_post (torch.Tensor) – mel outputs from the decoder

  • postnet_output (torch.Tensor) – mel outputs from the postnet

  • predict_durations (torch.Tensor) – predicted durations of each token

  • predict_pitch (torch.Tensor) – predicted pitches of each token

  • avg_pitch (torch.Tensor) – target pitches for each token if the input pitch is not None; None otherwise

  • predict_energy (torch.Tensor) – predicted energies of each token

  • avg_energy (torch.Tensor) – target energies for each token if the input energy is not None; None otherwise

  • mel_length (torch.Tensor) – predicted lengths of the mel spectrograms

  • alignment_durations (torch.Tensor) – durations from the hard alignment map

  • alignment_soft (torch.Tensor) – soft alignment potentials

  • alignment_logprob (torch.Tensor) – log scale alignment potentials

  • alignment_mas (torch.Tensor) – hard alignment map

class speechbrain.lobes.models.FastSpeech2.LossWithAlignment(log_scale_durations, ssim_loss_weight, duration_loss_weight, pitch_loss_weight, energy_loss_weight, mel_loss_weight, postnet_mel_loss_weight, aligner_loss_weight, binary_alignment_loss_weight, binary_alignment_loss_warmup_epochs, binary_alignment_loss_max_epochs)[source]

Bases: Module

Loss computation including internal aligner

Parameters:
  • log_scale_durations (bool) – applies logarithm to target durations

  • ssim_loss_weight (float) – weight for the ssim loss

  • duration_loss_weight (float) – weight for the duration loss

  • pitch_loss_weight (float) – weight for the pitch loss

  • energy_loss_weight (float) – weight for the energy loss

  • mel_loss_weight (float) – weight for the mel loss

  • postnet_mel_loss_weight (float) – weight for the postnet mel loss

  • aligner_loss_weight (float) – weight for the alignment loss

  • binary_alignment_loss_weight (float) – weight for the binary alignment loss

  • binary_alignment_loss_warmup_epochs (int) – Number of epochs to gradually increase the impact of binary loss.

  • binary_alignment_loss_max_epochs (int) – From this epoch on the impact of binary loss is ignored.

forward(predictions, targets, current_epoch)[source]

Computes the value of the loss function and updates stats

Parameters:
  • predictions (tuple) – model predictions

  • targets (tuple) – ground truth data

  • current_epoch (int) – used to determine the start/end of the binary alignment loss

Returns:

loss – the loss value

Return type:

torch.Tensor

class speechbrain.lobes.models.FastSpeech2.ForwardSumLoss(blank_logprob=-1)[source]

Bases: Module

CTC alignment loss

Parameters:

blank_logprob (float) – log probability used as the pad value for the blank token, -1 by default

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import ForwardSumLoss
>>> loss_func = ForwardSumLoss()
>>> attn_logprob = torch.rand(2, 1, 100, 5)
>>> key_lens = torch.tensor([5, 5])
>>> query_lens = torch.tensor([100, 100])
>>> loss = loss_func(attn_logprob, key_lens, query_lens)
forward(attn_logprob, key_lens, query_lens)[source]
Parameters:
  • attn_logprob (torch.Tensor) – log scale alignment potentials [B, 1, query_len, key_len]

  • key_lens (torch.Tensor) – lengths of the key sequences (text)

  • query_lens (torch.Tensor) – lengths of the query sequences (spectrogram frames)

Returns:

total_loss

Return type:

torch.Tensor

class speechbrain.lobes.models.FastSpeech2.BinaryAlignmentLoss[source]

Bases: Module

Binary loss that forces soft alignments to match the hard alignments, as explained in https://arxiv.org/pdf/2108.10447.pdf.

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import BinaryAlignmentLoss
>>> loss_func = BinaryAlignmentLoss()
>>> alignment_hard = torch.randint(0, 2, (2, 100, 5))
>>> alignment_soft = torch.rand(2, 100, 5)
>>> loss = loss_func(alignment_hard, alignment_soft)
forward(alignment_hard, alignment_soft)[source]

Parameters:
  • alignment_hard (torch.Tensor) – hard alignment map [B, mel_lens, phoneme_lens]

  • alignment_soft (torch.Tensor) – soft alignment potentials [B, mel_lens, phoneme_lens]