speechbrain.lobes.models.FastSpeech2 module

Neural network modules for the FastSpeech 2: Fast and High-Quality End-to-End Text to Speech synthesis model.

Authors
* Sathvik Udupa 2022
* Pradnya Kandarkar 2023
* Yingzhi Wang 2023

Summary

Classes:

AlignmentNetwork

Learns the alignment between the input text and the spectrogram with Gaussian Attention.

BinaryAlignmentLoss

Binary loss that forces soft alignments to match the hard alignments as explained in https://arxiv.org/pdf/2108.10447.pdf.

DurationPredictor

Duration predictor layer

EncoderPreNet

Embedding layer for tokens

FastSpeech2

The FastSpeech2 text-to-speech model.

FastSpeech2WithAlignment

The FastSpeech2 text-to-speech model with internal alignment.

ForwardSumLoss

CTC alignment loss

Loss

Loss Computation

LossWithAlignment

Loss computation including internal aligner

PostNet

FastSpeech2 Conv Postnet

SPNPredictor

The silent phoneme predictor.

SSIMLoss

SSIM loss, computed as (1 - SSIM). SSIM is explained here: https://en.wikipedia.org/wiki/Structural_similarity

TextMelCollate

Zero-pads model inputs and targets based on number of frames per step

TextMelCollateWithAlignment

Zero-pads model inputs and targets based on number of frames per step

Functions:

average_over_durations

Average values over durations.

dynamic_range_compression

Dynamic range compression for audio signals

maximum_path_numpy

Monotonic alignment search algorithm; the numpy implementation is faster than the torch one.

mel_spectogram

calculates MelSpectrogram for a raw audio signal

upsample

upsample encoder output according to durations

Reference

class speechbrain.lobes.models.FastSpeech2.EncoderPreNet(n_vocab, blank_id, out_channels=512)[source]

Bases: Module

Embedding layer for tokens

Parameters:
  • n_vocab (int) – size of the dictionary of embeddings

  • blank_id (int) – padding index

  • out_channels (int) – the size of each embedding vector

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import EncoderPreNet
>>> encoder_prenet_layer = EncoderPreNet(n_vocab=40, blank_id=0, out_channels=384)
>>> x = torch.rand(3, 5)
>>> y = encoder_prenet_layer(x)
>>> y.shape
torch.Size([3, 5, 384])
forward(x)[source]

Computes the forward pass

Parameters:

x (torch.Tensor) – a (batch, tokens) input tensor

Returns:

output – the embedding layer output

Return type:

torch.Tensor

class speechbrain.lobes.models.FastSpeech2.PostNet(n_mel_channels=80, postnet_embedding_dim=512, postnet_kernel_size=5, postnet_n_convolutions=5, postnet_dropout=0.5)[source]

Bases: Module

FastSpeech2 Conv Postnet

Parameters:
  • n_mel_channels (int) – input feature dimension for convolution layers

  • postnet_embedding_dim (int) – output feature dimension for convolution layers

  • postnet_kernel_size (int) – postnet convolution kernel size

  • postnet_n_convolutions (int) – number of convolution layers

  • postnet_dropout (float) – dropout probability for postnet

forward(x)[source]

Computes the forward pass

Parameters:

x (torch.Tensor) – a (batch, time_steps, features) input tensor

Returns:

output – the spectrogram predicted

Return type:

torch.Tensor
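
A minimal usage sketch (not part of the original docstring; the default constructor arguments and an illustrative input shape are assumed):

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import PostNet
>>> postnet = PostNet()  # defaults: 80 mel channels, 5 conv layers
>>> x = torch.rand(3, 100, 80)  # (batch, time_steps, n_mel_channels)
>>> out = postnet(x)  # predicted spectrogram refinement (assumed same shape as input)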

class speechbrain.lobes.models.FastSpeech2.DurationPredictor(in_channels, out_channels, kernel_size, dropout=0.0, n_units=1)[source]

Bases: Module

Duration predictor layer

Parameters:
  • in_channels (int) – input feature dimension for convolution layers

  • out_channels (int) – output feature dimension for convolution layers

  • kernel_size (int) – duration predictor convolution kernel size

  • dropout (float) – dropout probability, 0 by default

  • n_units (int) – number of units for the output layer, 1 by default

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import DurationPredictor
>>> duration_predictor_layer = DurationPredictor(in_channels=384, out_channels=384, kernel_size=3)
>>> x = torch.randn(3, 400, 384)
>>> mask = torch.ones(3, 400, 384)
>>> y = duration_predictor_layer(x, mask)
>>> y.shape
torch.Size([3, 400, 1])
forward(x, x_mask)[source]

Computes the forward pass

Parameters:
  • x (torch.Tensor) – a (batch, time_steps, features) input tensor

  • x_mask (torch.Tensor) – the mask applied to the input

Returns:

output – the duration predictor outputs

Return type:

torch.Tensor

class speechbrain.lobes.models.FastSpeech2.SPNPredictor(enc_num_layers, enc_num_head, enc_d_model, enc_ffn_dim, enc_k_dim, enc_v_dim, enc_dropout, normalize_before, ffn_type, ffn_cnn_kernel_size_list, n_char, padding_idx)[source]

Bases: Module

This module is the silent phoneme predictor. It receives phoneme sequences without any silent phoneme tokens as input and predicts whether a silent phoneme should be inserted after a given position. This avoids an overly fast pace at inference time, which would otherwise result from the absence of silent phoneme tokens in the input sequence.

Parameters:
  • enc_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in encoder

  • enc_num_head (int) – number of multi-head-attention (MHA) heads in encoder transformer layers

  • enc_d_model (int) – the number of expected features in the encoder

  • enc_ffn_dim (int) – the dimension of the feedforward network model

  • enc_k_dim (int) – the dimension of the key

  • enc_v_dim (int) – the dimension of the value

  • enc_dropout (float) – Dropout for the encoder

  • normalize_before (bool) – whether normalization should be applied before or after MHA or FFN in Transformer layers.

  • ffn_type (str) – whether to use convolutional layers ('1dcnn') instead of a feed-forward network inside the transformer layers

  • ffn_cnn_kernel_size_list (list of int) – conv kernel size of 2 1d-convs if ffn_type is 1dcnn

  • n_char (int) – the number of symbols for the token embedding

  • padding_idx (int) – the index for padding
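
A minimal usage sketch (not part of the original docstring; the hyperparameters mirror the FastSpeech2 example below and are illustrative only):

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import SPNPredictor
>>> spn_predictor = SPNPredictor(
...     enc_num_layers=6,
...     enc_num_head=2,
...     enc_d_model=384,
...     enc_ffn_dim=1536,
...     enc_k_dim=384,
...     enc_v_dim=384,
...     enc_dropout=0.1,
...     normalize_before=False,
...     ffn_type='1dcnn',
...     ffn_cnn_kernel_size_list=[9, 1],
...     n_char=40,
...     padding_idx=0)
>>> tokens = torch.tensor([[13, 12, 31, 14, 19]])       # phoneme ids, no silent phonemes
>>> last_phonemes = torch.tensor([[0, 0, 1, 0, 1]])     # 1 marks the last phoneme of a word
>>> spn_decision = spn_predictor(tokens, last_phonemes)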

forward(tokens, last_phonemes)[source]

forward pass for the module

Parameters:
  • tokens (torch.Tensor) – input tokens without silent phonemes

  • last_phonemes (torch.Tensor) – indicates if a phoneme at an index is the last phoneme of a word or not

Returns:

spn_decision – indicates if a silent phoneme should be inserted after a phoneme

Return type:

torch.Tensor

infer(tokens, last_phonemes)[source]

inference function

Parameters:
  • tokens (torch.Tensor) – input tokens without silent phonemes

  • last_phonemes (torch.Tensor) – indicates if a phoneme at an index is the last phoneme of a word or not

Returns:

spn_decision – indicates if a silent phoneme should be inserted after a phoneme

Return type:

torch.Tensor

class speechbrain.lobes.models.FastSpeech2.FastSpeech2(enc_num_layers, enc_num_head, enc_d_model, enc_ffn_dim, enc_k_dim, enc_v_dim, enc_dropout, dec_num_layers, dec_num_head, dec_d_model, dec_ffn_dim, dec_k_dim, dec_v_dim, dec_dropout, normalize_before, ffn_type, ffn_cnn_kernel_size_list, n_char, n_mels, postnet_embedding_dim, postnet_kernel_size, postnet_n_convolutions, postnet_dropout, padding_idx, dur_pred_kernel_size, pitch_pred_kernel_size, energy_pred_kernel_size, variance_predictor_dropout)[source]

Bases: Module

The FastSpeech2 text-to-speech model. This class is the main entry point for the model, which is responsible for instantiating all submodules, which, in turn, manage the individual neural network layers.

Simplified STRUCTURE: input -> token embedding -> encoder -> duration/pitch/energy predictor -> duration upsampler -> decoder -> output

During training, teacher forcing is used (ground-truth durations are used for upsampling).

Parameters:
  • enc_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in encoder

  • enc_num_head (int) – number of multi-head-attention (MHA) heads in encoder transformer layers

  • enc_d_model (int) – the number of expected features in the encoder

  • enc_ffn_dim (int) – the dimension of the feedforward network model

  • enc_k_dim (int) – the dimension of the key

  • enc_v_dim (int) – the dimension of the value

  • enc_dropout (float) – Dropout for the encoder

  • dec_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in decoder

  • dec_num_head (int) – number of multi-head-attention (MHA) heads in decoder transformer layers

  • dec_d_model (int) – the number of expected features in the decoder

  • dec_ffn_dim (int) – the dimension of the feedforward network model

  • dec_k_dim (int) – the dimension of the key

  • dec_v_dim (int) – the dimension of the value

  • dec_dropout (float) – dropout for the decoder

  • normalize_before (bool) – whether normalization should be applied before or after MHA or FFN in Transformer layers.

  • ffn_type (str) – whether to use convolutional layers ('1dcnn') instead of a feed-forward network inside the transformer layers

  • ffn_cnn_kernel_size_list (list of int) – conv kernel size of 2 1d-convs if ffn_type is 1dcnn

  • n_char (int) – the number of symbols for the token embedding

  • n_mels (int) – number of bins in mel spectrogram

  • postnet_embedding_dim (int) – output feature dimension for convolution layers

  • postnet_kernel_size (int) – postnet convolution kernel size

  • postnet_n_convolutions (int) – number of convolution layers

  • postnet_dropout (float) – dropout probability for postnet

  • padding_idx (int) – the index for padding

  • dur_pred_kernel_size (int) – the convolution kernel size in duration predictor

  • pitch_pred_kernel_size (int) – kernel size for pitch prediction.

  • energy_pred_kernel_size (int) – kernel size for energy prediction.

  • variance_predictor_dropout (float) – dropout probability for variance predictor (duration/pitch/energy)

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import FastSpeech2
>>> model = FastSpeech2(
...    enc_num_layers=6,
...    enc_num_head=2,
...    enc_d_model=384,
...    enc_ffn_dim=1536,
...    enc_k_dim=384,
...    enc_v_dim=384,
...    enc_dropout=0.1,
...    dec_num_layers=6,
...    dec_num_head=2,
...    dec_d_model=384,
...    dec_ffn_dim=1536,
...    dec_k_dim=384,
...    dec_v_dim=384,
...    dec_dropout=0.1,
...    normalize_before=False,
...    ffn_type='1dcnn',
...    ffn_cnn_kernel_size_list=[9, 1],
...    n_char=40,
...    n_mels=80,
...    postnet_embedding_dim=512,
...    postnet_kernel_size=5,
...    postnet_n_convolutions=5,
...    postnet_dropout=0.5,
...    padding_idx=0,
...    dur_pred_kernel_size=3,
...    pitch_pred_kernel_size=3,
...    energy_pred_kernel_size=3,
...    variance_predictor_dropout=0.5)
>>> inputs = torch.tensor([
...     [13, 12, 31, 14, 19],
...     [31, 16, 30, 31, 0],
... ])
>>> input_lengths = torch.tensor([5, 4])
>>> durations = torch.tensor([
...     [2, 4, 1, 5, 3],
...     [1, 2, 4, 3, 0],
... ])
>>> mel_post, postnet_output, predict_durations, predict_pitch, avg_pitch, predict_energy, avg_energy, mel_lens = model(inputs, durations=durations)
>>> mel_post.shape, predict_durations.shape
(torch.Size([2, 15, 80]), torch.Size([2, 5]))
>>> predict_pitch.shape, predict_energy.shape
(torch.Size([2, 5, 1]), torch.Size([2, 5, 1]))
forward(tokens, durations=None, pitch=None, energy=None, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

forward pass for training and inference

Parameters:
  • tokens (torch.Tensor) – batch of input tokens

  • durations (torch.Tensor) – batch of durations for each token. If it is None, the model will infer on predicted durations

  • pitch (torch.Tensor) – batch of pitch for each frame. If it is None, the model will infer on predicted pitches

  • energy (torch.Tensor) – batch of energy for each frame. If it is None, the model will infer on predicted energies

  • pace (float) – scaling factor for durations

  • pitch_rate (float) – scaling factor for pitches

  • energy_rate (float) – scaling factor for energies

Returns:

  • mel_post (torch.Tensor) – mel outputs from the decoder

  • postnet_output (torch.Tensor) – mel outputs from the postnet

  • predict_durations (torch.Tensor) – predicted durations of each token

  • predict_pitch (torch.Tensor) – predicted pitches of each token

  • avg_pitch (torch.Tensor) – target pitches for each token if the input pitch is not None; None otherwise

  • predict_energy (torch.Tensor) – predicted energies of each token

  • avg_energy (torch.Tensor) – target energies for each token if the input energy is not None; None otherwise

  • mel_length (torch.Tensor) – predicted lengths of the mel spectrograms

speechbrain.lobes.models.FastSpeech2.average_over_durations(values, durs)[source]

Average values over durations.

Parameters:
  • values (torch.Tensor) – values to average, shape: [B, 1, T_de]

  • durs (torch.Tensor) – durations of each token, shape: [B, T_en]

Returns:

avg – shape: [B, 1, T_en]

Return type:

torch.Tensor
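
A small illustrative sketch of averaging frame-level values (e.g., pitch) over token durations (the shapes follow the [B, 1, T_de] / [B, T_en] convention implied by the return shape above):

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import average_over_durations
>>> pitch = torch.rand(2, 1, 15)  # frame-level values [B, 1, T_de]
>>> durs = torch.tensor([
...     [2, 4, 1, 5, 3],
...     [1, 2, 4, 3, 0],
... ])  # frames per token [B, T_en]
>>> avg_pitch = average_over_durations(pitch, durs)  # token-level averages
>>> avg_pitch.shape
torch.Size([2, 1, 5])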

speechbrain.lobes.models.FastSpeech2.upsample(feats, durs, pace=1.0, padding_value=0.0)[source]

upsample encoder output according to durations

Parameters:
  • feats (torch.Tensor) – batch of encoder output features (one vector per token)

  • durs (torch.Tensor) – durations to be used to upsample

  • pace (float) – scaling factor for durations

  • padding_value (float) – value used to pad the upsampled features, 0.0 by default

Returns:

  • upsampled_feats (torch.Tensor) – encoder outputs repeated according to the (pace-scaled) durations

  • mel_lens (torch.Tensor) – lengths of the upsampled sequences
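
Conceptually, upsampling repeats each encoder frame according to its duration. A pure-torch sketch of the idea for a single unbatched sequence (this is not the library call itself, which also handles batching, pace scaling, and padding):

>>> import torch
>>> feats = torch.rand(5, 384)            # encoder outputs, one vector per token
>>> durs = torch.tensor([2, 4, 1, 5, 3])  # mel frames per token
>>> upsampled = torch.repeat_interleave(feats, durs, dim=0)
>>> upsampled.shape
torch.Size([15, 384])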

class speechbrain.lobes.models.FastSpeech2.TextMelCollate[source]

Bases: object

Zero-pads model inputs and targets based on number of frames per step

__call__(batch)[source]

Collates a training batch from normalized text and mel-spectrogram

Parameters:

batch (list) – [text_normalized, mel_normalized]

Returns:

  • text_padded (torch.Tensor)

  • dur_padded (torch.Tensor)

  • input_lengths (torch.Tensor)

  • mel_padded (torch.Tensor)

  • pitch_padded (torch.Tensor)

  • energy_padded (torch.Tensor)

  • output_lengths (torch.Tensor)

  • len_x (torch.Tensor)

  • labels (torch.Tensor)

  • wavs (torch.Tensor)

  • no_spn_seq_padded (torch.Tensor)

  • spn_labels_padded (torch.Tensor)

  • last_phonemes_padded (torch.Tensor)

class speechbrain.lobes.models.FastSpeech2.Loss(log_scale_durations, ssim_loss_weight, duration_loss_weight, pitch_loss_weight, energy_loss_weight, mel_loss_weight, postnet_mel_loss_weight, spn_loss_weight=1.0, spn_loss_max_epochs=8)[source]

Bases: Module

Loss Computation

Parameters:
  • log_scale_durations (bool) – applies logarithm to target durations

  • ssim_loss_weight (float) – weight for ssim loss

  • duration_loss_weight (float) – weight for the duration loss

  • pitch_loss_weight (float) – weight for the pitch loss

  • energy_loss_weight (float) – weight for the energy loss

  • mel_loss_weight (float) – weight for the mel loss

  • postnet_mel_loss_weight (float) – weight for the postnet mel loss

  • spn_loss_weight (float) – weight for spn loss

  • spn_loss_max_epochs (int) – maximum number of epochs during which the spn loss is applied

forward(predictions, targets, current_epoch)[source]

Computes the value of the loss function and updates stats

Parameters:
  • predictions (tuple) – model predictions

  • targets (tuple) – ground truth data

  • current_epoch (int) – The count of the current epoch.

Returns:

loss – the loss value

Return type:

torch.Tensor

speechbrain.lobes.models.FastSpeech2.mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, min_max_energy_norm, norm, mel_scale, compression, audio)[source]

calculates MelSpectrogram for a raw audio signal

Parameters:
  • sample_rate (int) – Sample rate of audio signal.

  • hop_length (int) – Length of hop between STFT windows.

  • win_length (int) – Window size.

  • n_fft (int) – Size of FFT.

  • n_mels (int) – Number of mel filterbanks.

  • f_min (float) – Minimum frequency.

  • f_max (float) – Maximum frequency.

  • power (float) – Exponent for the magnitude spectrogram.

  • normalized (bool) – Whether to normalize by magnitude after stft.

  • min_max_energy_norm (bool) – Whether to normalize by min-max

  • norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band

  • mel_scale (str) – Scale to use: “htk” or “slaney”.

  • compression (bool) – whether to do dynamic range compression

  • audio (torch.Tensor) – input audio signal

Returns:

  • mel (torch.Tensor)

  • rmse (torch.Tensor)
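
A hedged usage sketch (the parameter values are illustrative, and the two return values follow the list above):

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import mel_spectogram
>>> audio = torch.rand(1, 16000)  # one second of audio at 16 kHz
>>> mel, rmse = mel_spectogram(
...     sample_rate=16000, hop_length=256, win_length=1024, n_fft=1024,
...     n_mels=80, f_min=0.0, f_max=8000.0, power=1.0, normalized=False,
...     min_max_energy_norm=True, norm="slaney", mel_scale="slaney",
...     compression=True, audio=audio)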

speechbrain.lobes.models.FastSpeech2.dynamic_range_compression(x, C=1, clip_val=1e-05)[source]

Dynamic range compression for audio signals
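
The usual definition in TTS pipelines, which this helper is assumed to follow, is log(clamp(x, min=clip_val) * C). A usage sketch:

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import dynamic_range_compression
>>> mel = torch.rand(2, 80, 100)  # linear-scale mel spectrogram
>>> compressed = dynamic_range_compression(mel)  # log-compressed values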

class speechbrain.lobes.models.FastSpeech2.SSIMLoss[source]

Bases: Module

SSIM loss, computed as (1 - SSIM). SSIM is explained here: https://en.wikipedia.org/wiki/Structural_similarity

sequence_mask(sequence_length, max_len=None)[source]

Create a sequence mask for filtering padding in a sequence tensor.

Parameters:
  • sequence_length (torch.Tensor) – Sequence lengths.

  • max_len (int) – Maximum sequence length. Defaults to None.

Returns:

mask – sequence mask of shape [B, T_max]

Return type:

torch.Tensor
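
A small sketch of the expected behavior (the [B, T_max] mask shape is from the docstring; the exact values are an assumption):

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import SSIMLoss
>>> loss_func = SSIMLoss()
>>> mask = loss_func.sequence_mask(torch.tensor([3, 5]))  # valid positions per sequence
>>> mask.shape
torch.Size([2, 5])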

sample_wise_min_max(x: Tensor, mask: Tensor)[source]

Min-max normalizes a tensor over its first (batch) dimension.

Parameters:
  • x (torch.Tensor) – tensor to normalize

  • mask (torch.Tensor) – mask of valid positions

Return type:

Normalized tensor

forward(y_hat, y, length)[source]
Parameters:
  • y_hat (torch.Tensor) – model prediction

  • y (torch.Tensor) – ground truth

  • length (torch.Tensor) – lengths of the sequences in the batch

Returns:

loss

Return type:

Average loss value in range [0, 1] masked by the length.
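
A hedged usage sketch (the (batch, time, mel) input layout is assumed from the surrounding entries):

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import SSIMLoss
>>> loss_func = SSIMLoss()
>>> y_hat = torch.rand(2, 100, 80)  # predicted mel spectrograms
>>> y = torch.rand(2, 100, 80)      # target mel spectrograms
>>> length = torch.tensor([100, 100])
>>> loss = loss_func(y_hat, y, length)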

class speechbrain.lobes.models.FastSpeech2.TextMelCollateWithAlignment[source]

Bases: object

Zero-pads model inputs and targets based on number of frames per step.

result: tuple – a tuple of tensors to be used as inputs/targets: (text_padded, dur_padded, input_lengths, mel_padded, output_lengths, len_x, labels, wavs)

__call__(batch)[source]

Collates a training batch from normalized text and mel-spectrogram

Parameters:

batch (list) – [text_normalized, mel_normalized]

Returns:

  • phoneme_padded (torch.Tensor)

  • input_lengths (torch.Tensor)

  • mel_padded (torch.Tensor)

  • pitch_padded (torch.Tensor)

  • energy_padded (torch.Tensor)

  • output_lengths (torch.Tensor)

  • labels (torch.Tensor)

  • wavs (torch.Tensor)

speechbrain.lobes.models.FastSpeech2.maximum_path_numpy(value, mask)[source]

Monotonic alignment search algorithm; the numpy implementation is faster than the torch one.

Parameters:
  • value (torch.Tensor) – input alignment values [b, t_x, t_y]

  • mask (torch.Tensor) – input alignment mask [b, t_x, t_y]

Returns:

path

Return type:

torch.Tensor

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import maximum_path_numpy
>>> alignment = torch.rand(2, 5, 100)
>>> mask = torch.ones(2, 5, 100)
>>> hard_alignments = maximum_path_numpy(alignment, mask)
class speechbrain.lobes.models.FastSpeech2.AlignmentNetwork(in_query_channels=80, in_key_channels=512, attn_channels=80, temperature=0.0005)[source]

Bases: Module

Learns the alignment between the input text and the spectrogram with Gaussian Attention.

query -> conv1d -> relu -> conv1d -> relu -> conv1d -> L2_dist -> softmax -> alignment
key   -> conv1d -> relu -> conv1d ------------------------^

Parameters:
  • in_query_channels (int) – Number of channels in the query network. Defaults to 80.

  • in_key_channels (int) – Number of channels in the key network. Defaults to 512.

  • attn_channels (int) – Number of inner channels in the attention layers. Defaults to 80.

  • temperature (float) – Temperature for the softmax. Defaults to 0.0005.

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import AlignmentNetwork
>>> aligner = AlignmentNetwork(
...     in_query_channels=80,
...     in_key_channels=512,
...     attn_channels=80,
...     temperature=0.0005,
... )
>>> phoneme_feats = torch.rand(2, 512, 20)
>>> mels = torch.rand(2, 80, 100)
>>> alignment_soft, alignment_logprob = aligner(mels, phoneme_feats, None, None)
>>> alignment_soft.shape, alignment_logprob.shape
(torch.Size([2, 1, 100, 20]), torch.Size([2, 1, 100, 20]))
forward(queries, keys, mask, attn_prior)[source]

Forward pass of the aligner encoder.

Parameters:
  • queries (torch.Tensor) – the query tensor [B, C, T_de]

  • keys (torch.Tensor) – the key tensor [B, C_emb, T_en]

  • mask (torch.Tensor) – the query mask [B, T_de]

  • attn_prior (torch.Tensor) – the prior attention tensor [B, 1, T_en, T_de]

Returns:

  • attn (torch.Tensor) – soft attention [B, 1, T_en, T_de]

  • attn_logp (torch.Tensor) – log probabilities [B, 1, T_en, T_de]

class speechbrain.lobes.models.FastSpeech2.FastSpeech2WithAlignment(enc_num_layers, enc_num_head, enc_d_model, enc_ffn_dim, enc_k_dim, enc_v_dim, enc_dropout, in_query_channels, in_key_channels, attn_channels, temperature, dec_num_layers, dec_num_head, dec_d_model, dec_ffn_dim, dec_k_dim, dec_v_dim, dec_dropout, normalize_before, ffn_type, ffn_cnn_kernel_size_list, n_char, n_mels, postnet_embedding_dim, postnet_kernel_size, postnet_n_convolutions, postnet_dropout, padding_idx, dur_pred_kernel_size, pitch_pred_kernel_size, energy_pred_kernel_size, variance_predictor_dropout)[source]

Bases: Module

The FastSpeech2 text-to-speech model with internal alignment. This class is the main entry point for the model, which is responsible for instantiating all submodules, which, in turn, manage the individual neural network layers. Certain parts are adapted from the following implementation: https://github.com/coqui-ai/TTS/blob/dev/TTS/tts/models/forward_tts.py

Simplified STRUCTURE: input -> token embedding -> encoder -> aligner -> duration/pitch/energy -> upsampler -> decoder -> output

Parameters:
  • enc_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in encoder

  • enc_num_head (int) – number of multi-head-attention (MHA) heads in encoder transformer layers

  • enc_d_model (int) – the number of expected features in the encoder

  • enc_ffn_dim (int) – the dimension of the feedforward network model

  • enc_k_dim (int) – the dimension of the key

  • enc_v_dim (int) – the dimension of the value

  • enc_dropout (float) – Dropout for the encoder

  • in_query_channels (int) – Number of channels in the query network.

  • in_key_channels (int) – Number of channels in the key network.

  • attn_channels (int) – Number of inner channels in the attention layers.

  • temperature (float) – Temperature for the softmax.

  • dec_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in decoder

  • dec_num_head (int) – number of multi-head-attention (MHA) heads in decoder transformer layers

  • dec_d_model (int) – the number of expected features in the decoder

  • dec_ffn_dim (int) – the dimension of the feedforward network model

  • dec_k_dim (int) – the dimension of the key

  • dec_v_dim (int) – the dimension of the value

  • dec_dropout (float) – dropout for the decoder

  • normalize_before (bool) – whether normalization should be applied before or after MHA or FFN in Transformer layers.

  • ffn_type (str) – whether to use convolutional layers ('1dcnn') instead of a feed-forward network inside the transformer layers

  • ffn_cnn_kernel_size_list (list of int) – conv kernel size of 2 1d-convs if ffn_type is 1dcnn

  • n_char (int) – the number of symbols for the token embedding

  • n_mels (int) – number of bins in mel spectrogram

  • postnet_embedding_dim (int) – output feature dimension for convolution layers

  • postnet_kernel_size (int) – postnet convolution kernel size

  • postnet_n_convolutions (int) – number of convolution layers

  • postnet_dropout (float) – dropout probability for postnet

  • padding_idx (int) – the index for padding

  • dur_pred_kernel_size (int) – the convolution kernel size in duration predictor

  • pitch_pred_kernel_size (int) – kernel size for pitch prediction.

  • energy_pred_kernel_size (int) – kernel size for energy prediction.

  • variance_predictor_dropout (float) – dropout probability for variance predictor (duration/pitch/energy)

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import FastSpeech2WithAlignment
>>> model = FastSpeech2WithAlignment(
...    enc_num_layers=6,
...    enc_num_head=2,
...    enc_d_model=384,
...    enc_ffn_dim=1536,
...    enc_k_dim=384,
...    enc_v_dim=384,
...    enc_dropout=0.1,
...    in_query_channels=80,
...    in_key_channels=384,
...    attn_channels=80,
...    temperature=0.0005,
...    dec_num_layers=6,
...    dec_num_head=2,
...    dec_d_model=384,
...    dec_ffn_dim=1536,
...    dec_k_dim=384,
...    dec_v_dim=384,
...    dec_dropout=0.1,
...    normalize_before=False,
...    ffn_type='1dcnn',
...    ffn_cnn_kernel_size_list=[9, 1],
...    n_char=40,
...    n_mels=80,
...    postnet_embedding_dim=512,
...    postnet_kernel_size=5,
...    postnet_n_convolutions=5,
...    postnet_dropout=0.5,
...    padding_idx=0,
...    dur_pred_kernel_size=3,
...    pitch_pred_kernel_size=3,
...    energy_pred_kernel_size=3,
...    variance_predictor_dropout=0.5)
>>> inputs = torch.tensor([
...     [13, 12, 31, 14, 19],
...     [31, 16, 30, 31, 0],
... ])
>>> mels = torch.rand(2, 100, 80)
>>> mel_post, postnet_output, durations, predict_pitch, avg_pitch, predict_energy, avg_energy, mel_lens, alignment_durations, alignment_soft, alignment_logprob, alignment_mas = model(inputs, mels)
>>> mel_post.shape, durations.shape
(torch.Size([2, 100, 80]), torch.Size([2, 5]))
>>> predict_pitch.shape, predict_energy.shape
(torch.Size([2, 5, 1]), torch.Size([2, 5, 1]))
>>> alignment_soft.shape, alignment_mas.shape
(torch.Size([2, 100, 5]), torch.Size([2, 100, 5]))
forward(tokens, mel_spectograms=None, pitch=None, energy=None, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

forward pass for training and inference

Parameters:
  • tokens (torch.Tensor) – batch of input tokens

  • mel_spectograms (torch.Tensor) – batch of mel_spectograms (used only for training)

  • pitch (torch.Tensor) – batch of pitch for each frame. If it is None, the model will infer on predicted pitches

  • energy (torch.Tensor) – batch of energy for each frame. If it is None, the model will infer on predicted energies

  • pace (float) – scaling factor for durations

  • pitch_rate (float) – scaling factor for pitches

  • energy_rate (float) – scaling factor for energies

Returns:

  • mel_post (torch.Tensor) – mel outputs from the decoder

  • postnet_output (torch.Tensor) – mel outputs from the postnet

  • predict_durations (torch.Tensor) – predicted durations of each token

  • predict_pitch (torch.Tensor) – predicted pitches of each token

  • avg_pitch (torch.Tensor) – target pitches for each token if the input pitch is not None; None otherwise

  • predict_energy (torch.Tensor) – predicted energies of each token

  • avg_energy (torch.Tensor) – target energies for each token if the input energy is not None; None otherwise

  • mel_length (torch.Tensor) – predicted lengths of the mel spectrograms

  • alignment_durations (torch.Tensor) – durations from the hard alignment map

  • alignment_soft (torch.Tensor) – soft alignment potentials

  • alignment_logprob (torch.Tensor) – log scale alignment potentials

  • alignment_mas (torch.Tensor) – hard alignment map

class speechbrain.lobes.models.FastSpeech2.LossWithAlignment(log_scale_durations, ssim_loss_weight, duration_loss_weight, pitch_loss_weight, energy_loss_weight, mel_loss_weight, postnet_mel_loss_weight, aligner_loss_weight, binary_alignment_loss_weight, binary_alignment_loss_warmup_epochs, binary_alignment_loss_max_epochs)[source]

Bases: Module

Loss computation including internal aligner

Parameters:
  • log_scale_durations (bool) – applies logarithm to target durations

  • ssim_loss_weight (float) – weight for the ssim loss

  • duration_loss_weight (float) – weight for the duration loss

  • pitch_loss_weight (float) – weight for the pitch loss

  • energy_loss_weight (float) – weight for the energy loss

  • mel_loss_weight (float) – weight for the mel loss

  • postnet_mel_loss_weight (float) – weight for the postnet mel loss

  • aligner_loss_weight (float) – weight for the alignment loss

  • binary_alignment_loss_weight (float) – weight for the binary alignment loss

  • binary_alignment_loss_warmup_epochs (int) – Number of epochs to gradually increase the impact of binary loss.

  • binary_alignment_loss_max_epochs (int) – From this epoch on the impact of binary loss is ignored.

forward(predictions, targets, current_epoch)[source]

Computes the value of the loss function and updates stats

Parameters:
  • predictions (tuple) – model predictions

  • targets (tuple) – ground truth data

  • current_epoch (int) – used to determine the start/end of the binary alignment loss

Returns:

loss – the loss value

Return type:

torch.Tensor

class speechbrain.lobes.models.FastSpeech2.ForwardSumLoss(blank_logprob=-1)[source]

Bases: Module

CTC alignment loss

Parameters:

blank_logprob (float) – log probability used as the pad value for the blank token, -1 by default

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import ForwardSumLoss
>>> loss_func = ForwardSumLoss()
>>> attn_logprob = torch.rand(2, 1, 100, 5)
>>> key_lens = torch.tensor([5, 5])
>>> query_lens = torch.tensor([100, 100])
>>> loss = loss_func(attn_logprob, key_lens, query_lens)
forward(attn_logprob, key_lens, query_lens)[source]
Parameters:
  • attn_logprob (torch.Tensor) – log scale alignment potentials [B, 1, query_len, key_len]

  • key_lens (torch.Tensor) – lengths of the key sequences (text)

  • query_lens (torch.Tensor) – lengths of the query sequences (spectrogram frames)

Returns:

total_loss

Return type:

torch.Tensor

class speechbrain.lobes.models.FastSpeech2.BinaryAlignmentLoss[source]

Bases: Module

Binary loss that forces soft alignments to match the hard alignments, as explained in https://arxiv.org/pdf/2108.10447.pdf.

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import BinaryAlignmentLoss
>>> loss_func = BinaryAlignmentLoss()
>>> alignment_hard = torch.randint(0, 2, (2, 100, 5))
>>> alignment_soft = torch.rand(2, 100, 5)
>>> loss = loss_func(alignment_hard, alignment_soft)
forward(alignment_hard, alignment_soft)[source]

Parameters:
  • alignment_hard (torch.Tensor) – hard alignment map [B, mel_lens, phoneme_lens]

  • alignment_soft (torch.Tensor) – soft alignment potentials [B, mel_lens, phoneme_lens]