speechbrain.lobes.models.HifiGAN module

Neural network modules for the HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

For more details: https://arxiv.org/pdf/2010.05646.pdf

Authors
  • Jarod Duret 2021

  • Yingzhi WANG 2022

Summary

Classes:

DiscriminatorLoss

Creates a summary of discriminator losses

DiscriminatorP

HiFiGAN Periodic Discriminator. Takes every Pth value from the input waveform and applies a stack of convolutions. Note: if period is 2, waveform = [1, 2, 3, 4, 5, 6 ...] --> [1, 3, 5 ...] --> convs --> score, feat.

DiscriminatorS

HiFiGAN Scale Discriminator.

GeneratorLoss

Creates a summary of generator losses and applies weights for different losses

HifiganDiscriminator

HiFiGAN discriminator wrapping MPD and MSD.

HifiganGenerator

HiFiGAN Generator with Multi-Receptive Field Fusion (MRF)

L1SpecLoss

L1 loss over spectrograms, as described in the HiFiGAN paper (https://arxiv.org/pdf/2010.05646.pdf). Note: L1 loss helps with learning details compared with L2 loss.

MSEDLoss

Mean Squared Discriminator Loss. The discriminator is trained to classify ground-truth samples as 1 and samples synthesized by the generator as 0.

MSEGLoss

Mean Squared Generator Loss. The generator is trained to fool the discriminator by improving sample quality until its samples are classified with a value close to 1.

MelganFeatureLoss

Calculates the feature matching loss, which is a learned similarity metric measured by the difference in features of the discriminator between a ground truth sample and a generated sample (Larsen et al., 2016, Kumar et al., 2019).

MultiPeriodDiscriminator

HiFiGAN Multi-Period Discriminator (MPD). Wrapper that applies the periodic discriminator (DiscriminatorP) with different periods.

MultiScaleDiscriminator

HiFiGAN Multi-Scale Discriminator.

MultiScaleSTFTLoss

Multi-scale STFT loss.

ResBlock1

Residual Block Type 1, which has 3 convolutional layers in each convolution block.

ResBlock2

Residual Block Type 2, which has 2 convolutional layers in each convolution block.

STFTLoss

STFT loss.

UnitHifiganGenerator

Unit HiFiGAN Generator with Multi-Receptive Field Fusion (MRF)

VariancePredictor

Variance predictor inspired from FastSpeech2

Functions:

dynamic_range_compression

Dynamic range compression for audio signals

mel_spectogram

Calculates the mel spectrogram of a raw audio signal.

process_duration

Process a given batch of code to extract consecutive unique elements and their associated features.

stft

Computes the Fourier transform of short overlapping windows of the input.

Reference

speechbrain.lobes.models.HifiGAN.dynamic_range_compression(x, C=1, clip_val=1e-05)[source]

Dynamic range compression for audio signals
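
As a rough illustration, HiFi-GAN-style vocoders commonly implement this step as a clamped logarithm, log(max(x, clip_val) * C). The plain-Python sketch below assumes that formula; it is not taken from this module's source, and `compress` is a hypothetical name.

```python
import math

def compress(values, C=1, clip_val=1e-5):
    """Clamp each value from below at clip_val, scale by C, then take the natural log.

    Assumed formula (not confirmed against the module): log(max(x, clip_val) * C).
    """
    return [math.log(max(v, clip_val) * C) for v in values]

# Magnitudes spanning many orders of magnitude map to a much narrower range,
# and values below clip_val no longer produce log(0).
compressed = compress([0.0, 1e-5, 1.0, 100.0])
```

Under this assumption, clip_val sets the floor of the compressed range and C acts as a constant offset in the log domain.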

speechbrain.lobes.models.HifiGAN.mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, norm, mel_scale, compression, audio)[source]

Calculates the mel spectrogram of a raw audio signal.

Parameters:
  • sample_rate (int) – Sample rate of audio signal.

  • hop_length (int) – Length of hop between STFT windows.

  • win_length (int) – Window size.

  • n_fft (int) – Size of FFT.

  • n_mels (int) – Number of mel filterbanks.

  • f_min (float) – Minimum frequency.

  • f_max (float) – Maximum frequency.

  • power (float) – Exponent for the magnitude spectrogram.

  • normalized (bool) – Whether to normalize by magnitude after stft.

  • norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band

  • mel_scale (str) – Scale to use: “htk” or “slaney”.

  • compression (bool) – whether to do dynamic range compression

  • audio (torch.Tensor) – input audio signal

Returns:

mel – The mel spectrogram corresponding to the input audio.

Return type:

torch.Tensor

speechbrain.lobes.models.HifiGAN.process_duration(code, code_feat)[source]

Process a given batch of code to extract consecutive unique elements and their associated features.

Parameters:
  • code (torch.Tensor (batch, time)) – Tensor of code indices.

  • code_feat (torch.Tensor (batch, time, channel)) – Tensor of code features.

Returns:

  • uniq_code_feat_filtered (torch.Tensor (batch, time, channel)) – Features of consecutive unique codes.

  • mask (torch.Tensor (batch, time)) – Padding mask for the unique codes.

  • uniq_code_count (torch.Tensor (n)) – Count of unique codes.

Example

>>> code = torch.IntTensor([[40, 18, 18, 10]])
>>> code_feat = torch.rand([1, 4, 128])
>>> out_tensor, mask, uniq_code = process_duration(code, code_feat)
>>> out_tensor.shape
torch.Size([1, 1, 128])
>>> mask.shape
torch.Size([1, 1])
>>> uniq_code.shape
torch.Size([1])
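
The run-length step behind "consecutive unique elements" can be sketched in plain Python. This illustrates only the idea: `consecutive_runs` is a hypothetical helper, and the shapes in the example above (one retained position out of three runs) indicate that process_duration applies further filtering that is not reproduced here.

```python
from itertools import groupby

def consecutive_runs(code):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(code)]

# Same code sequence as in the example above.
runs = consecutive_runs([40, 18, 18, 10])
```

Each pair holds a code and the number of consecutive frames it spans, which is the kind of duration a duration predictor is typically trained on.
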
class speechbrain.lobes.models.HifiGAN.ResBlock1(channels, kernel_size=3, dilation=(1, 3, 5))[source]

Bases: Module

Residual Block Type 1, which has 3 convolutional layers in each convolution block.

Parameters:
  • channels (int) – number of hidden channels for the convolutional layers.

  • kernel_size (int) – size of the convolution filter in each layer.

  • dilation (tuple) – dilation value for each conv layer in a block.
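
A residual connection can only add the block's input to its output if the convolutions preserve the time dimension. The sketch below shows the standard stride-1 convolution arithmetic for the default kernel_size and dilation values, assuming length-preserving padding; the helper names are hypothetical and this is not the module's own code.

```python
def same_padding(kernel_size, dilation):
    """Padding per side that keeps a stride-1 dilated conv length-preserving (odd kernels)."""
    return dilation * (kernel_size - 1) // 2

def conv1d_out_len(length, kernel_size, dilation, padding):
    """Standard output-length formula for a stride-1 1-D convolution."""
    return length + 2 * padding - dilation * (kernel_size - 1)

dilations = (1, 3, 5)  # default dilation of ResBlock1
paddings = [same_padding(3, d) for d in dilations]
lengths = [conv1d_out_len(100, 3, d, p) for d, p in zip(dilations, paddings)]
```

With kernel_size=3 the required padding equals the dilation, and every layer maps 100 input steps to 100 output steps.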

forward(x)[source]

Returns the output of ResBlock1

Parameters:

x (torch.Tensor (batch, channel, time)) – input tensor.

Returns:

x – output of ResBlock1

Return type:

torch.Tensor

remove_weight_norm()[source]

Removes weight normalization for inference.

class speechbrain.lobes.models.HifiGAN.ResBlock2(channels, kernel_size=3, dilation=(1, 3))[source]

Bases: Module

Residual Block Type 2, which has 2 convolutional layers in each convolution block.

Parameters:
  • channels (int) – number of hidden channels for the convolutional layers.

  • kernel_size (int) – size of the convolution filter in each layer.

  • dilation (tuple) – dilation value for each conv layer in a block.

forward(x)[source]

Returns the output of ResBlock2

Parameters:

x (torch.Tensor (batch, channel, time)) – input tensor.

Returns:

x – output of ResBlock2

Return type:

torch.Tensor

remove_weight_norm()[source]

Removes weight normalization for inference.

class speechbrain.lobes.models.HifiGAN.HifiganGenerator(in_channels, out_channels, resblock_type, resblock_dilation_sizes, resblock_kernel_sizes, upsample_kernel_sizes, upsample_initial_channel, upsample_factors, inference_padding=5, cond_channels=0, conv_post_bias=True)[source]

Bases: Module

HiFiGAN Generator with Multi-Receptive Field Fusion (MRF)

Parameters:
  • in_channels (int) – number of input tensor channels.

  • out_channels (int) – number of output tensor channels.

  • resblock_type (str) – type of the ResBlock. ‘1’ or ‘2’.

  • resblock_dilation_sizes (List[List[int]]) – list of dilation values in each layer of a ResBlock.

  • resblock_kernel_sizes (List[int]) – list of kernel sizes for each ResBlock.

  • upsample_kernel_sizes (List[int]) – list of kernel sizes for each transposed convolution.

  • upsample_initial_channel (int) – number of channels for the first upsampling layer. This is divided by 2 for each consecutive upsampling layer.

  • upsample_factors (List[int]) – upsampling factors (stride) for each upsampling layer.

  • inference_padding (int) – constant padding applied to the input at inference time. Defaults to 5.

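  • cond_channels (int) – number of channels of the global conditioning input. Defaults to 0.

  • conv_post_bias (bool) – whether the final convolution layer uses a bias term. Defaults to True.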

Example

>>> inp_tensor = torch.rand([4, 80, 33])
>>> hifigan_generator= HifiganGenerator(
...    in_channels = 80,
...    out_channels = 1,
...    resblock_type = "1",
...    resblock_dilation_sizes = [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
...    resblock_kernel_sizes = [3, 7, 11],
...    upsample_kernel_sizes = [16, 16, 4, 4],
...    upsample_initial_channel = 512,
...    upsample_factors = [8, 8, 2, 2],
... )
>>> out_tensor = hifigan_generator(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 1, 8448])
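
The output length in the example follows from the upsampling factors: 33 input frames become 33 × 8 × 8 × 2 × 2 = 8448 samples. A small sketch of that arithmetic, assuming every upsampling layer scales the time axis by exactly its factor (`output_length` is a hypothetical helper):

```python
import math

def output_length(n_frames, upsample_factors):
    """Number of output samples if each layer multiplies the length by its factor."""
    return n_frames * math.prod(upsample_factors)

# Values from the example above: 33 frames, factors [8, 8, 2, 2].
n_samples = output_length(33, [8, 8, 2, 2])
```

The product of the factors (256 here) would normally match the hop length used to compute the input features, so that one frame expands to one hop of audio.
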
forward(x, g=None)[source]
Parameters:
  • x (torch.Tensor (batch, channel, time)) – feature input tensor.

  • g (torch.Tensor (batch, 1, time)) – global conditioning input tensor.

Returns:

o – The output tensor

Return type:

torch.Tensor

remove_weight_norm()[source]

Removes weight normalization for inference.

inference(c, padding=True)[source]

The inference function performs a padding and runs the forward method.

Parameters:
  • c (torch.Tensor (batch, channel, time)) – feature input tensor.

  • padding (bool) – Whether to apply padding before forward.

Return type:

See forward()

class speechbrain.lobes.models.HifiGAN.VariancePredictor(encoder_embed_dim, var_pred_hidden_dim, var_pred_kernel_size, var_pred_dropout)[source]

Bases: Module

Variance predictor inspired from FastSpeech2

Parameters:
  • encoder_embed_dim (int) – number of input tensor channels.

  • var_pred_hidden_dim (int) – size of hidden channels for the convolutional layers.

  • var_pred_kernel_size (int) – size of the convolution filter in each layer.

  • var_pred_dropout (float) – dropout probability of each layer.

Example

>>> inp_tensor = torch.rand([4, 80, 128])
>>> duration_predictor = VariancePredictor(
...    encoder_embed_dim = 128,
...    var_pred_hidden_dim = 128,
...    var_pred_kernel_size = 3,
...    var_pred_dropout = 0.5,
... )
>>> out_tensor = duration_predictor (inp_tensor)
>>> out_tensor.shape
torch.Size([4, 80])
forward(x)[source]
Parameters:

x (torch.Tensor (batch, time, channel)) – feature input tensor.

Return type:

Variance prediction

class speechbrain.lobes.models.HifiGAN.UnitHifiganGenerator(in_channels, out_channels, resblock_type, resblock_dilation_sizes, resblock_kernel_sizes, upsample_kernel_sizes, upsample_initial_channel, upsample_factors, inference_padding=5, cond_channels=0, conv_post_bias=True, num_embeddings=100, embedding_dim=128, duration_predictor=False, var_pred_hidden_dim=128, var_pred_kernel_size=3, var_pred_dropout=0.5)[source]

Bases: HifiganGenerator

Unit HiFiGAN Generator with Multi-Receptive Field Fusion (MRF)

Parameters:
  • in_channels (int) – number of input tensor channels.

  • out_channels (int) – number of output tensor channels.

  • resblock_type (str) – type of the ResBlock. ‘1’ or ‘2’.

  • resblock_dilation_sizes (List[List[int]]) – list of dilation values in each layer of a ResBlock.

  • resblock_kernel_sizes (List[int]) – list of kernel sizes for each ResBlock.

  • upsample_kernel_sizes (List[int]) – list of kernel sizes for each transposed convolution.

  • upsample_initial_channel (int) – number of channels for the first upsampling layer. This is divided by 2 for each consecutive upsampling layer.

  • upsample_factors (List[int]) – upsampling factors (stride) for each upsampling layer.

  • inference_padding (int) – constant padding applied to the input at inference time. Defaults to 5.

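  • cond_channels (int) – number of channels of the global conditioning input. Defaults to 0.

  • conv_post_bias (bool) – whether the final convolution layer uses a bias term. Defaults to True.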

  • num_embeddings (int) – size of the dictionary of embeddings.

  • embedding_dim (int) – size of each embedding vector.

  • duration_predictor (bool) – enable duration predictor module.

  • var_pred_hidden_dim (int) – size of hidden channels for the convolutional layers of the duration predictor.

  • var_pred_kernel_size (int) – size of the convolution filter in each layer of the duration predictor.

  • var_pred_dropout (float) – dropout probability of each layer in the duration predictor.

Example

>>> inp_tensor = torch.randint(0, 100, (4, 10))
>>> unit_hifigan_generator= UnitHifiganGenerator(
...    in_channels = 128,
...    out_channels = 1,
...    resblock_type = "1",
...    resblock_dilation_sizes = [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
...    resblock_kernel_sizes = [3, 7, 11],
...    upsample_kernel_sizes = [11, 8, 8, 4, 4],
...    upsample_initial_channel = 512,
...    upsample_factors = [5, 4, 4, 2, 2],
...    num_embeddings = 100,
...    embedding_dim = 128,
...    duration_predictor = True,
...    var_pred_hidden_dim = 128,
...    var_pred_kernel_size = 3,
...    var_pred_dropout = 0.5,
... )
>>> out_tensor, _ = unit_hifigan_generator(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 1, 3200])
forward(x, g=None)[source]
Parameters:
  • x (torch.Tensor (batch, time)) – feature input tensor.

  • g (torch.Tensor (batch, 1, time)) – global conditioning input tensor.

Returns:

  • See parent forward()

  • tuple of log_dur_pred and log_dur

inference(x)[source]

The inference function performs duration prediction and runs the forward method.

Parameters:

x (torch.Tensor (batch, time)) – feature input tensor.

Return type:

See parent forward()

class speechbrain.lobes.models.HifiGAN.DiscriminatorP(period, kernel_size=5, stride=3)[source]

Bases: Module

HiFiGAN Periodic Discriminator. Takes every Pth value from the input waveform and applies a stack of convolutions. Note:

if period is 2, waveform = [1, 2, 3, 4, 5, 6 ...] --> [1, 3, 5 ...] --> convs --> score, feat
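
The sub-sampling in the note can be sketched with plain slicing. This only illustrates how samples are grouped by period; how the module pads or reshapes the waveform before its convolutions is not shown, and `split_by_period` is a hypothetical helper.

```python
def split_by_period(waveform, period):
    """Return one sub-sequence per phase offset, each taking every `period`-th sample."""
    return [waveform[offset::period] for offset in range(period)]

# With period 2, the first sub-sequence is [1, 3, 5], as in the note.
phases = split_by_period([1, 2, 3, 4, 5, 6], period=2)
```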

Parameters:
  • period (int) – Takes every Pth value from input

  • kernel_size (int) – The size of the convolution kernel

  • stride (int) – The stride of the convolution kernel

forward(x)[source]
Parameters:

x (torch.Tensor (batch, 1, time)) – input waveform.

Returns:

  • x (torch.Tensor)

  • feat (list)

class speechbrain.lobes.models.HifiGAN.MultiPeriodDiscriminator[source]

Bases: Module

HiFiGAN Multi-Period Discriminator (MPD). Wrapper that applies the periodic discriminator (DiscriminatorP) with different periods. Periods are suggested to be prime numbers to reduce the overlap between the discriminators.
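
A toy way to see why prime periods reduce overlap is to compare which sample indices two periods visit. The sketch below considers only phase offset 0 on a short signal, and both helpers are hypothetical; it illustrates the idea rather than the discriminator itself.

```python
def sampled_indices(period, length):
    """Indices visited when taking every `period`-th sample, starting at 0."""
    return set(range(0, length, period))

def shared_fraction(p, q, length=60):
    """Fraction of the sparser index set that the other set also contains."""
    a, b = sampled_indices(p, length), sampled_indices(q, length)
    return len(a & b) / min(len(a), len(b))

overlap_2_4 = shared_fraction(2, 4)  # 4 is a multiple of 2
overlap_2_3 = shared_fraction(2, 3)  # 2 and 3 are coprime
```

With periods 2 and 4, every sample seen at period 4 is also seen at period 2, so the second view adds little. With coprime periods, only multiples of their product coincide.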

forward(x)[source]

Returns Multi-Period Discriminator scores and features

Parameters:

x (torch.Tensor (batch, 1, time)) – input waveform.

Returns:

  • scores (list)

  • feats (list)

class speechbrain.lobes.models.HifiGAN.DiscriminatorS(use_spectral_norm=False)[source]

Bases: Module

HiFiGAN Scale Discriminator. It is similar to MelganDiscriminator but with a specific architecture explained in the paper. SpeechBrain CNN wrappers are not used here because spectral_norm is not often used.

Parameters:

use_spectral_norm (bool) – if True switch to spectral norm instead of weight norm.

forward(x)[source]
Parameters:

x (torch.Tensor (batch, 1, time)) – input waveform.

Returns:

  • x (torch.Tensor)

  • feat (list)

class speechbrain.lobes.models.HifiGAN.MultiScaleDiscriminator[source]

Bases: Module

HiFiGAN Multi-Scale Discriminator. Similar to MultiScaleMelganDiscriminator but specially tailored for HiFiGAN as in the paper.

forward(x)[source]
Parameters:

x (torch.Tensor (batch, 1, time)) – input waveform.

Returns:

  • scores (list)

  • feats (list)

class speechbrain.lobes.models.HifiGAN.HifiganDiscriminator[source]

Bases: Module

HiFiGAN discriminator wrapping MPD and MSD.

Example

>>> inp_tensor = torch.rand([4, 1, 8192])
>>> hifigan_discriminator= HifiganDiscriminator()
>>> scores, feats = hifigan_discriminator(inp_tensor)
>>> len(scores)
8
>>> len(feats)
8
forward(x)[source]

Returns the scores of each discriminator together with the list of per-layer features from each discriminator.

Parameters:

x (torch.Tensor) – input waveform.

Returns:

  • scores (list)

  • feats (list)

speechbrain.lobes.models.HifiGAN.stft(x, n_fft, hop_length, win_length, window_fn='hann_window')[source]

Computes the Fourier transform of short overlapping windows of the input.

class speechbrain.lobes.models.HifiGAN.STFTLoss(n_fft, hop_length, win_length)[source]

Bases: Module

STFT loss. The generated and real input waveforms are converted to spectrograms, which are compared using L1 and spectral convergence losses. It comes from the ParallelWaveGAN paper (https://arxiv.org/pdf/1910.11480.pdf).
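
In the Parallel WaveGAN formulation, spectral convergence is the Frobenius norm of the magnitude difference divided by the Frobenius norm of the real magnitudes, and the magnitude loss is the mean absolute difference of log magnitudes. The sketch below applies those definitions to small hand-written magnitude matrices instead of real STFT output; the function names are hypothetical and the module's own implementation may differ in detail.

```python
import math

def spectral_convergence(mag_hat, mag):
    """||mag - mag_hat||_F / ||mag||_F over nested lists of magnitudes."""
    num = math.sqrt(sum((a - b) ** 2
                        for row_hat, row in zip(mag_hat, mag)
                        for a, b in zip(row_hat, row)))
    den = math.sqrt(sum(b ** 2 for row in mag for b in row))
    return num / den

def log_magnitude_l1(mag_hat, mag):
    """Mean absolute difference between log magnitudes."""
    diffs = [abs(math.log(a) - math.log(b))
             for row_hat, row in zip(mag_hat, mag)
             for a, b in zip(row_hat, row)]
    return sum(diffs) / len(diffs)

mag = [[1.0, 2.0], [2.0, 4.0]]      # toy "real" magnitudes
mag_hat = [[1.0, 2.0], [2.0, 3.0]]  # toy "generated" magnitudes, one bin off
loss_sc = spectral_convergence(mag_hat, mag)
loss_mag = log_magnitude_l1(mag_hat, mag)
```

Spectral convergence is dominated by high-energy bins, while the log-magnitude term weights quiet bins more evenly, which is why the two are usually combined.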

Parameters:
  • n_fft (int) – size of Fourier transform.

  • hop_length (int) – the distance between neighboring sliding window frames.

  • win_length (int) – the size of window frame and STFT filter.

forward(y_hat, y)[source]

Returns magnitude loss and spectral convergence loss

Parameters:
  • y_hat (torch.Tensor) – generated waveform tensor

  • y (torch.Tensor) – real waveform tensor

Returns:

  • loss_mag (torch.Tensor) – Magnitude loss

  • loss_sc (torch.Tensor) – Spectral convergence loss

class speechbrain.lobes.models.HifiGAN.MultiScaleSTFTLoss(n_ffts=(1024, 2048, 512), hop_lengths=(120, 240, 50), win_lengths=(600, 1200, 240))[source]

Bases: Module

Multi-scale STFT loss. The generated and real input waveforms are converted to spectrograms, which are compared using L1 and spectral convergence losses. It comes from the ParallelWaveGAN paper (https://arxiv.org/pdf/1910.11480.pdf).

forward(y_hat, y)[source]

Returns multi-scale magnitude loss and spectral convergence loss

Parameters:
  • y_hat (torch.Tensor) – generated waveform tensor

  • y (torch.Tensor) – real waveform tensor

Returns:

  • loss_mag (torch.Tensor) – Magnitude loss

  • loss_sc (torch.Tensor) – Spectral convergence loss

class speechbrain.lobes.models.HifiGAN.L1SpecLoss(sample_rate=22050, hop_length=256, win_length=24, n_mel_channels=80, n_fft=1024, n_stft=513, mel_fmin=0.0, mel_fmax=8000.0, mel_normalized=False, power=1.0, norm='slaney', mel_scale='slaney', dynamic_range_compression=True)[source]

Bases: Module

L1 loss over spectrograms, as described in the HiFiGAN paper (https://arxiv.org/pdf/2010.05646.pdf). Note: L1 loss helps with learning details compared with L2 loss.
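
One possible reading of the note: for errors smaller than 1, the squared error shrinks faster than the absolute error, so an L1 objective keeps more pressure on small deviations. The toy comparison below uses plain lists in place of mel spectrogram bins; it is an illustration under that assumption, not this class's implementation.

```python
def l1_loss(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def l2_loss(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

target = [1.0, 2.0, 3.0, 4.0]
pred = [1.1, 1.9, 3.1, 3.9]  # every bin is off by 0.1
small_l1 = l1_loss(pred, target)
small_l2 = l2_loss(pred, target)
```

Here the L1 value is about ten times the L2 value, so the remaining small errors still contribute a noticeable loss.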

Parameters:
  • sample_rate (int) – Sample rate of audio signal.

  • hop_length (int) – Length of hop between STFT windows.

  • win_length (int) – Window size.

  • n_mel_channels (int) – Number of mel filterbanks.

  • n_fft (int) – Size of FFT.

  • n_stft (int) – Size of STFT.

  • mel_fmin (float) – Minimum frequency.

  • mel_fmax (float) – Maximum frequency.

  • mel_normalized (bool) – Whether to normalize by magnitude after stft.

  • power (float) – Exponent for the magnitude spectrogram.

  • norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band

  • mel_scale (str) – Scale to use: “htk” or “slaney”.

  • dynamic_range_compression (bool) – whether to do dynamic range compression

forward(y_hat, y)[source]

Returns L1 Loss over Spectrograms

Parameters:
  • y_hat (torch.Tensor) – generated waveform tensor

  • y (torch.Tensor) – real waveform tensor

Returns:

loss_mag – L1 loss

Return type:

torch.Tensor

class speechbrain.lobes.models.HifiGAN.MSEGLoss(*args, **kwargs)[source]

Bases: Module

Mean Squared Generator Loss. The generator is trained to fool the discriminator by improving sample quality until its samples are classified with a value close to 1.
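
A least-squares GAN generator objective of this kind is usually written as the mean of (1 - D(G(s)))². The sketch below assumes that form and uses plain floats as discriminator scores; `generator_lsgan_loss` is a hypothetical name, and how the module aggregates over several discriminators is not shown.

```python
def generator_lsgan_loss(scores_fake):
    """Mean of (1 - score)^2 over the discriminator scores of generated samples."""
    return sum((1.0 - s) ** 2 for s in scores_fake) / len(scores_fake)

fooled = generator_lsgan_loss([1.0, 1.0])  # discriminator outputs 1 for every fake
caught = generator_lsgan_loss([0.0, 0.5])  # discriminator is mostly not fooled
```

The loss reaches zero only when every generated sample is scored as 1.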

forward(score_fake)[source]

Returns Generator GAN loss

Parameters:

score_fake (list) – discriminator scores of generated waveforms D(G(s))

Returns:

loss_fake – Generator loss

Return type:

torch.Tensor

class speechbrain.lobes.models.HifiGAN.MelganFeatureLoss[source]

Bases: Module

Calculates the feature matching loss, which is a learned similarity metric measured by the difference in features of the discriminator between a ground truth sample and a generated sample (Larsen et al., 2016, Kumar et al., 2019).
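
As a minimal sketch of the idea, the feature matching loss can be written as an L1 distance between paired discriminator feature maps, averaged over the maps. The version below treats each feature map as a flat list of floats; the name `feature_matching_loss` and the exact averaging scheme are assumptions rather than the module's confirmed behavior.

```python
def feature_matching_loss(fake_feats, real_feats):
    """Average over feature maps of the mean absolute difference between pairs."""
    per_map = [
        sum(abs(f - r) for f, r in zip(fake, real)) / len(fake)
        for fake, real in zip(fake_feats, real_feats)
    ]
    return sum(per_map) / len(per_map)

real = [[0.0, 1.0], [2.0, 2.0, 2.0]]  # toy features for a ground-truth sample
fake = [[0.5, 1.0], [2.0, 2.0, 5.0]]  # toy features for a generated sample
loss = feature_matching_loss(fake, real)
```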

forward(fake_feats, real_feats)[source]

Returns feature matching loss

Parameters:
  • fake_feats (list) – discriminator features of generated waveforms

  • real_feats (list) – discriminator features of ground-truth waveforms

Returns:

loss_feats – Feature matching loss

Return type:

torch.Tensor

class speechbrain.lobes.models.HifiGAN.MSEDLoss[source]

Bases: Module

Mean Squared Discriminator Loss. The discriminator is trained to classify ground-truth samples as 1 and samples synthesized by the generator as 0.
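
The targets described above correspond to the usual least-squares GAN discriminator objective: the mean of (1 - D(x))² on real samples plus the mean of D(G(s))² on generated ones. The sketch assumes that form with plain floats as scores; `discriminator_lsgan_loss` is a hypothetical name.

```python
def discriminator_lsgan_loss(scores_fake, scores_real):
    """Return (total, real, fake) least-squares discriminator losses."""
    loss_real = sum((1.0 - s) ** 2 for s in scores_real) / len(scores_real)
    loss_fake = sum(s ** 2 for s in scores_fake) / len(scores_fake)
    return loss_real + loss_fake, loss_real, loss_fake

loss_d, loss_real, loss_fake = discriminator_lsgan_loss(
    scores_fake=[0.5, 0.0],
    scores_real=[1.0, 0.5],
)
```

The tuple order mirrors the three values that forward is documented to return: total, real, and fake.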

forward(score_fake, score_real)[source]

Returns Discriminator GAN losses

Parameters:
  • score_fake (list) – discriminator scores of generated waveforms

  • score_real (list) – discriminator scores of ground-truth waveforms

Returns:

  • loss_d (torch.Tensor) – The total discriminator loss

  • loss_real (torch.Tensor) – The loss on real samples

  • loss_fake (torch.Tensor) – The loss on fake samples

class speechbrain.lobes.models.HifiGAN.GeneratorLoss(stft_loss=None, stft_loss_weight=0, mseg_loss=None, mseg_loss_weight=0, feat_match_loss=None, feat_match_loss_weight=0, l1_spec_loss=None, l1_spec_loss_weight=0, mseg_dur_loss=None, mseg_dur_loss_weight=0)[source]

Bases: Module

Creates a summary of generator losses and applies weights for different losses

Parameters:
  • stft_loss (object) – object of stft loss

  • stft_loss_weight (float) – weight of STFT loss

  • mseg_loss (object) – object of mseg loss

  • mseg_loss_weight (float) – weight of mseg loss

  • feat_match_loss (object) – object of feature match loss

  • feat_match_loss_weight (float) – weight of feature match loss

  • l1_spec_loss (object) – object of L1 spectrogram loss

  • l1_spec_loss_weight (float) – weight of L1 spectrogram loss

  • mseg_dur_loss (object) – object of duration loss

  • mseg_dur_loss_weight (float) – weight of duration loss

forward(stage, y_hat=None, y=None, scores_fake=None, feats_fake=None, feats_real=None, log_dur_pred=None, log_dur=None)[source]

Returns a dictionary of generator losses and applies weights

Parameters:
  • stage (sb.Stage) – One of TRAIN, VALID, or TEST

  • y_hat (torch.Tensor) – generated waveform tensor

  • y (torch.Tensor) – real waveform tensor

  • scores_fake (list) – discriminator scores of generated waveforms

  • feats_fake (list) – discriminator features of generated waveforms

  • feats_real (list) – discriminator features of ground-truth waveforms

  • log_dur_pred (torch.Tensor) – Predicted duration

  • log_dur (torch.Tensor) – Actual duration

Returns:

loss – The generator losses.

Return type:

dict
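
The weighting scheme can be pictured as scaling each enabled loss by its weight and summing the results. The sketch below uses hypothetical dictionary keys and arbitrary illustrative weights; the keys actually returned by forward and the weights used in any recipe are not shown here.

```python
def combine_losses(losses, weights):
    """Scale each named loss by its weight and add the sum under the key 'total'."""
    weighted = {name: weights[name] * value for name, value in losses.items()}
    weighted["total"] = sum(weighted.values())
    return weighted

summary = combine_losses(
    losses={"stft": 2.0, "mseg": 0.5, "feat_match": 0.25, "l1_spec": 1.0},
    # A weight of 0 (the default in the signature above) switches a term off.
    weights={"stft": 0.0, "mseg": 1.0, "feat_match": 2.0, "l1_spec": 4.0},
)
```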

class speechbrain.lobes.models.HifiGAN.DiscriminatorLoss(msed_loss=None)[source]

Bases: Module

Creates a summary of discriminator losses

Parameters:

msed_loss (object) – object of MSE discriminator loss

forward(scores_fake, scores_real)[source]

Returns a dictionary of discriminator losses

Parameters:
  • scores_fake (list) – discriminator scores of generated waveforms

  • scores_real (list) – discriminator scores of ground-truth waveforms

Returns:

loss – Contains the keys "D_mse_gan_loss", "D_mse_gan_real_loss", "D_mse_gan_fake_loss", and "D_loss".

Return type:

dict