speechbrain.lobes.models.HifiGAN module

Neural network modules for the HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

For more details: https://arxiv.org/pdf/2010.05646.pdf

Authors

Jarod Duret 2021
Yingzhi WANG 2022

Summary

Classes:

`DiscriminatorLoss`	Creates a summary of discriminator losses
`DiscriminatorP`	HiFiGAN Periodic Discriminator. Takes every Pth value from the input waveform and applies a stack of convolutions. Note: if period is 2, waveform = [1, 2, 3, 4, 5, 6 ...] --> [1, 3, 5 ...] --> convs --> score, feat.
`DiscriminatorS`	HiFiGAN Scale Discriminator.
`GeneratorLoss`	Creates a summary of generator losses and applies weights for different losses
`HifiganDiscriminator`	HiFiGAN discriminator wrapping MPD and MSD.
`HifiganGenerator`	HiFiGAN Generator with Multi-Receptive Field Fusion (MRF)
`L1SpecLoss`	L1 Loss over Spectrograms as described in the HiFiGAN paper (https://arxiv.org/pdf/2010.05646.pdf). Note: L1 loss helps learn details compared with L2 loss.
`MSEDLoss`	Mean Squared Discriminator Loss. The discriminator is trained to classify ground truth samples as 1 and samples synthesized by the generator as 0.
`MSEGLoss`	Mean Squared Generator Loss. The generator is trained to fool the discriminator by producing samples that the discriminator scores close to 1.
`MelganFeatureLoss`	Calculates the feature matching loss, which is a learned similarity metric measured by the difference in features of the discriminator between a ground truth sample and a generated sample (Larsen et al., 2016, Kumar et al., 2019).
`MultiPeriodDiscriminator`	HiFiGAN Multi-Period Discriminator (MPD). Wrapper that applies `DiscriminatorP` at several periods.
`MultiScaleDiscriminator`	HiFiGAN Multi-Scale Discriminator.
`MultiScaleSTFTLoss`	Multi-scale STFT loss.
`ResBlock1`	Residual Block Type 1, which has 3 convolutional layers in each convolution block.
`ResBlock2`	Residual Block Type 2, which has 2 convolutional layers in each convolution block.
`STFTLoss`	STFT loss.
`UnitHifiganGenerator`	Unit HiFiGAN Generator with Multi-Receptive Field Fusion (MRF)
`VariancePredictor`	Variance predictor inspired from FastSpeech2

Functions:

`dynamic_range_compression`	Dynamic range compression for audio signals
`mel_spectogram`	Calculates the mel spectrogram of a raw audio signal
`process_duration`	Process a given batch of code to extract consecutive unique elements and their associated features.
`stft`	Computes the Fourier transform of short overlapping windows of the input

Reference

speechbrain.lobes.models.HifiGAN.dynamic_range_compression(x, C=1, clip_val=1e-05)[source]: Dynamic range compression for audio signals
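A minimal scalar sketch of what dynamic range compression typically does in HiFi-GAN pipelines. It assumes the common formulation log(clamp(x, min=clip_val) * C), which is not stated above; `compress` is a hypothetical pure-Python stand-in for the tensor function.

```python
import math


def compress(x, C=1, clip_val=1e-5):
    """Hypothetical scalar stand-in for dynamic range compression.

    Assumes the formulation log(max(x, clip_val) * C).
    """
    # Clipping keeps the logarithm finite for silent (zero-valued) bins.
    return math.log(max(x, clip_val) * C)


unit = compress(1.0)    # a magnitude of 1 maps to 0
silent = compress(0.0)  # zero is clipped to clip_val before the log
```

Under this assumption, clip_val sets the floor of the compressed range and C acts as a gain applied before the logarithm.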

speechbrain.lobes.models.HifiGAN.mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, norm, mel_scale, compression, audio)[source]

Calculates the mel spectrogram of a raw audio signal.

Parameters:

sample_rate (int) – Sample rate of audio signal.
hop_length (int) – Length of hop between STFT windows.
win_length (int) – Window size.
n_fft (int) – Size of FFT.
n_mels (int) – Number of mel filterbanks.
f_min (float) – Minimum frequency.
f_max (float) – Maximum frequency.
power (float) – Exponent for the magnitude spectrogram.
normalized (bool) – Whether to normalize by magnitude after stft.
norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band.
mel_scale (str) – Scale to use: “htk” or “slaney”.
compression (bool) – Whether to apply dynamic range compression.
audio (torch.Tensor) – Input audio signal.

Returns:

mel – The mel spectrogram corresponding to the input audio.

Return type:

torch.Tensor

speechbrain.lobes.models.HifiGAN.process_duration(code, code_feat)[source]

Process a given batch of code to extract consecutive unique elements and their associated features.

Parameters:

code (torch.Tensor (batch, time)) – Tensor of code indices.
code_feat (torch.Tensor (batch, time, channel)) – Tensor of code features.

Returns:

uniq_code_feat_filtered (torch.Tensor (batch, time, channel)) – Features of consecutive unique codes.
mask (torch.Tensor (batch, time)) – Padding mask for the unique codes.
uniq_code_count (torch.Tensor (n)) – Count of unique codes.

Example

>>> import torch
>>> from speechbrain.lobes.models.HifiGAN import process_duration
>>> code = torch.IntTensor([[40, 18, 18, 10]])
>>> code_feat = torch.rand([1, 4, 128])
>>> out_tensor, mask, uniq_code = process_duration(code, code_feat)
>>> out_tensor.shape
torch.Size([1, 1, 128])
>>> mask.shape
torch.Size([1, 1])
>>> uniq_code.shape
torch.Size([1])
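The idea of "consecutive unique elements" can be sketched in pure Python. The shapes in the example above (three runs in, one time step out) are consistent with the first and last runs being discarded; the sketch assumes that behavior, and the helper names are hypothetical.

```python
from itertools import groupby


def run_lengths(codes):
    """Collapse consecutive repeats into (code, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(codes)]


def inner_runs(codes):
    """Assumption: drop the first and last runs when more than two exist."""
    runs = run_lengths(codes)
    return runs[1:-1] if len(runs) > 2 else runs


runs = run_lengths([40, 18, 18, 10])  # three runs: 40, 18 (twice), 10
kept = inner_runs([40, 18, 18, 10])   # only the middle run remains
```

Dropping boundary runs is a plausible choice when segments are cropped at random, since the first and last runs may be truncated.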

class speechbrain.lobes.models.HifiGAN.ResBlock1(channels, kernel_size=3, dilation=(1, 3, 5))[source]

Bases: Module

Residual Block Type 1, which has 3 convolutional layers in each convolution block.

Parameters:

channels (int) – number of hidden channels for the convolutional layers.
kernel_size (int) – size of the convolution filter in each layer.
dilation (tuple) – dilation values, one for each conv layer in a block.
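Because the block adds its input back to its output, each dilated convolution has to preserve the time dimension. A short sketch of the arithmetic, assuming the padding rule (kernel_size * dilation - dilation) // 2 used by the original HiFi-GAN implementation; the helper names are hypothetical.

```python
def same_padding(kernel_size, dilation):
    """Padding that keeps the length unchanged for an odd kernel (assumed rule)."""
    return (kernel_size * dilation - dilation) // 2


def conv_out_length(length, kernel_size, dilation, padding, stride=1):
    """Standard output-length formula for a 1D convolution."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


paddings = [same_padding(3, d) for d in (1, 3, 5)]
lengths = [conv_out_length(100, 3, d, same_padding(3, d)) for d in (1, 3, 5)]
```

With kernel_size=3 and the default dilations (1, 3, 5), every layer maps 100 time steps to 100 time steps, so the residual sum is well defined.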

forward(x)[source]

Returns the output of ResBlock1

Parameters: x (torch.Tensor (batch, channel, time)) – input tensor.
Returns: x – output of ResBlock1
Return type: torch.Tensor

remove_weight_norm()[source]: Removes weight normalization; intended for use at inference time.

class speechbrain.lobes.models.HifiGAN.ResBlock2(channels, kernel_size=3, dilation=(1, 3))[source]

Bases: Module

Residual Block Type 2, which has 2 convolutional layers in each convolution block.

Parameters:

channels (int) – number of hidden channels for the convolutional layers.
kernel_size (int) – size of the convolution filter in each layer.
dilation (tuple) – dilation values, one for each conv layer in a block.

forward(x)[source]

Returns the output of ResBlock2

Parameters: x (torch.Tensor (batch, channel, time)) – input tensor.
Returns: x – output of ResBlock2
Return type: torch.Tensor

remove_weight_norm()[source]: Removes weight normalization; intended for use at inference time.

class speechbrain.lobes.models.HifiGAN.HifiganGenerator(in_channels, out_channels, resblock_type, resblock_dilation_sizes, resblock_kernel_sizes, upsample_kernel_sizes, upsample_initial_channel, upsample_factors, inference_padding=5, cond_channels=0, conv_post_bias=True)[source]

Bases: Module

HiFiGAN Generator with Multi-Receptive Field Fusion (MRF)

Parameters:

in_channels (int) – number of input tensor channels.
out_channels (int) – number of output tensor channels.
resblock_type (str) – type of the ResBlock. ‘1’ or ‘2’.
resblock_dilation_sizes (List[List[int]]) – list of dilation values in each layer of a ResBlock.
resblock_kernel_sizes (List[int]) – list of kernel sizes for each ResBlock.
upsample_kernel_sizes (List[int]) – list of kernel sizes for each transposed convolution.
upsample_initial_channel (int) – number of channels for the first upsampling layer. This is divided by 2 for each consecutive upsampling layer.
upsample_factors (List[int]) – upsampling factors (stride) for each upsampling layer.
inference_padding (int) – constant padding applied to the input at inference time. Defaults to 5.
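cond_channels (int) – if greater than 0, adds a convolutional layer for the conditioning input g at the start of forward. Defaults to 0.
conv_post_bias (bool) – whether to add a bias term to the final convolution. Defaults to True.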

Example

>>> import torch
>>> from speechbrain.lobes.models.HifiGAN import HifiganGenerator
>>> inp_tensor = torch.rand([4, 80, 33])
>>> hifigan_generator = HifiganGenerator(
...    in_channels = 80,
...    out_channels = 1,
...    resblock_type = "1",
...    resblock_dilation_sizes = [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
...    resblock_kernel_sizes = [3, 7, 11],
...    upsample_kernel_sizes = [16, 16, 4, 4],
...    upsample_initial_channel = 512,
...    upsample_factors = [8, 8, 2, 2],
... )
>>> out_tensor = hifigan_generator(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 1, 8448])
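The output length in the example above follows from the upsampling factors: 33 input frames times 8 * 8 * 2 * 2 gives 8448 samples. The sketch below traces this layer by layer. It assumes each transposed convolution uses padding = (kernel_size - stride) // 2, as in the original HiFi-GAN implementation, and that the channel count halves at every layer as the parameter description states; the helper name is hypothetical.

```python
import math


def transposed_conv_out_length(length, kernel_size, stride):
    """Output length of a 1D transposed convolution.

    Assumes padding = (kernel_size - stride) // 2, no output padding,
    and no dilation.
    """
    padding = (kernel_size - stride) // 2
    return (length - 1) * stride - 2 * padding + kernel_size


upsample_kernel_sizes = [16, 16, 4, 4]
upsample_factors = [8, 8, 2, 2]

length = 33  # number of input frames in the example above
for k, s in zip(upsample_kernel_sizes, upsample_factors):
    length = transposed_conv_out_length(length, k, s)

# Overall upsampling factor and per-layer output channels (512 halved each time).
total_factor = math.prod(upsample_factors)
channels = [512 // 2 ** (i + 1) for i in range(len(upsample_factors))]
```

For a mel-spectrogram vocoder, the product of upsample_factors would normally equal the hop length used to compute the input features (256 here).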

forward(x, g=None)[source]

Parameters:

x (torch.Tensor (batch, channel, time)) – feature input tensor.
g (torch.Tensor (batch, 1, time)) – global conditioning input tensor.

Returns:

o – The output tensor

Return type:

torch.Tensor

remove_weight_norm()[source]: Removes weight normalization; intended for use at inference time.

inference(c, padding=True)[source]

The inference function performs a padding and runs the forward method.

Parameters:

c (torch.Tensor (batch, channel, time)) – feature input tensor.
padding (bool) – Whether to apply padding before forward.

Return type:

See forward()

class speechbrain.lobes.models.HifiGAN.VariancePredictor(encoder_embed_dim, var_pred_hidden_dim, var_pred_kernel_size, var_pred_dropout)[source]

Bases: Module

Variance predictor inspired from FastSpeech2

Parameters:

encoder_embed_dim (int) – number of input tensor channels.
var_pred_hidden_dim (int) – size of hidden channels for the convolutional layers.
var_pred_kernel_size (int) – size of the convolution filter in each layer.
var_pred_dropout (float) – dropout probability of each layer.

Example

>>> import torch
>>> from speechbrain.lobes.models.HifiGAN import VariancePredictor
>>> inp_tensor = torch.rand([4, 80, 128])
>>> duration_predictor = VariancePredictor(
...    encoder_embed_dim = 128,
...    var_pred_hidden_dim = 128,
...    var_pred_kernel_size = 3,
...    var_pred_dropout = 0.5,
... )
>>> out_tensor = duration_predictor(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 80])

forward(x)[source]

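Parameters: x (torch.Tensor (batch, time, channel)) – feature input tensor; the example above passes a (4, 80, 128) tensor with encoder_embed_dim = 128.
Returns: Variance prediction, one value per time step (batch, time).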

class speechbrain.lobes.models.HifiGAN.UnitHifiganGenerator(in_channels, out_channels, resblock_type, resblock_dilation_sizes, resblock_kernel_sizes, upsample_kernel_sizes, upsample_initial_channel, upsample_factors, inference_padding=5, cond_channels=0, conv_post_bias=True, num_embeddings=100, embedding_dim=128, duration_predictor=False, var_pred_hidden_dim=128, var_pred_kernel_size=3, var_pred_dropout=0.5)[source]

Bases: HifiganGenerator

Unit HiFiGAN Generator with Multi-Receptive Field Fusion (MRF)

Parameters:

in_channels (int) – number of input tensor channels.
out_channels (int) – number of output tensor channels.
resblock_type (str) – type of the ResBlock. ‘1’ or ‘2’.
resblock_dilation_sizes (List[List[int]]) – list of dilation values in each layer of a ResBlock.
resblock_kernel_sizes (List[int]) – list of kernel sizes for each ResBlock.
upsample_kernel_sizes (List[int]) – list of kernel sizes for each transposed convolution.
upsample_initial_channel (int) – number of channels for the first upsampling layer. This is divided by 2 for each consecutive upsampling layer.
upsample_factors (List[int]) – upsampling factors (stride) for each upsampling layer.
inference_padding (int) – constant padding applied to the input at inference time. Defaults to 5.
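cond_channels (int) – if greater than 0, adds a convolutional layer for the conditioning input g at the start of forward. Defaults to 0.
conv_post_bias (bool) – whether to add a bias term to the final convolution. Defaults to True.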
num_embeddings (int) – size of the dictionary of embeddings.
embedding_dim (int) – size of each embedding vector.
duration_predictor (bool) – enable duration predictor module.
var_pred_hidden_dim (int) – size of hidden channels for the convolutional layers of the duration predictor.
var_pred_kernel_size (int) – size of the convolution filter in each layer of the duration predictor.
var_pred_dropout (float) – dropout probability of each layer in the duration predictor.

Example

>>> import torch
>>> from speechbrain.lobes.models.HifiGAN import UnitHifiganGenerator
>>> inp_tensor = torch.randint(0, 100, (4, 10))
>>> unit_hifigan_generator = UnitHifiganGenerator(
...    in_channels = 128,
...    out_channels = 1,
...    resblock_type = "1",
...    resblock_dilation_sizes = [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
...    resblock_kernel_sizes = [3, 7, 11],
...    upsample_kernel_sizes = [11, 8, 8, 4, 4],
...    upsample_initial_channel = 512,
...    upsample_factors = [5, 4, 4, 2, 2],
...    num_embeddings = 100,
...    embedding_dim = 128,
...    duration_predictor = True,
...    var_pred_hidden_dim = 128,
...    var_pred_kernel_size = 3,
...    var_pred_dropout = 0.5,
... )
>>> out_tensor, _ = unit_hifigan_generator(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 1, 3200])

forward(x, g=None)[source]

Parameters:

x (torch.Tensor (batch, time)) – feature input tensor.
g (torch.Tensor (batch, 1, time)) – global conditioning input tensor.

Returns:

See parent forward()
tuple of log_dur_pred and log_dur

inference(x)[source]

The inference function performs duration prediction and runs the forward method.

Parameters: x (torch.Tensor (batch, time)) – feature input tensor.
Return type: See parent forward()
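At inference time, predicted durations are typically turned into integer frame counts and each unit is repeated accordingly. The sketch below illustrates that step in pure Python. It assumes durations are stored as log(d + 1) and that every unit is kept for at least one frame; both are assumptions, and `expand_units` is a hypothetical helper, not part of this module.

```python
import math


def expand_units(units, log_durations):
    """Repeat each unit by its predicted duration (hypothetical helper).

    Assumes durations are stored as log(d + 1), so d = exp(log_d) - 1,
    rounded to the nearest integer with a floor of 1 frame.
    """
    expanded = []
    for unit, log_d in zip(units, log_durations):
        frames = max(1, round(math.exp(log_d) - 1))
        expanded.extend([unit] * frames)
    return expanded


# Predicted log-durations for 1 frame, 2 frames, and (below the floor) 0 frames.
expanded = expand_units([40, 18, 10], [math.log(2), math.log(3), 0.0])
```

Here the deduplicated sequence [40, 18, 10] expands back to [40, 18, 18, 10].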

class speechbrain.lobes.models.HifiGAN.DiscriminatorP(period, kernel_size=5, stride=3)[source]

Bases: Module

HiFiGAN Periodic Discriminator. Takes every Pth value from the input waveform and applies a stack of convolutions.

Note: if period is 2, waveform = [1, 2, 3, 4, 5, 6 ...] --> [1, 3, 5 ...] --> convs --> score, feat

Parameters:

period (int) – Takes every Pth value from input
kernel_size (int) – The size of the convolution kernel
stride (int) – The stride of the convolution kernel

forward(x)[source]

Parameters:

x (torch.Tensor (batch, 1, time)) – input waveform.

Returns:

x (torch.Tensor)
feat (list)
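Taking every Pth value is usually implemented by folding the 1D waveform into rows of length `period`, so that each column holds samples spaced `period` apart. A pure-Python sketch of that folding follows; reflect padding on the right for lengths that are not a multiple of the period is an assumption, and `fold_by_period` is a hypothetical helper.

```python
def fold_by_period(waveform, period):
    """Arrange a 1D waveform into rows of `period` samples.

    Assumes reflect padding on the right when the length is not a
    multiple of the period.
    """
    n_pad = (period - len(waveform) % period) % period
    # Reflect padding mirrors the samples that precede the last one.
    padded = waveform + waveform[-2:-2 - n_pad:-1]
    return [padded[i:i + period] for i in range(0, len(padded), period)]


rows = fold_by_period([1, 2, 3, 4, 5, 6], period=2)
# Column 0 holds every 2nd sample, starting at the first one.
first_column = [row[0] for row in rows]
# An odd-length input gets one mirrored sample appended.
padded_rows = fold_by_period([1, 2, 3, 4, 5], period=2)
```

After folding, 2D convolutions with a kernel width of 1 along the period axis process each column independently.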

class speechbrain.lobes.models.HifiGAN.MultiPeriodDiscriminator[source]

Bases: Module

HiFiGAN Multi-Period Discriminator (MPD). Wrapper that applies DiscriminatorP at several periods. Periods are suggested to be prime numbers to reduce the overlap between the discriminators.
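One way to see why prime periods reduce overlap is to count the sample positions that every sub-discriminator visits in its first column. The sketch below uses the periods from the HiFi-GAN paper, (2, 3, 5, 7, 11); whether this class uses exactly those values is an assumption, and the helper names are hypothetical.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def shared_positions(periods, n_samples):
    """Count sample indices visited by the first column of every period."""
    step = reduce(lcm, periods)
    return len(range(0, n_samples, step))


prime_overlap = shared_positions([2, 3, 5, 7, 11], 8192)
power_of_two_overlap = shared_positions([2, 4, 8, 16, 32], 8192)
```

With prime periods only 4 of 8192 positions are shared by all five views, against 256 for powers of two, so each sub-discriminator sees a largely distinct slice of the signal.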

forward(x)[source]

Returns Multi-Period Discriminator scores and features

Parameters:

x (torch.Tensor (batch, 1, time)) – input waveform.

Returns:

scores (list)
feats (list)

class speechbrain.lobes.models.HifiGAN.DiscriminatorS(use_spectral_norm=False)[source]

Bases: Module

HiFiGAN Scale Discriminator. It is similar to MelganDiscriminator but with a specific architecture explained in the paper. SpeechBrain CNN wrappers are not used here because spectral_norm is not often used

Parameters: use_spectral_norm (bool) – if True switch to spectral norm instead of weight norm.

forward(x)[source]

Parameters:

x (torch.Tensor (batch, 1, time)) – input waveform.

Returns:

x (torch.Tensor)
feat (list)

class speechbrain.lobes.models.HifiGAN.MultiScaleDiscriminator[source]

Bases: Module

HiFiGAN Multi-Scale Discriminator. Similar to MultiScaleMelganDiscriminator but specially tailored for HiFiGAN as in the paper.

forward(x)[source]

Parameters:

x (torch.Tensor (batch, 1, time)) – input waveform.

Returns:

scores (list)
feats (list)

class speechbrain.lobes.models.HifiGAN.HifiganDiscriminator[source]

Bases: Module

HiFiGAN discriminator wrapping MPD and MSD.

Example

>>> import torch
>>> from speechbrain.lobes.models.HifiGAN import HifiganDiscriminator
>>> inp_tensor = torch.rand([4, 1, 8192])
>>> hifigan_discriminator = HifiganDiscriminator()
>>> scores, feats = hifigan_discriminator(inp_tensor)
>>> len(scores)
8
>>> len(feats)
8

forward(x)[source]

Returns the scores and the list of lists of features from each layer of each discriminator.

Parameters:

x (torch.Tensor) – input waveform.

Returns:

scores (list)
feats (list)

speechbrain.lobes.models.HifiGAN.stft(x, n_fft, hop_length, win_length, window_fn='hann_window')[source]: computes the Fourier transform of short overlapping windows of the input

class speechbrain.lobes.models.HifiGAN.STFTLoss(n_fft, hop_length, win_length)[source]

Bases: Module

STFT loss. The generated and real waveforms are converted to spectrograms, which are compared with L1 and spectral convergence losses. It comes from the Parallel WaveGAN paper: https://arxiv.org/pdf/1910.11480.pdf

Parameters:

n_fft (int) – size of Fourier transform.
hop_length (int) – the distance between neighboring sliding window frames.
win_length (int) – the size of window frame and STFT filter.

forward(y_hat, y)[source]

Returns magnitude loss and spectral convergence loss

Parameters:

y_hat (torch.Tensor) – generated waveform tensor
y (torch.Tensor) – real waveform tensor

Returns:

loss_mag (torch.Tensor) – Magnitude loss
loss_sc (torch.Tensor) – Spectral convergence loss
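The two returned terms can be sketched on plain nested lists of magnitudes. The definitions follow the Parallel WaveGAN paper (spectral convergence as a relative Frobenius norm, magnitude loss as an L1 distance between log-magnitudes); that this class matches them term for term is an assumption, and `stft_losses` is a hypothetical helper.

```python
import math


def stft_losses(mag_hat, mag):
    """Magnitude and spectral convergence losses on 2D magnitude lists.

    mag_hat is the generated spectrogram and mag the reference one.
    """
    flat_hat = [v for row in mag_hat for v in row]
    flat = [v for row in mag for v in row]
    # Mean absolute difference of log-magnitudes.
    loss_mag = sum(
        abs(math.log(a) - math.log(b)) for a, b in zip(flat, flat_hat)
    ) / len(flat)
    # Frobenius norm of the difference, relative to the reference norm.
    diff_norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(flat, flat_hat)))
    loss_sc = diff_norm / math.sqrt(sum(a ** 2 for a in flat))
    return loss_mag, loss_sc


real = [[3.0, 4.0]]
same_mag, same_sc = stft_losses(real, real)        # identical inputs give zeros
off_mag, off_sc = stft_losses([[3.0, 2.0]], real)  # one bin is halved
```

The spectral convergence term emphasizes large-magnitude bins, while the log-magnitude term is more sensitive to errors in quiet regions.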

class speechbrain.lobes.models.HifiGAN.MultiScaleSTFTLoss(n_ffts=(1024, 2048, 512), hop_lengths=(120, 240, 50), win_lengths=(600, 1200, 240))[source]

Bases: Module

Multi-scale STFT loss. The generated and real waveforms are converted to spectrograms at several resolutions, which are compared with L1 and spectral convergence losses. It comes from the Parallel WaveGAN paper: https://arxiv.org/pdf/1910.11480.pdf

forward(y_hat, y)[source]

Returns multi-scale magnitude loss and spectral convergence loss

Parameters:

y_hat (torch.Tensor) – generated waveform tensor
y (torch.Tensor) – real waveform tensor

Returns:

loss_mag (torch.Tensor) – Magnitude loss
loss_sc (torch.Tensor) – Spectral convergence loss
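To give a feel for the default resolutions, the sketch below computes the spectrogram shape each one would produce for an 8192-sample waveform. It assumes a one-sided STFT with centered framing (the default behavior of torch.stft), which may not match this implementation exactly; `spectrogram_shape` is a hypothetical helper.

```python
def spectrogram_shape(n_samples, n_fft, hop_length):
    """(frequency bins, frames) of a one-sided, centered STFT (assumed)."""
    return n_fft // 2 + 1, 1 + n_samples // hop_length


n_ffts = (1024, 2048, 512)
hop_lengths = (120, 240, 50)

shapes = [
    spectrogram_shape(8192, n_fft, hop)
    for n_fft, hop in zip(n_ffts, hop_lengths)
]
```

The three settings trade frequency resolution against time resolution, so the loss penalizes errors that any single resolution would miss.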

class speechbrain.lobes.models.HifiGAN.L1SpecLoss(sample_rate=22050, hop_length=256, win_length=24, n_mel_channels=80, n_fft=1024, n_stft=513, mel_fmin=0.0, mel_fmax=8000.0, mel_normalized=False, power=1.0, norm='slaney', mel_scale='slaney', dynamic_range_compression=True)[source]

Bases: Module

L1 Loss over Spectrograms as described in the HiFiGAN paper (https://arxiv.org/pdf/2010.05646.pdf). Note: L1 loss helps learn details compared with L2 loss.

Parameters:

sample_rate (int) – Sample rate of audio signal.
hop_length (int) – Length of hop between STFT windows.
win_length (int) – Window size.
n_mel_channels (int) – Number of mel filterbanks.
n_fft (int) – Size of FFT.
n_stft (int) – Size of STFT.
mel_fmin (float) – Minimum frequency.
mel_fmax (float) – Maximum frequency.
mel_normalized (bool) – Whether to normalize by magnitude after stft.
power (float) – Exponent for the magnitude spectrogram.
norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band.
mel_scale (str) – Scale to use: “htk” or “slaney”.
dynamic_range_compression (bool) – Whether to apply dynamic range compression.

forward(y_hat, y)[source]

Returns L1 Loss over Spectrograms

Parameters:

y_hat (torch.Tensor) – generated waveform tensor
y (torch.Tensor) – real waveform tensor

Returns:

loss_mag – L1 loss

Return type:

torch.Tensor

class speechbrain.lobes.models.HifiGAN.MSEGLoss(*args, **kwargs)[source]

Bases: Module

Mean Squared Generator Loss. The generator is trained to fool the discriminator by producing samples that the discriminator scores close to 1.

forward(score_fake)[source]

Returns Generator GAN loss

Parameters: score_fake (list) – discriminator scores of generated waveforms D(G(s))
Returns: loss_fake – Generator loss
Return type: torch.Tensor
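A scalar sketch of the least-squares generator objective described above: the loss is the mean squared distance between the discriminator scores of generated samples and the target value 1. The flat list of floats stands in for the score tensors, and `mse_generator_loss` is a hypothetical helper.

```python
def mse_generator_loss(scores_fake):
    """Mean of (D(G(s)) - 1)^2 over a flat list of scores."""
    return sum((s - 1.0) ** 2 for s in scores_fake) / len(scores_fake)


fooled = mse_generator_loss([1.0, 1.0])  # discriminator fully fooled
caught = mse_generator_loss([0.0, 0.5])  # discriminator mostly unconvinced
```

The loss reaches 0 only when every generated sample is scored exactly 1.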

class speechbrain.lobes.models.HifiGAN.MelganFeatureLoss[source]

Bases: Module

Calculates the feature matching loss, which is a learned similarity metric measured by the difference in features of the discriminator between a ground truth sample and a generated sample (Larsen et al., 2016, Kumar et al., 2019).

forward(fake_feats, real_feats)[source]

Returns feature matching loss

Parameters:

fake_feats (list) – discriminator features of generated waveforms
real_feats (list) – discriminator features of groundtruth waveforms

Returns:

loss_feats – Feature matching loss

Return type:

torch.Tensor
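Feature matching compares intermediate discriminator activations rather than final scores. The sketch below computes an L1 distance per layer and averages over all layers of all discriminators; averaging over layers (rather than summing) is an assumption, and nested lists of floats stand in for the feature tensors.

```python
def feature_matching_loss(fake_feats, real_feats):
    """Average L1 distance over every layer of every discriminator.

    Each argument is a list (one entry per discriminator) of lists
    (one entry per layer) of flat feature lists.
    """
    total, n_layers = 0.0, 0
    for fake_layers, real_layers in zip(fake_feats, real_feats):
        for fake, real in zip(fake_layers, real_layers):
            # Mean absolute difference for this layer.
            total += sum(abs(f - r) for f, r in zip(fake, real)) / len(fake)
            n_layers += 1
    return total / n_layers


# Two discriminators: the first with two layers, the second with one.
fake = [[[0.0, 1.0], [2.0]], [[1.0, 1.0]]]
real = [[[0.0, 3.0], [2.0]], [[0.0, 1.0]]]
loss = feature_matching_loss(fake, real)
```

The per-layer distances here are 1.0, 0.0, and 0.5, giving an average of 0.5.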

class speechbrain.lobes.models.HifiGAN.MSEDLoss[source]

Bases: Module

Mean Squared Discriminator Loss. The discriminator is trained to classify ground truth samples as 1 and samples synthesized by the generator as 0.

forward(score_fake, score_real)[source]

Returns Discriminator GAN losses

Parameters:

score_fake (list) – discriminator scores of generated waveforms
score_real (list) – discriminator scores of groundtruth waveforms

Returns:

loss_d (torch.Tensor) – The total discriminator loss
loss_real (torch.Tensor) – The loss on real samples
loss_fake (torch.Tensor) – The loss on fake samples
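The three returned values can be sketched with plain floats: real scores are pushed towards 1, fake scores towards 0, and the total is their sum. Flat lists stand in for the score tensors, and `mse_discriminator_loss` is a hypothetical helper.

```python
def mse_discriminator_loss(scores_fake, scores_real):
    """Return (loss_d, loss_real, loss_fake) for flat lists of scores."""
    # Real samples should be scored 1.
    loss_real = sum((s - 1.0) ** 2 for s in scores_real) / len(scores_real)
    # Generated samples should be scored 0.
    loss_fake = sum(s ** 2 for s in scores_fake) / len(scores_fake)
    return loss_real + loss_fake, loss_real, loss_fake


loss_d, loss_real, loss_fake = mse_discriminator_loss(
    scores_fake=[0.5, 0.0], scores_real=[1.0, 0.5]
)
```

Reporting the real and fake terms separately makes it easier to see which side of the classification the discriminator is struggling with.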

class speechbrain.lobes.models.HifiGAN.GeneratorLoss(stft_loss=None, stft_loss_weight=0, mseg_loss=None, mseg_loss_weight=0, feat_match_loss=None, feat_match_loss_weight=0, l1_spec_loss=None, l1_spec_loss_weight=0, mseg_dur_loss=None, mseg_dur_loss_weight=0)[source]

Bases: Module

Creates a summary of generator losses and applies weights for different losses

Parameters:

stft_loss (object) – object of stft loss
stft_loss_weight (float) – weight of STFT loss
mseg_loss (object) – object of mseg loss
mseg_loss_weight (float) – weight of mseg loss
feat_match_loss (object) – object of feature match loss
feat_match_loss_weight (float) – weight of feature match loss
l1_spec_loss (object) – object of L1 spectrogram loss
l1_spec_loss_weight (float) – weight of L1 spectrogram loss
mseg_dur_loss (object) – object of duration loss
mseg_dur_loss_weight (float) – weight of duration loss

forward(stage, y_hat=None, y=None, scores_fake=None, feats_fake=None, feats_real=None, log_dur_pred=None, log_dur=None)[source]

Returns a dictionary of generator losses and applies weights

Parameters:

stage (sb.Stage) – Either TRAIN or VALID or TEST
y_hat (torch.Tensor) – generated waveform tensor
y (torch.Tensor) – real waveform tensor
scores_fake (list) – discriminator scores of generated waveforms
feats_fake (list) – discriminator features of generated waveforms
feats_real (list) – discriminator features of groundtruth waveforms
log_dur_pred (torch.Tensor) – Predicted log duration
log_dur (torch.Tensor) – Actual log duration

Returns:

loss – The generator losses.

Return type:

dict
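The weighting step amounts to a weighted sum of the individual loss terms. The sketch below is illustrative only: the dictionary keys and loss values are made up, and the weights (2 for feature matching, 45 for the spectrogram term) are taken from the HiFi-GAN paper rather than from this module, whose weights all default to 0.

```python
def weighted_total(losses, weights):
    """Sum each named loss scaled by its weight (hypothetical helper)."""
    return sum(weights[name] * value for name, value in losses.items())


# Illustrative values; the key names are not the ones this class returns.
losses = {"adversarial": 0.5, "feature_match": 0.25, "l1_spec": 0.1}
weights = {"adversarial": 1.0, "feature_match": 2.0, "l1_spec": 45.0}
total = weighted_total(losses, weights)
```

A large spectrogram weight keeps the generator anchored to the reference audio while the adversarial and feature-matching terms refine perceptual detail.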

class speechbrain.lobes.models.HifiGAN.DiscriminatorLoss(msed_loss=None)[source]

Bases: Module

Creates a summary of discriminator losses

Parameters: msed_loss (object) – object of MSE discriminator loss

forward(scores_fake, scores_real)[source]

Returns a dictionary of discriminator losses

Parameters:

scores_fake (list) – discriminator scores of generated waveforms
scores_real (list) – discriminator scores of groundtruth waveforms

Returns:

loss – Contains the keys "D_mse_gan_loss", "D_mse_gan_real_loss", "D_mse_gan_fake_loss", and "D_loss".

Return type:

dict