speechbrain.lobes.models.DiffWave module

Neural network modules for DIFFWAVE: A VERSATILE DIFFUSION MODEL FOR AUDIO SYNTHESIS

For more details: https://arxiv.org/pdf/2009.09761.pdf

Authors
  • Yingzhi WANG 2022

Summary

Classes:

DiffWave

DiffWave Model with dilated residual blocks

DiffWaveDiffusion

An enhanced diffusion implementation with DiffWave-specific inference

DiffusionEmbedding

Embeds the diffusion step into an input vector of DiffWave

ResidualBlock

Residual Block with dilated convolution

SpectrogramUpsampler

Upsampler for spectrograms with transposed convolutions. Only the upsampling is done here; the layer-specific convolution that maps the mel bands into 2× residual channels can be found in the residual block.

Functions:

diffwave_mel_spectogram

Calculates a mel spectrogram for a raw audio signal and preprocesses it for DiffWave training.

Reference

speechbrain.lobes.models.DiffWave.diffwave_mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, norm, mel_scale, audio)[source]

Calculates a mel spectrogram for a raw audio signal and preprocesses it for DiffWave training.

Parameters:
  • sample_rate (int) – Sample rate of audio signal.

  • hop_length (int) – Length of hop between STFT windows.

  • win_length (int) – Window size.

  • n_fft (int) – Size of FFT.

  • n_mels (int) – Number of mel filterbanks.

  • f_min (float) – Minimum frequency.

  • f_max (float) – Maximum frequency.

  • power (float) – Exponent for the magnitude spectrogram.

  • normalized (bool) – Whether to normalize by magnitude after the STFT.

  • norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band (area normalization).

  • mel_scale (str) – Scale to use: “htk” or “slaney”.

  • audio (torch.Tensor) – The input audio signal.

Returns:

mel – the preprocessed mel spectrogram

Return type:

torch.Tensor
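
Example

A minimal usage sketch (not part of the original docs); the parameter values below are typical DiffWave vocoder settings and are assumptions, not prescribed defaults.

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import diffwave_mel_spectogram
>>> audio = torch.randn(1, 22050)  # one second of audio at 22.05 kHz
>>> mel = diffwave_mel_spectogram(
...     sample_rate=22050, hop_length=256, win_length=1024,
...     n_fft=1024, n_mels=80, f_min=0.0, f_max=8000.0,
...     power=1.0, normalized=False, norm="slaney",
...     mel_scale="slaney", audio=audio,
... )
>>> # mel has shape [1, 80, n_frames], with n_frames roughly audio length / hop_length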

class speechbrain.lobes.models.DiffWave.DiffusionEmbedding(max_steps)[source]

Bases: Module

Embeds the diffusion step into an input vector of DiffWave

Parameters:

max_steps (int) – total diffusion steps

Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffusionEmbedding
>>> diffusion_embedding = DiffusionEmbedding(max_steps=50)
>>> time_step = torch.randint(50, (1,))
>>> step_embedding = diffusion_embedding(time_step)
>>> step_embedding.shape
torch.Size([1, 512])
forward(diffusion_step)[source]

Forward pass of the diffusion step embedding.

Parameters:

diffusion_step (torch.Tensor) – which step of diffusion to execute

Returns:

diffusion step embedding

Return type:

torch.Tensor [bs, 512]
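
For intuition, the 512-dimensional output is built from a 128-dimensional sinusoidal encoding of the step index followed by fully connected layers, as in the DiffWave paper; the sketch below re-implements only the sinusoidal part and is illustrative, not the module’s actual code.

>>> import torch
>>> def sinusoidal_step_embedding(steps, dim=128):
...     half = dim // 2
...     # geometric frequency progression 10 ** (j * 4 / (half - 1)), j = 0..half-1
...     exponents = torch.arange(half, dtype=torch.float32) * 4.0 / (half - 1)
...     angles = steps.float().unsqueeze(1) * (10.0 ** exponents).unsqueeze(0)
...     return torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)
>>> sinusoidal_step_embedding(torch.randint(50, (4,))).shape
torch.Size([4, 128])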

class speechbrain.lobes.models.DiffWave.SpectrogramUpsampler[source]

Bases: Module

Upsampler for spectrograms with transposed convolutions. Only the upsampling is done here; the layer-specific convolution that maps the mel bands into 2× residual channels can be found in the residual block.

Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import SpectrogramUpsampler
>>> spec_upsampler = SpectrogramUpsampler()
>>> mel_input = torch.rand(3, 80, 100)
>>> upsampled_mel = spec_upsampler(mel_input)
>>> upsampled_mel.shape
torch.Size([3, 80, 25600])
forward(x)[source]

Upsamples spectrograms 256× to match the length of the audio. The hop length should be 256 when extracting mel spectrograms.

Parameters:

x (torch.Tensor) – input mel spectrogram [bs, 80, mel_len]

Returns:

upsampled spectrogram [bs, 80, mel_len*256]

Return type:

torch.Tensor
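
The 256× factor corresponds to two stride-16 transposed convolutions along the time axis; the sketch below shows how such an upsampling can be realized (the kernel/stride/padding values are assumptions, not necessarily the module’s exact layers).

>>> import torch
>>> up = torch.nn.ConvTranspose2d(1, 1, kernel_size=(3, 32), stride=(1, 16), padding=(1, 8))
>>> mel = torch.rand(3, 80, 100).unsqueeze(1)  # treat the mel spectrogram as a 1-channel image
>>> once = up(mel)                             # 16x upsampling along time
>>> twice = up(once)                           # 256x in total
>>> twice.squeeze(1).shape
torch.Size([3, 80, 25600])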

class speechbrain.lobes.models.DiffWave.ResidualBlock(n_mels, residual_channels, dilation, uncond=False)[source]

Bases: Module

Residual Block with dilated convolution

Parameters:
  • n_mels (int) – number of input mel channels to the conv1x1 for the conditional vocoding task

  • residual_channels (int) – channels of audio convolution

  • dilation (int) – dilation factor of the audio convolution

  • uncond (bool) – if True, the block is used for unconditional generation

Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import ResidualBlock
>>> res_block = ResidualBlock(n_mels=80, residual_channels=64, dilation=3)
>>> noisy_audio = torch.randn(1, 1, 22050)
>>> timestep_embedding = torch.rand(1, 512)
>>> upsampled_mel = torch.rand(1, 80, 22050)
>>> output = res_block(noisy_audio, timestep_embedding, upsampled_mel)
>>> output[0].shape
torch.Size([1, 64, 22050])
forward(x, diffusion_step, conditioner=None)[source]

Forward pass of the Residual Block.

Parameters:
  • x (torch.Tensor) – input sample [bs, 1, time]

  • diffusion_step (torch.Tensor) – the embedding of which step of diffusion to execute

  • conditioner (torch.Tensor) – the condition used for conditional generation

Returns:

  • residual output [bs, residual_channels, time]

  • a skip of residual branch [bs, residual_channels, time]
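
Inside the block, the dilated convolution doubles the channel count and its output is split in two for a WaveNet-style gated activation; the sketch below shows the gating step only (illustrative, not the module’s code).

>>> import torch
>>> y = torch.randn(1, 2 * 64, 22050)      # conv output with 2x residual channels
>>> gate, filt = torch.chunk(y, 2, dim=1)  # split into gate and filter halves
>>> activated = torch.sigmoid(gate) * torch.tanh(filt)
>>> activated.shape
torch.Size([1, 64, 22050])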

class speechbrain.lobes.models.DiffWave.DiffWave(input_channels, residual_layers, residual_channels, dilation_cycle_length, total_steps, unconditional=False)[source]

Bases: Module

DiffWave Model with dilated residual blocks

Parameters:
  • input_channels (int) – number of input mel channels to the conv1x1 for the conditional vocoding task

  • residual_layers (int) – number of residual blocks

  • residual_channels (int) – channels of audio convolution

  • dilation_cycle_length (int) – length of the dilation cycle of the audio convolutions; block i uses dilation 2**(i % dilation_cycle_length)

  • total_steps (int) – total steps of diffusion

  • unconditional (bool) – if True, the model performs unconditional generation

Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffWave
>>> diffwave = DiffWave(
...     input_channels=80,
...     residual_layers=30,
...     residual_channels=64,
...     dilation_cycle_length=10,
...     total_steps=50,
... )
>>> noisy_audio = torch.randn(1, 1, 25600)
>>> timestep = torch.randint(50, (1,))
>>> input_mel = torch.rand(1, 80, 100)
>>> predicted_noise = diffwave(noisy_audio, timestep, input_mel)
>>> predicted_noise.shape
torch.Size([1, 1, 25600])
forward(audio, diffusion_step, spectrogram=None, length=None)[source]

DiffWave forward function

Parameters:
  • audio (torch.Tensor) – input Gaussian sample [bs, 1, time]

  • diffusion_step (torch.Tensor) – which timestep of diffusion to execute [bs, 1]

  • spectrogram (torch.Tensor) – spectrogram data [bs, 80, mel_len]

  • length (torch.Tensor) – sample lengths; not used, provided for compatibility only

Returns:

predicted noise [bs, 1, time]

Return type:

torch.Tensor
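
For context, the noisy input passed to forward during training comes from the standard diffusion forward process x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps; the sketch below uses a linear beta schedule matching the DiffWaveDiffusion example further down and is illustrative, not SpeechBrain’s internal code.

>>> import torch
>>> betas = torch.linspace(0.0001, 0.05, 50)       # linear schedule over 50 steps
>>> alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product of (1 - beta)
>>> x0 = torch.randn(1, 1, 25600)                  # stand-in for clean audio
>>> eps = torch.randn_like(x0)                     # Gaussian noise
>>> t = 10
>>> noisy_audio = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * eps
>>> noisy_audio.shape
torch.Size([1, 1, 25600])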

class speechbrain.lobes.models.DiffWave.DiffWaveDiffusion(model, timesteps=None, noise=None, beta_start=None, beta_end=None, sample_min=None, sample_max=None, show_progress=False)[source]

Bases: DenoisingDiffusion

An enhanced diffusion implementation with DiffWave-specific inference

Parameters:
  • model (nn.Module) – the underlying model

  • timesteps (int) – the total number of timesteps

  • noise (str|nn.Module) – the type of noise to use; “gaussian” produces standard Gaussian noise

  • beta_start (float) – the value of the “beta” parameter at the beginning of the process (see DiffWave paper)

  • beta_end (float) – the value of the “beta” parameter at the end of the process

  • sample_min (float) – lower bound used to clip the output

  • sample_max (float) – upper bound used to clip the output

  • show_progress (bool) – whether to show progress during inference

Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffWave
>>> diffwave = DiffWave(
...     input_channels=80,
...     residual_layers=30,
...     residual_channels=64,
...     dilation_cycle_length=10,
...     total_steps=50,
... )
>>> from speechbrain.lobes.models.DiffWave import DiffWaveDiffusion
>>> from speechbrain.nnet.diffusion import GaussianNoise
>>> diffusion = DiffWaveDiffusion(
...     model=diffwave,
...     beta_start=0.0001,
...     beta_end=0.05,
...     timesteps=50,
...     noise=GaussianNoise,
... )
>>> input_mel = torch.rand(1, 80, 100)
>>> output = diffusion.inference(
...     unconditional=False,
...     scale=256,
...     condition=input_mel,
...     fast_sampling=True,
...     fast_sampling_noise_schedule=[0.0001, 0.001, 0.01, 0.05, 0.2, 0.5],
... )
>>> output.shape
torch.Size([1, 25600])
inference(unconditional, scale, condition=None, fast_sampling=False, fast_sampling_noise_schedule=None, device=None)[source]

Processes inference for DiffWave: one inference function for all locally/globally conditional generation and unconditional generation tasks.

Parameters:
  • unconditional (bool) – do unconditional generation if True, else do conditional generation

  • scale (int) – scale used to obtain the final output waveform length. For conditional generation, the output length is scale * condition.shape[-1]; for example, if the condition is a spectrogram (bs, n_mel, time), scale should be the hop length. For unconditional generation, scale should be the desired audio length.

  • condition (torch.Tensor) – input spectrogram for vocoding, or other conditions for other conditional generation tasks; should be None for unconditional generation

  • fast_sampling (bool) – whether to do fast sampling

  • fast_sampling_noise_schedule (list) – the noise schedule used for fast sampling

  • device (str|torch.device) – inference device

Returns:

predicted_sample – the predicted audio (bs, 1, t)

Return type:

torch.Tensor
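
A hedged sketch of unconditional generation with the same wrapper: the model must be built with unconditional=True, and scale then gives the desired output length in samples. The hyperparameters mirror the conditional example above and are illustrative.

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffWave, DiffWaveDiffusion
>>> from speechbrain.nnet.diffusion import GaussianNoise
>>> uncond_model = DiffWave(
...     input_channels=80,
...     residual_layers=30,
...     residual_channels=64,
...     dilation_cycle_length=10,
...     total_steps=50,
...     unconditional=True,
... )
>>> uncond_diffusion = DiffWaveDiffusion(
...     model=uncond_model,
...     beta_start=0.0001,
...     beta_end=0.05,
...     timesteps=50,
...     noise=GaussianNoise,
... )
>>> audio = uncond_diffusion.inference(
...     unconditional=True,
...     scale=22050,  # desired audio length in samples
... )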