speechbrain.lobes.models.DiffWave module

Neural network modules for DIFFWAVE: A VERSATILE DIFFUSION MODEL FOR AUDIO SYNTHESIS

For more details: https://arxiv.org/pdf/2009.09761.pdf

Authors
  • Yingzhi WANG 2022

Summary

Classes:

DiffWave

DiffWave Model with dilated residual blocks

DiffWaveDiffusion

An enhanced diffusion implementation with DiffWave-specific inference

DiffusionEmbedding

Embeds the diffusion step into an input vector of DiffWave

ResidualBlock

Residual Block with dilated convolution

SpectrogramUpsampler

Upsampler for spectrograms with transposed convolutions. Only the upsampling is done here; the layer-specific convolution that maps the mel bands into 2× residual channels can be found in the residual block.

Functions:

diffwave_mel_spectogram

Calculates a mel spectrogram for a raw audio signal and preprocesses it for DiffWave training.

Reference

speechbrain.lobes.models.DiffWave.diffwave_mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, norm, mel_scale, audio)[source]

Calculates a mel spectrogram for a raw audio signal and preprocesses it for DiffWave training.

Parameters:
  • sample_rate (int) – Sample rate of audio signal.

  • hop_length (int) – Length of hop between STFT windows.

  • win_length (int) – Window size.

  • n_fft (int) – Size of FFT.

  • n_mels (int) – Number of mel filterbanks.

  • f_min (float) – Minimum frequency.

  • f_max (float) – Maximum frequency.

  • power (float) – Exponent for the magnitude spectrogram.

  • normalized (bool) – Whether to normalize by magnitude after the STFT.

  • norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band (area normalization).

  • mel_scale (str) – Scale to use: “htk” or “slaney”.

  • audio (torch.Tensor) – The input audio signal.

Returns:

mel – the preprocessed mel spectrogram

Return type:

torch.Tensor
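
Example

A minimal usage sketch (not part of the original docs); the parameter values below are typical DiffWave vocoder settings and are assumptions, not prescribed defaults.

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import diffwave_mel_spectogram
>>> audio = torch.randn(1, 22050)  # one second of audio at 22.05 kHz
>>> mel = diffwave_mel_spectogram(
...     sample_rate=22050, hop_length=256, win_length=1024,
...     n_fft=1024, n_mels=80, f_min=0.0, f_max=8000.0,
...     power=1.0, normalized=False, norm="slaney",
...     mel_scale="slaney", audio=audio,
... )
>>> # mel has shape [1, 80, n_frames], with n_frames roughly audio length / hop_length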

class speechbrain.lobes.models.DiffWave.DiffusionEmbedding(max_steps)[source]

Bases: Module

Embeds the diffusion step into an input vector of DiffWave

Parameters:

max_steps (int) – total diffusion steps

Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffusionEmbedding
>>> diffusion_embedding = DiffusionEmbedding(max_steps=50)
>>> time_step = torch.randint(50, (1,))
>>> step_embedding = diffusion_embedding(time_step)
>>> step_embedding.shape
torch.Size([1, 512])
forward(diffusion_step)[source]

Forward pass of the diffusion step embedding.

Parameters:

diffusion_step (torch.Tensor) – which step of diffusion to execute

Returns:

diffusion step embedding

Return type:

torch.Tensor [bs, 512]
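
For intuition, the 512-dimensional output is built from a 128-dimensional sinusoidal encoding of the step index followed by fully connected layers, as in the DiffWave paper; the sketch below re-implements only the sinusoidal part and is illustrative, not the module’s actual code.

>>> import torch
>>> def sinusoidal_step_embedding(steps, dim=128):
...     half = dim // 2
...     # geometric frequency progression 10 ** (j * 4 / (half - 1)), j = 0..half-1
...     exponents = torch.arange(half, dtype=torch.float32) * 4.0 / (half - 1)
...     angles = steps.float().unsqueeze(1) * (10.0 ** exponents).unsqueeze(0)
...     return torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)
>>> sinusoidal_step_embedding(torch.randint(50, (4,))).shape
torch.Size([4, 128])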

class speechbrain.lobes.models.DiffWave.SpectrogramUpsampler[source]

Bases: Module

Upsampler for spectrograms with transposed convolutions. Only the upsampling is done here; the layer-specific convolution that maps the mel bands into 2× residual channels can be found in the residual block.

Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import SpectrogramUpsampler
>>> spec_upsampler = SpectrogramUpsampler()
>>> mel_input = torch.rand(3, 80, 100)
>>> upsampled_mel = spec_upsampler(mel_input)
>>> upsampled_mel.shape
torch.Size([3, 80, 25600])
forward(x)[source]

Upsamples spectrograms 256× to match the length of the audio. The hop length should be 256 when extracting mel spectrograms.

Parameters:

x (torch.Tensor) – input mel spectrogram [bs, 80, mel_len]

Returns:

upsampled spectrogram [bs, 80, mel_len*256]

Return type:

torch.Tensor
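
The 256× factor corresponds to two stride-16 transposed convolutions along the time axis; the sketch below shows how such an upsampling can be realized (the kernel/stride/padding values are assumptions, not necessarily the module’s exact layers).

>>> import torch
>>> up = torch.nn.ConvTranspose2d(1, 1, kernel_size=(3, 32), stride=(1, 16), padding=(1, 8))
>>> mel = torch.rand(3, 80, 100).unsqueeze(1)  # treat the mel spectrogram as a 1-channel image
>>> once = up(mel)                             # 16x upsampling along time
>>> twice = up(once)                           # 256x in total
>>> twice.squeeze(1).shape
torch.Size([3, 80, 25600])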

class speechbrain.lobes.models.DiffWave.ResidualBlock(n_mels, residual_channels, dilation, uncond=False)[source]

Bases: Module

Residual Block with dilated convolution

Parameters:
  • n_mels (int) – number of input mel channels to the conv1x1 for the conditional vocoding task

  • residual_channels (int) – channels of audio convolution

  • dilation (int) – dilation factor of the audio convolution

  • uncond (bool) – if True, the block is used for unconditional generation

Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import ResidualBlock
>>> res_block = ResidualBlock(n_mels=80, residual_channels=64, dilation=3)
>>> noisy_audio = torch.randn(1, 1, 22050)
>>> timestep_embedding = torch.rand(1, 512)
>>> upsampled_mel = torch.rand(1, 80, 22050)
>>> output = res_block(noisy_audio, timestep_embedding, upsampled_mel)
>>> output[0].shape
torch.Size([1, 64, 22050])
forward(x, diffusion_step, conditioner=None)[source]

Forward pass of the Residual Block.

Parameters:
  • x (torch.Tensor) – input sample [bs, 1, time]

  • diffusion_step (torch.Tensor) – the embedding of which step of diffusion to execute

  • conditioner (torch.Tensor) – the condition used for conditional generation

Returns:

  • residual output [bs, residual_channels, time]

  • a skip of residual branch [bs, residual_channels, time]
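
Inside the block, the dilated convolution doubles the channel count and its output is split in two for a WaveNet-style gated activation; the sketch below shows the gating step only (illustrative, not the module’s code).

>>> import torch
>>> y = torch.randn(1, 2 * 64, 22050)      # conv output with 2x residual channels
>>> gate, filt = torch.chunk(y, 2, dim=1)  # split into gate and filter halves
>>> activated = torch.sigmoid(gate) * torch.tanh(filt)
>>> activated.shape
torch.Size([1, 64, 22050])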

class speechbrain.lobes.models.DiffWave.DiffWave(input_channels, residual_layers, residual_channels, dilation_cycle_length, total_steps, unconditional=False)[source]

Bases: Module

DiffWave Model with dilated residual blocks

Parameters:
  • input_channels (int) – number of input mel channels to the conv1x1 for the conditional vocoding task

  • residual_layers (int) – number of residual blocks

  • residual_channels (int) – channels of audio convolution

  • dilation_cycle_length (int) – length of the dilation cycle of the audio convolutions; block i uses dilation 2**(i % dilation_cycle_length)

  • total_steps (int) – total steps of diffusion

  • unconditional (bool) – if True, the model performs unconditional generation

Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffWave
>>> diffwave = DiffWave(
...     input_channels=80,
...     residual_layers=30,
...     residual_channels=64,
...     dilation_cycle_length=10,
...     total_steps=50,
... )
>>> noisy_audio = torch.randn(1, 1, 25600)
>>> timestep = torch.randint(50, (1,))
>>> input_mel = torch.rand(1, 80, 100)
>>> predicted_noise = diffwave(noisy_audio, timestep, input_mel)
>>> predicted_noise.shape
torch.Size([1, 1, 25600])
forward(audio, diffusion_step, spectrogram=None, length=None)[source]

DiffWave forward function

Parameters:
  • audio (torch.Tensor) – input Gaussian sample [bs, 1, time]

  • diffusion_step (torch.Tensor) – which timestep of diffusion to execute [bs, 1]

  • spectrogram (torch.Tensor) – spectrogram data [bs, 80, mel_len]

  • length (torch.Tensor) – sample lengths; not used, provided for compatibility only

Returns:

predicted noise [bs, 1, time]

Return type:

torch.Tensor
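
For context, the noisy input passed to forward during training comes from the standard diffusion forward process x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps; the sketch below uses a linear beta schedule matching the DiffWaveDiffusion example further down and is illustrative, not SpeechBrain’s internal code.

>>> import torch
>>> betas = torch.linspace(0.0001, 0.05, 50)       # linear schedule over 50 steps
>>> alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product of (1 - beta)
>>> x0 = torch.randn(1, 1, 25600)                  # stand-in for clean audio
>>> eps = torch.randn_like(x0)                     # Gaussian noise
>>> t = 10
>>> noisy_audio = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * eps
>>> noisy_audio.shape
torch.Size([1, 1, 25600])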

class speechbrain.lobes.models.DiffWave.DiffWaveDiffusion(model, timesteps=None, noise=None, beta_start=None, beta_end=None, sample_min=None, sample_max=None, show_progress=False)[source]

Bases: DenoisingDiffusion

An enhanced diffusion implementation with DiffWave-specific inference

Parameters:
  • model (nn.Module) – the underlying model

  • timesteps (int) – the total number of timesteps

  • noise (str|nn.Module) – the type of noise to use; “gaussian” produces standard Gaussian noise

  • beta_start (float) – the value of the “beta” parameter at the beginning of the process (see DiffWave paper)

  • beta_end (float) – the value of the “beta” parameter at the end of the process

  • sample_min (float) – lower bound used to clip the output

  • sample_max (float) – upper bound used to clip the output

  • show_progress (bool) – whether to show progress during inference

Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffWave
>>> diffwave = DiffWave(
...     input_channels=80,
...     residual_layers=30,
...     residual_channels=64,
...     dilation_cycle_length=10,
...     total_steps=50,
... )
>>> from speechbrain.lobes.models.DiffWave import DiffWaveDiffusion
>>> from speechbrain.nnet.diffusion import GaussianNoise
>>> diffusion = DiffWaveDiffusion(
...     model=diffwave,
...     beta_start=0.0001,
...     beta_end=0.05,
...     timesteps=50,
...     noise=GaussianNoise,
... )
>>> input_mel = torch.rand(1, 80, 100)
>>> output = diffusion.inference(
...     unconditional=False,
...     scale=256,
...     condition=input_mel,
...     fast_sampling=True,
...     fast_sampling_noise_schedule=[0.0001, 0.001, 0.01, 0.05, 0.2, 0.5],
... )
>>> output.shape
torch.Size([1, 25600])
inference(unconditional, scale, condition=None, fast_sampling=False, fast_sampling_noise_schedule=None, device=None)[source]

Processes inference for DiffWave: one inference function for all locally/globally conditional generation and unconditional generation tasks.

Parameters:
  • unconditional (bool) – do unconditional generation if True, else do conditional generation

  • scale (int) – scale used to obtain the final output waveform length. For conditional generation, the output length is scale * condition.shape[-1]; for example, if the condition is a spectrogram (bs, n_mel, time), scale should be the hop length. For unconditional generation, scale should be the desired audio length.

  • condition (torch.Tensor) – input spectrogram for vocoding, or other conditions for other conditional generation tasks; should be None for unconditional generation

  • fast_sampling (bool) – whether to do fast sampling

  • fast_sampling_noise_schedule (list) – the noise schedule used for fast sampling

  • device (str|torch.device) – inference device

Returns:

predicted_sample – the predicted audio (bs, 1, t)

Return type:

torch.Tensor
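
A hedged sketch of unconditional generation with the same wrapper: the model must be built with unconditional=True, and scale then gives the desired output length in samples. The hyperparameters mirror the conditional example above and are illustrative.

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffWave, DiffWaveDiffusion
>>> from speechbrain.nnet.diffusion import GaussianNoise
>>> uncond_model = DiffWave(
...     input_channels=80,
...     residual_layers=30,
...     residual_channels=64,
...     dilation_cycle_length=10,
...     total_steps=50,
...     unconditional=True,
... )
>>> uncond_diffusion = DiffWaveDiffusion(
...     model=uncond_model,
...     beta_start=0.0001,
...     beta_end=0.05,
...     timesteps=50,
...     noise=GaussianNoise,
... )
>>> audio = uncond_diffusion.inference(
...     unconditional=True,
...     scale=22050,  # desired audio length in samples
... )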