SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping

Yuma Koizumi, Heiga Zen, Kohei Yatabe, Nanxin Chen, Michiel Bacchiani

Abstract: Neural vocoders based on denoising diffusion probabilistic models (DDPMs) have been improved by adapting the diffusion noise distribution to the given acoustic features. In this study, we propose SpecGrad, which adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram. This adaptation by time-varying filtering improves the sound quality, especially in the high-frequency bands. The filtering is performed in the time-frequency domain, which keeps the computational cost almost the same as that of conventional DDPM-based neural vocoders. Experimental results showed that SpecGrad generates higher-fidelity speech waveforms than conventional DDPM-based neural vocoders in both analysis-synthesis and speech enhancement scenarios.
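To make the idea of time-varying spectral shaping in the time-frequency domain concrete, the following is a minimal NumPy sketch and not the SpecGrad implementation. It assumes the spectral envelope is already available as a non-negative magnitude per frame and frequency bin; how that envelope is obtained from the conditioning log-mel spectrogram is not reproduced here. The helper names (`stft`, `istft`, `shape_noise`), the frame settings (`n_fft=512`, `hop=128`), and the toy envelope are all hypothetical choices made for illustration.

```python
import numpy as np


def stft(x, n_fft=512, hop=128):
    """Hann-windowed STFT; returns shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def istft(spec, n_fft=512, hop=128):
    """Weighted overlap-add inverse of `stft`, using the same Hann window."""
    window = np.hanning(n_fft)
    n_frames = spec.shape[0]
    length = n_fft + hop * (n_frames - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i in range(n_frames):
        out[i * hop : i * hop + n_fft] += frames[i] * window
        norm[i * hop : i * hop + n_fft] += window ** 2
    # Guard against division by zero where the window vanishes (signal edges).
    return out / np.maximum(norm, 1e-8)


def shape_noise(noise, envelope, n_fft=512, hop=128):
    """Multiply each STFT frame of `noise` by the matching row of `envelope`."""
    return istft(stft(noise, n_fft, hop) * envelope, n_fft, hop)


# Toy setup (assumption): white Gaussian noise and a hand-made envelope that
# emphasizes the low band in early frames and the high band in late frames.
rng = np.random.default_rng(0)
n_fft, hop = 512, 128
noise = rng.standard_normal(16000)
n_frames = 1 + (len(noise) - n_fft) // hop
n_bins = n_fft // 2 + 1

envelope = np.full((n_frames, n_bins), 0.05)
half_f, half_b = n_frames // 2, n_bins // 2
envelope[:half_f, :half_b] = 1.0
envelope[half_f:, half_b:] = 1.0

shaped = shape_noise(noise, envelope, n_fft, hop)

# Re-analyze the shaped noise to check that its energy follows the envelope.
power = np.abs(stft(shaped, n_fft, hop)) ** 2
early, late = power[: half_f - 4], power[half_f + 4 :]
print("early frames, low/high band power ratio:",
      early[:, :half_b].mean() / early[:, half_b:].mean())
print("late frames, high/low band power ratio:",
      late[:, half_b:].mean() / late[:, :half_b].mean())
```

In this sketch the filtering is a single element-wise multiplication per frame between one forward and one inverse transform, which illustrates why shaping the noise in the time-frequency domain adds little cost on top of drawing the noise itself.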

Comparison animation of WaveGrad [1], PriorGrad [2], and SpecGrad's waveform generation in 50 refinement iterations:

Analysis-synthesis examples of WaveGrad [1], PriorGrad [2], and SpecGrad:

Example 1: I can't speak for Scooby, but have you looked in the Mystery Machine?

Initial noise
3 iterations with WG-3 schedule
6 iterations with WG-6 schedule
50 iterations with WG-50 schedule

WaveGrad

PriorGrad

SpecGrad

Example 2: The dreaded, head pounding, body aching, feverish, nauseating, cough fest packs equal parts misery and inconvenience.

Initial noise
3 iterations with WG-3 schedule
6 iterations with WG-6 schedule
50 iterations with WG-50 schedule

WaveGrad

PriorGrad

SpecGrad

Example 3: If done right, it will also create jobs.

Initial noise
3 iterations with WG-3 schedule
6 iterations with WG-6 schedule
50 iterations with WG-50 schedule

WaveGrad

PriorGrad

SpecGrad

Example 4: There are many talented actors in the world.

Initial noise
3 iterations with WG-3 schedule
6 iterations with WG-6 schedule
50 iterations with WG-50 schedule

WaveGrad

PriorGrad

SpecGrad

Example 5: Nine hundred kilowatts times twenty four is a lot of watts.

Initial noise
3 iterations with WG-3 schedule
6 iterations with WG-6 schedule
50 iterations with WG-50 schedule

WaveGrad

PriorGrad

SpecGrad

Speech enhancement examples of WaveGrad [1], PriorGrad [2], and SpecGrad:

Example 1: I can't speak for Scooby, but have you looked in the Mystery Machine?

Noisy input

Ground-truth

WaveGrad

PriorGrad

SpecGrad

Example 2: The dreaded, head pounding, body aching, feverish, nauseating, cough fest packs equal parts misery and inconvenience.

Noisy input

Ground-truth

WaveGrad

PriorGrad

SpecGrad

Example 3: The new entity set about warping reality all over Scotland.

Noisy input

Ground-truth

WaveGrad

PriorGrad

SpecGrad

Example 4: Okay, two forty p.m.

Noisy input

Ground-truth

WaveGrad

PriorGrad

SpecGrad

Example 5: Aromatherapy. The use of aromatic plant extracts and essential oils in massage or baths.

Noisy input

Ground-truth

WaveGrad

PriorGrad

SpecGrad

References:

[1] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, “WaveGrad: Estimating gradients for waveform generation,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2021. [paper]
[2] S. Lee, H. Kim, C. Shin, X. Tan, C. Liu, Q. Meng, T. Qin, W. Chen, S. Yoon, and T.-Y. Liu, “PriorGrad: Improving conditional denoising diffusion models with data-dependent adaptive prior,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2022. [paper]
