Abstract: Neural vocoders based on denoising diffusion probabilistic models (DDPMs) have been improved by adapting the diffusion noise distribution to the given acoustic features. In this study, we propose SpecGrad, which adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram. This adaptation by time-varying filtering improves sound quality, especially in the high-frequency bands. The filtering is performed in the time-frequency domain to keep the computational cost almost the same as that of conventional DDPM-based neural vocoders. Experimental results showed that SpecGrad generates higher-fidelity speech waveforms than conventional DDPM-based neural vocoders in both analysis-synthesis and speech enhancement scenarios.
Comparison animation of WaveGrad [1], PriorGrad [2], and SpecGrad's waveform generation in 50 refinement iterations:
Analysis-synthesis examples of WaveGrad [1], PriorGrad [2], and SpecGrad:
Example 1: I can't speak for Scooby, but have you looked in the Mystery Machine?
Initial noise
3 iterations with WG-3 schedule
6 iterations with WG-6 schedule
50 iterations with WG-50 schedule
WaveGrad
PriorGrad
SpecGrad
Example 2: The dreaded, head pounding, body aching, feverish, nauseating, cough fest packs equal parts misery and inconvenience.
Initial noise
3 iterations with WG-3 schedule
6 iterations with WG-6 schedule
50 iterations with WG-50 schedule
WaveGrad
PriorGrad
SpecGrad
Example 3: If done right, it will also create jobs.
Initial noise
3 iterations with WG-3 schedule
6 iterations with WG-6 schedule
50 iterations with WG-50 schedule
WaveGrad
PriorGrad
SpecGrad
Example 4: There are many talented actors in the world.
Initial noise
3 iterations with WG-3 schedule
6 iterations with WG-6 schedule
50 iterations with WG-50 schedule
WaveGrad
PriorGrad
SpecGrad
Example 5: Nine hundred kilowatts times twenty four is a lot of watts.
Initial noise
3 iterations with WG-3 schedule
6 iterations with WG-6 schedule
50 iterations with WG-50 schedule
WaveGrad
PriorGrad
SpecGrad
Speech enhancement examples of WaveGrad [1], PriorGrad [2], and SpecGrad:
Example 1: I can't speak for Scooby, but have you looked in the Mystery Machine?
Noisy input
Ground-truth
WaveGrad
PriorGrad
SpecGrad
Example 2: The dreaded, head pounding, body aching, feverish, nauseating, cough fest packs equal parts misery and inconvenience.
Noisy input
Ground-truth
WaveGrad
PriorGrad
SpecGrad
Example 3: The new entity set about warping reality all over Scotland.
Noisy input
Ground-truth
WaveGrad
PriorGrad
SpecGrad
Example 4: Okay, two forty p.m.
Noisy input
Ground-truth
WaveGrad
PriorGrad
SpecGrad
Example 5: Aromatherapy. The use of aromatic plant extracts and essential oils in massage or baths.
Noisy input
Ground-truth
WaveGrad
PriorGrad
SpecGrad
References:
[1] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, “WaveGrad: Estimating gradients for waveform generation,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2021. [paper]
[2] S. Lee, H. Kim, C. Shin, X. Tan, C. Liu, Q. Meng, T. Qin, W. Chen, S. Yoon, and T.-Y. Liu, “PriorGrad: Improving conditional denoising diffusion models with data-dependent adaptive prior,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2022. [paper]