Microsoft edges closer to zero-shot voice cloning


Microsoft presents NaturalSpeech 2, a text-to-speech model that is based on diffusion models and is capable of cloning any voice with just a short snippet of audio.

Microsoft Research Asia and Microsoft Azure Speech developed NaturalSpeech 2 around a diffusion model that works together with a neural audio codec, which compresses waveforms into vectors. The team trained the codec on 44,000 hours of speech and singing data, with the codec encoder learning to convert waveforms to vectors using residual vector quantizers (RVQ).

The RVQ uses several “codebooks” as templates for this process, compressing waveforms into predefined vectors. The codec decoder converts the quantized vectors back into waveforms. During training, the diffusion model learns to convert text into such vectors, so that at inference time it can turn arbitrary text input into vectors that the decoder then converts into speech or singing.
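The codebook idea can be illustrated with a minimal sketch of residual vector quantization in plain Python. This is not the NaturalSpeech 2 codec: the two tiny codebooks, their values, and the function names are hypothetical and chosen by hand, whereas a real codec learns much larger codebooks from data. The sketch only shows the principle that each codebook encodes what the previous ones missed, and that reconstruction sums the selected entries.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage encodes the residual left by earlier stages."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen entry so the next codebook only sees the remaining error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(codebooks, indices):
    """Reconstruct the vector by summing the selected entry from every codebook."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def squared_error(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical toy codebooks: a coarse one, then a finer one for corrections.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]

frame = [1.2, 0.9]                      # stand-in for one encoded audio frame
codes = rvq_encode(codebooks, frame)    # one index per codebook
approx = rvq_decode(codebooks, codes)   # reconstruction from the indices alone

coarse_only = rvq_decode(codebooks[:1], codes[:1])
print(codes, approx)
print(squared_error(frame, coarse_only), squared_error(frame, approx))
```

In this toy setup the frame is stored as two small integers instead of two floating-point numbers, and adding the second codebook reduces the reconstruction error compared with using the first codebook alone.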

Microsoft’s NaturalSpeech 2 outperforms VALL-E

NaturalSpeech 2 has over 400 million parameters and generates speech with different speaker identities, prosodies and styles (e.g., singing) in zero-shot scenarios where only a few seconds of speech are available. In experiments, the team shows that NaturalSpeech 2 is able to generate natural speech in these scenarios, outperforming the best text-to-speech systems to date, including Microsoft’s own VALL-E, which is based on a language model over audio codec tokens rather than on a diffusion model.


Audio example: text prompt “And lay me down in my cold bed and leave my shining lot.” with an audio reference, the ground-truth recording, and the NaturalSpeech 2 output.


The 44,000 hours of recordings used for training came from 5,000 different speakers and included recordings made under less-than-ideal conditions rather than in a studio. The audio codec was trained on 8 Nvidia Tesla V100 GPUs (16 gigabytes each), and the diffusion model on 16 V100 GPUs (32 gigabytes each).

NaturalSpeech 2: Microsoft warns of misuse

The team warns of possible misuse of the system: “NaturalSpeech 2 can synthesize speech with good expressiveness/fidelity and good similarity with a speech prompt, which could be potentially misused, such as speaker mimicking and voice spoofing.” Similar problems already exist with publicly available models, however. Microsoft did not announce plans to release NaturalSpeech 2.

In the future, the team plans to scale up the training and test it on much larger speech and singing datasets. They also want to make the model more efficient, for example by using the consistency models recently introduced by OpenAI as an alternative to diffusion models.

More examples are available on the NaturalSpeech 2 project page.
