StyleTTS-VC

One-shot voice conversion (VC) aims to convert speech from any source speaker to an arbitrary target speaker using only one reference utterance from the target speaker. This relies heavily on disentangling the speaker’s identity from the speech content, a task that remains challenging. Here, we propose a novel approach to learning disentangled speech representations by transfer learning from style-based text-to-speech (TTS) models. With cycle-consistent and adversarial training, the style-based TTS models can perform transcription-guided one-shot VC with high fidelity and similarity. By learning an additional mel-spectrogram encoder through teacher-student knowledge transfer and a novel data augmentation scheme, our approach yields disentangled speech representations without needing the input text. Subjective evaluation shows that our approach significantly outperforms previous state-of-the-art one-shot voice conversion models in both naturalness and similarity.

Any-to-Any Conversion

All of the following audio samples are converted from one speaker to another, both unseen during training. For a fair comparison with the baseline models, all audio is downsampled to 16 kHz. The input to the VC models was trimmed, so the output has a different length from the input.
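The 16 kHz downsampling mentioned above can be reproduced in several ways. The following is a minimal sketch, not the authors' preprocessing code: it assumes a mono waveform in a NumPy array, a hypothetical 24 kHz source rate, and SciPy's polyphase resampler. The helper name `downsample` is illustrative only.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly


def downsample(wave: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Resample a mono waveform from orig_sr to target_sr."""
    g = gcd(orig_sr, target_sr)
    # resample_poly upsamples, applies a low-pass FIR filter to limit
    # aliasing, then downsamples by the reduced integer ratio.
    return resample_poly(wave, target_sr // g, orig_sr // g)


# One second of a 440 Hz tone at an assumed 24 kHz source rate.
orig_sr = 24000
t = np.arange(orig_sr) / orig_sr
tone = np.sin(2 * np.pi * 440 * t)

tone_16k = downsample(tone, orig_sr)  # one second of audio at 16 kHz
```

Resampling every system's output to the same rate keeps differences in bandwidth from influencing listeners when samples are compared side by side.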

All utterances are completely unseen during training, and the results are uncurated (NOT cherry-picked) unless otherwise specified.

For more audio samples, please see the survey we used for MOS evaluation here. You may have to select some answers at random before you can proceed to the next page.

Samples 1 and 2