Taco-VC: A Single Speaker Tacotron based Voice Conversion with Limited Data

On this page you will find a short description of the Taco-VC system and converted audio samples.
The paper is available at https://arxiv.org/abs/1904.03522

Contact Information

Roee Levy Leshem, Raja Giryes
roeelev1@mail.tau.ac.il, raja@tauex.tau.ac.il
School of Electrical Engineering, Tel Aviv University, Tel Aviv, Israel

Voice Conversion

The purpose of voice conversion (VC) is to convert the speech of a source speaker so that it sounds as if it were spoken by a given target speaker.
A successful conversion preserves the linguistic and phonetic characteristics of the source audio while sounding natural and similar to the target speaker.

Taco-VC Architecture

Taco-VC is a four-stage architecture for high-quality, non-parallel, many-to-one voice conversion.
Its advantage is that training requires a large corpus from only a single speaker.
Phonetic Posteriorgrams (PPGs) are extracted from a phoneme recognition (PR) model to preserve the prosody of the source speech.
Using a single-speaker Tacotron, we synthesize the target Mel-Spectrograms (MSPECs) directly from the PPGs.
The synthesized MSPECs (SMSPECs) are passed through a speech enhancement network (Taco-SE), which outputs the speech-enhanced SMSPECs (SE-SMSPECs).
Finally, a WaveNet vocoder generates the target audio from the SE-SMSPECs.
We use the same acoustic features (80-band MSPECs) in all of our networks.
The Tacotron and WaveNet are single-speaker models that are first trained on the LJ Speech dataset [2] and then fine-tuned to new targets with limited training data.
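
The four stages described above can be read as a simple function composition. The following is a minimal sketch of the data flow only, not the authors' implementation. The stage functions (`extract_ppg`, `tacotron_synthesize`, `taco_se_enhance`, `wavenet_vocode`) are hypothetical placeholders that return arrays of plausible shapes, and the PPG dimension and hop size are illustrative assumptions. Only the 80-band MSPEC size comes from the description above.

```python
import numpy as np

N_MELS = 80     # 80-band MSPECs, as stated in the description
N_PHONES = 40   # assumed PPG dimension (hypothetical)
HOP = 256       # assumed samples per frame (hypothetical)

rng = np.random.default_rng(0)

def extract_ppg(wave):
    """Stage 1 placeholder: PR model -> frame-level phoneme posteriors."""
    n_frames = len(wave) // HOP
    logits = rng.normal(size=(n_frames, N_PHONES))
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)  # each row sums to 1

def tacotron_synthesize(ppg):
    """Stage 2 placeholder: PPGs -> synthesized MSPECs (SMSPECs)."""
    projection = rng.normal(size=(ppg.shape[1], N_MELS))
    return ppg @ projection

def taco_se_enhance(smspec):
    """Stage 3 placeholder: SMSPECs -> SE-SMSPECs (same shape)."""
    return smspec.copy()  # identity stand-in for the enhancement network

def wavenet_vocode(se_smspec):
    """Stage 4 placeholder: SE-SMSPECs -> waveform samples."""
    return np.zeros(se_smspec.shape[0] * HOP)

def convert(source_wave):
    """Chain the four stages: PPG -> SMSPEC -> SE-SMSPEC -> audio."""
    ppg = extract_ppg(source_wave)
    smspec = tacotron_synthesize(ppg)
    se_smspec = taco_se_enhance(smspec)
    return wavenet_vocode(se_smspec)

source = np.zeros(16000)      # dummy source audio
converted = convert(source)
print(converted.shape)
```

Note that the placeholders keep the number of frames fixed from stage to stage, which is a simplification made for this sketch. Every stage after the first operates on the same 80-band representation, which is why the enhancement stage can sit between the synthesizer and the vocoder without changing either interface.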

Conversion Process

Training Process

Audio Samples

The following target and source audio samples are from the VCC’18 SPOKE task [1].
The training set we use for reference includes two male and two female target speakers.
Each speaker has the same 81 utterances for training and 35 utterances for testing.
The whole training set is approximately 5 minutes of speech.
The target speakers are two males (VCC2TM1, VCC2TM2) and two females (VCC2TF1, VCC2TF2).
The source speakers are two males (VCC2SM3, VCC2SM4) and two females (VCC2SF3, VCC2SF4).
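
The speaker IDs listed above define 16 source-target combinations (4 targets times 4 sources). As a small illustrative sketch, the pairs can be enumerated as follows. The helper `gender` is a hypothetical function written for this example, and it assumes the sixth character of each ID encodes the gender (F or M), which holds for the eight IDs listed here.

```python
from itertools import product

# Speaker IDs as listed in the text
TARGETS = ["VCC2TF1", "VCC2TM1", "VCC2TF2", "VCC2TM2"]
SOURCES = ["VCC2SF3", "VCC2SF4", "VCC2SM3", "VCC2SM4"]

def gender(speaker_id):
    """Read the gender from the sixth character of the ID (assumed convention)."""
    return {"F": "Female", "M": "Male"}[speaker_id[5]]

# Every target is paired with every source
pairs = list(product(TARGETS, SOURCES))

for target, source in pairs:
    print(f"Target - {gender(target)} {target}, Source - {gender(source)} {source}")
```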

The converted utterances are generated by adapting Taco-VC to the target speaker.
Training uses only the target speaker's training data; Taco-VC is not trained on the source speaker.

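In the following samples you can hear four conversions for each target.
Each conversion has two converted audio files: Taco-VC, the output of the full system, and Taco-VC-NoSe, which, as its name suggests, is the output without the Taco-SE speech enhancement stage.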

Target - Female VCC2TF1 - train Samples
Sample 1
Sample 2
Target - Female VCC2TF1 , Source - Female VCC2SF3
Source
Target
Taco-VC
Taco-VC-NoSe
Target - Female VCC2TF1 , Source - Female VCC2SF4
Source
Target
Taco-VC
Taco-VC-NoSe
Target - Female VCC2TF1 , Source - Male VCC2SM3
Source
Target
Taco-VC
Taco-VC-NoSe
Target - Female VCC2TF1 , Source - Male VCC2SM4
Source
Target
Taco-VC
Taco-VC-NoSe
Target - Male VCC2TM1 - train Samples
Sample 1
Sample 2
Target - Male VCC2TM1 , Source - Female VCC2SF3
Source
Target
Taco-VC
Taco-VC-NoSe
Target - Male VCC2TM1 , Source - Female VCC2SF4
Source
Target
Taco-VC
Taco-VC-NoSe
Target - Male VCC2TM1 , Source - Male VCC2SM3
Source
Target
Taco-VC
Taco-VC-NoSe
Target - Male VCC2TM1 , Source - Male VCC2SM4
Source
Target
Taco-VC
Taco-VC-NoSe
Target - Female VCC2TF2 - train Samples
Sample 1
Sample 2
Target - Female VCC2TF2 , Source - Female VCC2SF3
Source
Target
Taco-VC
Taco-VC-NoSe
Target - Female VCC2TF2 , Source - Female VCC2SF4
Source
Target
Taco-VC
Taco-VC-NoSe
Target - Female VCC2TF2 , Source - Male VCC2SM3
Source
Target
Taco-VC
Taco-VC-NoSe
Target - Female VCC2TF2 , Source - Male VCC2SM4
Source
Target
Taco-VC
Taco-VC-NoSe
Target - Male VCC2TM2 - train Samples
Sample 1
Sample 2
Target - Male VCC2TM2 , Source - Female VCC2SF3
Source
Target
Taco-VC
Taco-VC-NoSe
Target - Male VCC2TM2 , Source - Female VCC2SF4
Source
Target
Taco-VC
Taco-VC-NoSe
Target - Male VCC2TM2 , Source - Male VCC2SM3
Source
Target
Taco-VC
Taco-VC-NoSe
Target - Male VCC2TM2 , Source - Male VCC2SM4
Source
Target
Taco-VC
Taco-VC-NoSe

LJ Audio Samples

The following target is a single female speaker from the LJ Speech corpus [2].
The source inputs were taken from the Blizzard 2012 corpus [3], which was not seen during training.

Target
Target
Target
Source
Source
Source
Taco-VC
Taco-VC
Taco-VC



[1] J. Lorenzo-Trueba et al., “The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods,” Submitted to Odyssey, 2018.
[2] K. Ito, “The LJ speech dataset,” 2017. [Online]. Available: https://keithito.com/LJ-Speech-Dataset/.
[3] S. King and V. Karaiskos, “The Blizzard Challenge 2012,” in Proceedings Blizzard Workshop, 2012.