We list the speech examples used in our subjective evaluations here.
Zero-shot text-to-speech (TTS) aims to synthesize voices from unseen speech prompts, which significantly reduces the data and computation required for voice cloning by skipping the fine-tuning process. However, the prompting mechanisms of zero-shot TTS still face two challenges: 1) Previous zero-shot TTS systems are typically trained with single-sentence prompts, which significantly restricts their performance when more prompt data is available at inference time. 2) The prosodic information in prompts is highly coupled with timbre, so neither can be transferred independently of the other. This paper introduces Mega-TTS, a generic prompting mechanism for zero-shot TTS, to tackle these challenges. Specifically, we design a powerful acoustic autoencoder that separately encodes prosody and timbre information into a compressed latent space while providing high-quality reconstructions. Then, we propose a multi-reference timbre encoder and a prosody latent language model (P-LLM) to extract useful information from multi-sentence prompts. We further leverage the probabilities derived from multiple P-LLM outputs to produce transferable and controllable prosody. Experimental results demonstrate that Mega-TTS not only synthesizes identity-preserving speech from a short prompt of an unseen speaker from arbitrary sources, but also consistently outperforms the fine-tuning method when the volume of data ranges from 10 seconds to 5 minutes. Furthermore, our method enables the transfer of various speaking styles to the target timbre in a fine-grained and controlled manner. Audio samples can be found on this demo page.
Target Text | Prompt Example | GT | VALL-E | Fine-tune | Ours |
---|---|---|---|---|---|
It is true that the horses are here, but the Hurons are gone; let us, then, hunt for the path by which they parted. | | | | | |
If I must say it, mother, I want to go away, and get out of this dead level. | | | | | |
The first mill plant was placed in the woollen factory of James Harrison at Newburgh, New York, about September fifteenth eighteen eighty one. | | | | | |
There their sad condition evoked for a time general commiseration. | | | | | |
Now in this of Burne Jones, the landscape is clearly full of light everywhere, color or glass light: that is, the outline is prepared for modification of color only. | | | | | |
When they entered the stage box on the left the first act was well under way, the scene being the interior of a cabin in the south of Ireland. | | | | | |
(In our experiments, VALL-E fails to generate reasonable speech with prompts longer than 20 seconds and runs out of memory with 60-second prompts.)
Target Text | Prompt Example | GT | VALL-E (20s) | Fine-tune | Ours |
---|---|---|---|---|---|
It is true that the horses are here, but the Hurons are gone; let us, then, hunt for the path by which they parted. | | | | | |
If I must say it, mother, I want to go away, and get out of this dead level. | | | | | |
The first mill plant was placed in the woollen factory of James Harrison at Newburgh, New York, about September fifteenth eighteen eighty one. | | | | | |
There their sad condition evoked for a time general commiseration. | | | | | |
Now in this of Burne Jones, the landscape is clearly full of light everywhere, color or glass light: that is, the outline is prepared for modification of color only. | | | | | |
When they entered the stage box on the left the first act was well under way, the scene being the interior of a cabin in the south of Ireland. | | | | | |
Target Text | Prompt Example | GT | Fine-tune | Ours |
---|---|---|---|---|
It is true that the horses are here, but the Hurons are gone; let us, then, hunt for the path by which they parted. | | | | |
If I must say it, mother, I want to go away, and get out of this dead level. | | | | |
The first mill plant was placed in the woollen factory of James Harrison at Newburgh, New York, about September fifteenth eighteen eighty one. | | | | |
There their sad condition evoked for a time general commiseration. | | | | |
Now in this of Burne Jones, the landscape is clearly full of light everywhere, color or glass light: that is, the outline is prepared for modification of color only. | | | | |
When they entered the stage box on the left the first act was well under way, the scene being the interior of a cabin in the south of Ireland. | | | | |
Target Text | Prompt Example | VALL-E | Fine-tune | Ours |
---|---|---|---|---|
Dig anywhere in Pompeii and you'll uncover an ancient treasure, a snapshot of a lost Roman world. Most of those who lived in the city probably escaped the eruption of Mount Vesuvius, but what they left behind is still being unearthed 2,000 years later. The new discoveries from this excavation include a huge oven, capable of producing 100 loaves a day, a shrine to make offerings to the gods that's decorated with snakes, and a bed reduced to ashes in a fire. There are skeletons too – three individuals crushed under falling masonry. The archaeologists plan to erect a walkway, so tourists can see what's going on. It will prove popular. This is the dig that recently revealed a fresco, depicting a bread meal that looked like it might be a forerunner of pizza. |
(Failed) | |||
(Failed) |
Target Text | Name | Prompt | Ours |
---|---|---|---|
Let's go drink until we can't feel feelings anymore. | Sponge Bob | | |
Uh, it's not like the internet to go crazy about something small and stupid. | Peter Griffin | | |
Then I would never talk to that person about boa constrictors, or primeval forests, or stars. I would bring myself down to his level. | Rick | | |
In what a disgraceful light might it not strike so vain a man! | Morty | | |
γ is the temperature of the proposed prosody interpolation technique (Section 3.4).
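To make the role of γ concrete, here is a minimal sketch of how two P-LLM output distributions could be blended. It is an illustration under stated assumptions, not the paper's implementation: this page does not give the formula, so the sketch assumes that γ lies in [0, 1] and acts as a convex mixing weight between the next-code probabilities predicted from the source speech and those predicted from the prosody reference. The function names `interpolate_prosody_probs` and `sample_code`, and the toy three-code distributions, are hypothetical.

```python
import random


def interpolate_prosody_probs(p_source, p_reference, gamma):
    """Blend two next-code distributions (assumed convex combination).

    gamma = 0 keeps the source prosody distribution unchanged;
    gamma = 1 uses the reference prosody distribution only.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    if len(p_source) != len(p_reference):
        raise ValueError("distributions must cover the same set of codes")
    mixed = [(1.0 - gamma) * a + gamma * b
             for a, b in zip(p_source, p_reference)]
    total = sum(mixed)  # renormalize to guard against rounding drift
    return [m / total for m in mixed]


def sample_code(probs, rng):
    """Draw one prosody code index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy distributions over three hypothetical prosody codes.
p_a = [0.7, 0.2, 0.1]  # stands in for the P-LLM output given the source speech
p_b = [0.1, 0.2, 0.7]  # stands in for the P-LLM output given the prosody reference
mixed = interpolate_prosody_probs(p_a, p_b, gamma=0.5)
code = sample_code(mixed, random.Random(0))
```

Under this assumption, sweeping γ from 0 to 1 moves the sampled prosody gradually from the source speaker's style toward the reference style, while the timbre is handled separately.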
Target Text | Source Speech | Prosody Reference | Daft-Exprt | Ours |
---|---|---|---|---|
You have the right to abandon me, and I have the capital to make you regret. | | | | |
My heart aches with sadness and tears stream down my face. | | | | |
She bounced into the classroom, full of the joys of spring. | | | | |
Setting | Source Speech | Prosody Reference | Generated Examples |
---|---|---|---|
Start rapping! | | | |
News broadcast | | | |
Target Text | Prompt | GT | VALL-E | Ours |
---|---|---|---|---|
The old king went out and fought bravely, but at last his sword broke, and he was wounded and his men fled. | | | | |
Dampness was invading it, the flowers were deserting it. | | | | |
Captain Blizzard's round pink face creased in his winning smile | | | | |
Type of Prompt | Speech Prompt | Generated Examples |
---|---|---|
Laughing | | |
Crying | | |
Cheering | | |
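Target Text | Prompt | Ours |
---|---|---|
非常抱歉,也就是想了解一下您有多大的可能会将我们的广告服务推荐给其他人? (English: "I'm very sorry; I would just like to know how likely you are to recommend our advertising service to others?") | | |
是这样的,这个数学冲刺营是给体验学员的福利,三节课都是免费的。 (English: "Here is how it works: this math sprint camp is a perk for trial students, and all three lessons are free.") | | |
你好,是我,我有一个小游戏哦,你想和我一起玩吗? (English: "Hello, it's me. I have a little game; would you like to play it with me?") | | |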
Target Text | Prompt | Ours |
---|---|---|
He was received with great enthusiasm by the employer, who congratulated him on possessing so valuable a slave. | | |
He simply stared at her fixedly with that peculiar expression on his face. | | |
The little prince also pulled up, with a certain sense of dejection, the last little shoots of the baobabs. | | |