Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis

Paper under double-blind review

Abstract

Zero-shot text-to-speech (TTS) aims to synthesize voices from unseen speech prompts, which significantly reduces the data and computation requirements for voice cloning by skipping the fine-tuning process. However, the prompting mechanisms of zero-shot TTS still face two challenges: 1) previous zero-shot TTS models are typically trained with single-sentence prompts, which significantly restricts their performance when more data is available at inference time; 2) the prosodic information in prompts is highly coupled with timbre, so neither can be transferred independently of the other. This paper introduces Mega-TTS, a generic prompting mechanism for zero-shot TTS, to tackle these challenges. Specifically, we design a powerful acoustic autoencoder that separately encodes prosody and timbre information into a compressed latent space while providing high-quality reconstructions. Then, we propose a multi-reference timbre encoder and a prosody latent language model (P-LLM) to extract useful information from multi-sentence prompts. We further leverage the probabilities derived from multiple P-LLM outputs to produce transferable and controllable prosody. Experimental results demonstrate that Mega-TTS can not only synthesize identity-preserving speech from a short prompt of an unseen speaker from arbitrary sources but also consistently outperform the fine-tuning method when the volume of data ranges from 10 seconds to 5 minutes. Furthermore, our method enables the transfer of various speaking styles to the target timbre in a fine-grained and controlled manner. Audio samples can be found on this demo page.

Model Overview


Zero-shot Text-to-Speech (Section 4.2)

We list the speech examples used in our subjective evaluations here.

In the following part, all of the systems have 10 seconds of data available for each speaker.
Target Text | Prompt Example | GT | VALL-E | Fine-tune | Ours

It is true that the horses are here, but the Hurons are gone; let us, then, hunt for the path by which they parted.

If I must say it, mother, I want to go away, and get out of this dead level.

The first mill plant was placed in the woollen factory of James Harrison at Newburgh, New York, about September fifteenth eighteen eighty one.

There their sad condition evoked for a time general commiseration.

Now in this of Burne Jones, the landscape is clearly full of light everywhere, color or glass light: that is, the outline is prepared for modification of color only.

When they entered the stage box on the left the first act was well under way, the scene being the interior of a cabin in the south of Ireland.

In the following part, all of the systems have 60 seconds of data available for each speaker.

(In our experiments, VALL-E fails to generate reasonable speech with prompts longer than 20 seconds and runs out of memory with 60-second prompts.)

Target Text | Prompt Example | GT | VALL-E (20s) | Fine-tune | Ours

It is true that the horses are here, but the Hurons are gone; let us, then, hunt for the path by which they parted.

If I must say it, mother, I want to go away, and get out of this dead level.

The first mill plant was placed in the woollen factory of James Harrison at Newburgh, New York, about September fifteenth eighteen eighty one.

There their sad condition evoked for a time general commiseration.

Now in this of Burne Jones, the landscape is clearly full of light everywhere, color or glass light: that is, the outline is prepared for modification of color only.

When they entered the stage box on the left the first act was well under way, the scene being the interior of a cabin in the south of Ireland.

In the following part, all of the systems have 300 seconds of data available for each speaker.
Target Text | Prompt Example | GT | Fine-tune | Ours

It is true that the horses are here, but the Hurons are gone; let us, then, hunt for the path by which they parted.

If I must say it, mother, I want to go away, and get out of this dead level.

The first mill plant was placed in the woollen factory of James Harrison at Newburgh, New York, about September fifteenth eighteen eighty one.

There their sad condition evoked for a time general commiseration.

Now in this of Burne Jones, the landscape is clearly full of light everywhere, color or glass light: that is, the outline is prepared for modification of color only.

When they entered the stage box on the left the first act was well under way, the scene being the interior of a cabin in the south of Ireland.

(Supplementary Results)
Here, we conduct long-form zero-shot TTS experiments. All of the systems have 10 seconds of data.
Target Text | Prompt Example | VALL-E | Fine-tune | Ours

Dig anywhere in Pompeii and you'll uncover an ancient treasure, a snapshot of a lost Roman world. Most of those who lived in the city probably escaped the eruption of Mount Vesuvius, but what they left behind is still being unearthed 2,000 years later. The new discoveries from this excavation include a huge oven, capable of producing 100 loaves a day, a shrine to make offerings to the gods that's decorated with snakes, and a bed reduced to ashes in a fire. There are skeletons too – three individuals crushed under falling masonry. The archaeologists plan to erect a walkway, so tourists can see what's going on. It will prove popular. This is the dig that recently revealed a fresco, depicting a bread meal that looked like it might be a forerunner of pizza.

(Failed)
(Failed)
(Additional Examples)
Here, we perform voice cloning for classic animated characters.
Target Text | Name | Prompt | Ours

Let's go drink until we can't feel feelings anymore.

SpongeBob

Uh, it's not like the internet to go crazy about something small and stupid.

Peter Griffin

Then I would never talk to that person about boa constrictors, or primeval forests, or stars. I would bring myself down to his level.

Rick

In what a disgraceful light might it not strike so vain a man!

Morty

Prosody Transfer (Section 4.3)

We list the speech examples used in our subjective evaluations here.

γ is the temperature of the proposed prosody interpolation technique. (Section 3.4)
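
As a rough illustration of how a single scalar γ could control such an interpolation, the sketch below mixes two next-token probability distributions over prosody codes. It is not the authors' implementation: this page only states that γ is the temperature of the interpolation technique, so the assumption that γ lies in [0, 1] and acts as a linear mixing weight, along with the names `softmax` and `interpolate_prosody_probs` and the toy logits, is hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpolate_prosody_probs(target_logits, reference_logits, gamma):
    """Mix two distributions over prosody codes (hypothetical helper).

    Assumption: gamma = 0 keeps the distribution conditioned on the
    target speaker's prompt, and gamma = 1 follows the prosody
    reference entirely.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    if len(target_logits) != len(reference_logits):
        raise ValueError("both distributions must cover the same codes")
    p_target = softmax(target_logits)
    p_reference = softmax(reference_logits)
    return [
        (1.0 - gamma) * pt + gamma * pr
        for pt, pr in zip(p_target, p_reference)
    ]


# Toy example with three prosody codes (made-up logits).
mixed = interpolate_prosody_probs([2.0, 0.0, -1.0], [0.0, 1.0, 3.0], 0.5)
```

Because the result is a convex combination of two valid distributions, it is itself a valid distribution that can be sampled from at each decoding step.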

Target Text | Source Speech | Prosody Reference | Daft-Exprt | Ours
You have the right to abandon me, and I have the capital to make you regret.
My heart aches with sadness and tears stream down my face.
She bounced into the classroom, full of the joys of spring.
(Additional Examples)
Some interesting applications for zero-shot prosody transfer.
Setting | Source Speech | Prosody Reference | Generated Examples

Start rapping!

News broadcast

Additional Examples for Rebuttal

Noisy Reference Prompts
Target Text | Prompt | GT | VALL-E | Ours

The old king went out and fought bravely, but at last his sword broke, and he was wounded and his men fled.

Dampness was invading it, the flowers were deserting it.

Captain Blizzard's round pink face creased in his winning smile

Non-standard Speech Prompts
Type of Prompt | Speech Prompt | Generated Examples

Laughing

Crying

Cheering

Chinese Examples
Target Text | Prompt | Ours

非常抱歉,也就是想了解一下您有多大的可能会将我们的广告服务推荐给其他人?

是这样的,这个数学冲刺营是给体验学员的福利,三节课都是免费的。

你好,是我,我有一个小游戏哦,你想和我一起玩吗?

Cross-lingual Examples (Chinese -> English)
Target Text | Prompt | Ours

He was received with great enthusiasm by the employer, who congratulated him on possessing so valuable a slave.

He simply stared at her fixedly with that peculiar expression on his face.

The little prince also pulled up, with a certain sense of dejection, the last little shoots of the baobabs.