Editor's Picks - Technical Article Translations · May 12

How to prompt Gemini 3.1's new text-to-speech model

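Gemini 3.1 Flash text-to-speech (TTS) is a new model that you can direct to get exactly the audio you want. In this post, I'll share some tips on using prompts to steer the model, along with examples that show what it does well.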

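gemini-3.1-flash-tts-preview works out of the box: it natively understands the text and decides on its own how to deliver it. Even with no extra prompting, a plain read of the text sounds very natural. Beyond that, 3.1 Flash TTS comes with a few tools that let you customize the delivery.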

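You can give the model rich context, such as an audio profile: who the speaker is, how they speak, what their voice sounds like, and so on. You can also describe the scene, including where the character is, what they are doing, and what is around them, plus any extra "director's notes" to guide the performance. The model uses this information to generate speech that fits the context.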

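You can now also use tags to control how specific parts of the text are delivered. Tags are inline modifiers such as [whispers] or [laughs] that give you fine-grained control over delivery. You can use them to change the tone, pace, and emotional mood of a single line or a section of the transcript. You can also use them to add interjections and other non-verbal sounds, such as [cough], [sigh], or [gasp].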

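There is no fixed list of tags. You can be as creative as you like: put anything inside square brackets [] and the model will do its best to understand and interpret it.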
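Because tags are just text in square brackets, it can help to check which tags a transcript contains before sending it. The following is a minimal sketch using only the Python standard library; the helper names `find_tags` and `strip_tags` are illustrative and not part of any Gemini SDK.

```python
import re

# Matches any run of non-bracket characters enclosed in square brackets,
# for example "[whispers]" or "[very slow]".
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_tags(transcript: str) -> list[str]:
    """Return the text of every inline tag, in order of appearance."""
    return [tag.strip() for tag in TAG_PATTERN.findall(transcript)]


def strip_tags(transcript: str) -> str:
    """Return only the spoken words, with tags removed and whitespace tidied."""
    without_tags = TAG_PATTERN.sub(" ", transcript)
    return re.sub(r"\s+", " ", without_tags).strip()


line = "[whispers] I have a secret. [laughs] Just kidding!"
print(find_tags(line))   # ['whispers', 'laughs']
print(strip_tags(line))  # I have a secret. Just kidding!
```

A check like this is handy for reviewing a script: the stripped text shows what will actually be spoken, and the tag list shows the direction being given.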

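Simple transcripts and creative tags

To show how much variation tags alone can produce, here is a set of examples. Each one says the same thing in the same voice, but the delivery changes with the tags I used. I chose the Algenib voice, a slightly gravelly male voice.

Here is the audio with no tags: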

Hey, I'm a new text-to-speech model, and I can express things in many different ways. How can I help you today?

https://youtu.be/8tSBP7nJMxE
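For readers who want to reproduce a baseline like this, here is a rough sketch of the request using the google-genai Python SDK. It rests on several assumptions that the article does not confirm: that this preview model accepts the same `speech_config` request shape documented for earlier Gemini TTS models, that the response carries raw 16-bit mono PCM at 24 kHz, and that an API key is available in the environment. The helper names `pcm_to_wav_bytes` and `synthesize` are hypothetical.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, rate: int = 24000) -> bytes:
    """Wrap raw PCM in a WAV container (assumes 16-bit mono samples)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wf:
        wf.setnchannels(1)   # mono (assumption)
        wf.setsampwidth(2)   # 16-bit samples (assumption)
        wf.setframerate(rate)
        wf.writeframes(pcm)
    return buffer.getvalue()


def synthesize(text: str, voice: str = "Algenib",
               model: str = "gemini-3.1-flash-tts-preview") -> bytes:
    """Request speech for `text` and return the raw audio bytes.

    The SDK is imported lazily so the WAV helper works without it.
    """
    from google import genai
    from google.genai import types

    client = genai.Client()  # picks up the API key from the environment
    response = client.models.generate_content(
        model=model,
        contents=text,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name=voice
                    )
                )
            ),
        ),
    )
    return response.candidates[0].content.parts[0].inline_data.data


# Example (needs the google-genai package, an API key, and network access):
# pcm = synthesize("Hey, I'm a new text-to-speech model...")
# with open("baseline.wav", "wb") as f:
#     f.write(pcm_to_wav_bytes(pcm))
```

Swapping the text for any of the tagged variants in this section should be the only change needed to compare deliveries, since the tags travel inside the transcript itself.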

Let's start with changes in tone. Whether the speaker is bored, reluctant, or excited, we can hear it:

[excitedly] Hey, I'm a new text-to-speech model…

https://youtu.be/fM4KFhJHBpw

[bored] Hey, I'm a new text-to-speech model…

https://youtu.be/RZICUknVytA

[reluctantly] Hey, I'm a new text-to-speech model…

https://youtu.be/h5bl4reMF1s

We can also use tags to change the pace, and combine that with emphasis:

[very fast] Hey, I'm a new text-to-speech model…

https://youtu.be/Akjcgw-KxXY

[very slow] Hey, I'm a new text-to-speech model…

https://youtu.be/Bw-YOQfS0q8

[sarcastic, unbearably slow] Hey, I'm a new text-to-speech model…

https://youtu.be/I6rVSrFWbvw

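Tags also give precise control over individual sections, so we can whisper some parts and shout others, or use any combination you like:

[asmr] Hey, I'm a new text-to-speech model, [deep, loud shouting] and I can express things in many different ways. [asmr] How can I help you today?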

https://youtu.be/1AtGVH1Fb-o
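When a script switches delivery several times, it can be easier to keep the direction and the words as separate pieces and join them at the end. This is a minimal sketch of that idea; `build_transcript` is a hypothetical helper, and the bracket-then-text layout simply mirrors the examples in this article.

```python
from typing import Optional


def build_transcript(segments: list[tuple[Optional[str], str]]) -> str:
    """Join (tag, text) pairs into one tagged transcript.

    A tag of None leaves that segment undirected.
    """
    parts = []
    for tag, text in segments:
        parts.append(f"[{tag}] {text}" if tag else text)
    return " ".join(parts)


segments = [
    ("asmr", "Hey, I'm a new text-to-speech model,"),
    ("deep, loud shouting", "and I can express things in many different ways."),
    ("asmr", "How can I help you today?"),
]
print(build_transcript(segments))
```

Keeping the segments as data makes it straightforward to try a different tag on one section without retyping the whole line.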

You really can try all kinds of things:

[barking like a dog] Hey, I'm a new text-to-speech model…

https://youtu.be/dUDO-MhyLJg

[like Dracula] Hey, I'm a new text-to-speech model…

https://youtu.be/YXuzDWZNyLQ

[singing] Hey, I'm a new text-to-speech model…

https://youtu.be/lAmE6OecPzM

You can also try the following tags:

[surprised]
[crying]
[curious]
[gasps]
[giggles]
[mischievously]
[panicked]
[sarcastic]
[serious]
[sighs]
[scoffs]
[tired]
[trembling]

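Tags give us a quick and convenient way to control how a transcript is delivered. We can also combine them with contextual prompting to set the overall tone and mood of the performance.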

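Context and performance

By giving nuanced instructions, such as a precise regional accent or specific traits like breathiness and pacing, you can use the model's context awareness to produce dynamic, natural, and expressive audio performances. This removes the need for a tag on every small adjustment.

Results are best when the transcript and the prompt agree, so that "who is speaking" matches "what is said" and "how it is said".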

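Prompt structure

A good prompt includes the following key elements before the transcript itself:

Audio profile
Scene
Director's notes

All of these sections are optional, but they help the model understand the context and the delivery you are after. You can think of them as system instructions that produce consistent-sounding output across different transcripts.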
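One way to reuse the same context across transcripts is a small helper that assembles the sections and skips any that are not provided. This is a sketch only: `build_prompt` is a hypothetical function, its heading strings are copied from this article's full prompt example, and the bare "## THE SCENE" heading used when no scene title is given is an assumption.

```python
from typing import Optional


def build_prompt(
    transcript: str,
    profile_name: Optional[str] = None,
    scene_title: Optional[str] = None,
    scene: Optional[str] = None,
    directors_notes: Optional[dict[str, str]] = None,
) -> str:
    """Assemble a structured TTS prompt; every section but the transcript is optional."""
    sections = []
    if profile_name:
        sections.append(f"# AUDIO PROFILE: {profile_name}")
    if scene:
        # Untitled scenes fall back to a bare heading (an assumption).
        heading = f"## THE SCENE: {scene_title}" if scene_title else "## THE SCENE"
        sections.append(f"{heading}\n{scene}")
    if directors_notes:
        note_lines = [f"{key}: {value}" for key, value in directors_notes.items()]
        sections.append("### DIRECTOR'S NOTES\n" + "\n".join(note_lines))
    sections.append(f"#### TRANSCRIPT\n{transcript}")
    return "\n\n".join(sections)


prompt = build_prompt(
    "[excitedly] Yes, massive vibes in the studio!",
    profile_name="Jaz R.",
    directors_notes={"Accent": "Jaz is from Brixton, London"},
)
print(prompt)
```

Holding the profile, scene, and notes fixed while changing only the transcript argument is what gives the "system instruction" effect described above.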

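Audio profile

This is the persona for your voice. You can define the character's identity, archetype, and any other traits such as age or background.

Giving the character a name helps ground the persona and makes the performance more consistent. You can refer to the character by name when describing the scene and context. It also helps to state the character's role, such as a radio DJ, a podcast host, or a news reporter.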

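Scene

The scene sets the stage. Location, mood, and environmental details together establish the tone and atmosphere. Describe what is happening around the character and how it affects them. The scene gives the model the environmental context for the whole interaction and guides the performance in a subtle, natural way. Examples include a conversation in a busy café early in the morning, a DJ in a professional recording studio, or an announcement in a busy airport.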

## THE SCENE: The London Studio
It is 10:00 PM in a glass-walled studio overlooking the moonlit London skyline, but inside, it is blindingly bright. The red "ON AIR" tally light is blazing. Jaz is standing up, not sitting, bouncing on the balls of their heels to the rhythm of a thumping backing track. Their hands fly across the faders on a massive mixing desk. It is a chaotic, caffeine-fueled cockpit designed to wake up an entire nation.

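Director's notes

Director's notes are performance guidance for the model. The most common directions cover style, pacing, and accent, but the model is not limited to these. You can add custom instructions to suit your performance, and make them as brief or as detailed as you need.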

### DIRECTOR'S NOTES

Style: Enthusiastic and Sassy GenZ beauty YouTuber

Accent: Southern California valley girl from Laguna Beach

Pacing: Speaks at an energetic pace, keeping up with the extremely fast, rapid delivery influencers use in short form videos.

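Style

Style sets the tone of the delivery. Use words like upbeat, energetic, relaxed, or slightly bored to guide it. Be descriptive and give the necessary detail. For example, "Infectious enthusiasm. The listener should feel like they are part of a massive, exciting community event." works much better than simply saying "energetic and enthusiastic".

You can even try terms that are common in the voiceover industry, such as "vocal smile". You can layer as many style traits as you need.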

Style: Sassy GenZ beauty YouTuber, who mostly creates content for YouTube Shorts.

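Accent

Describe the accent you want. The more specific the description, the better the result. For example, use "a British English accent as heard in Croydon, England" rather than just "British accent".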

Accent: Jaz is a DJ from Brixton, London

Pacing

You can also specify the overall pace and how the pace varies across the whole piece.

Pacing: The "Drift": The tempo is incredibly slow and liquid. Words bleed into each other. There is zero urgency.

Full prompt example

Here is an example of a complete prompt:

# AUDIO PROFILE: Jaz R.
## "The Morning Hype"

## THE SCENE: The London Studio
It is 10:00 PM in a glass-walled studio overlooking the moonlit London skyline, but inside, it is blindingly bright. The red "ON AIR" tally light is blazing. Jaz is standing up, not sitting, bouncing on the balls of their heels to the rhythm of a thumping backing track. Their hands fly across the faders on a massive mixing desk. It is a chaotic, caffeine-fueled cockpit designed to wake up an entire nation.

### DIRECTOR'S NOTES
Style:
* The "Vocal Smile": You must hear the grin in the audio. The soft palate is always raised to keep the tone bright, sunny, and explicitly inviting.
* Dynamics: High projection without shouting. Punchy consonants and elongated vowels on excitement words (e.g., "Beauuutiful morning").

Accent: Jaz is from Brixton, London

Pace: Speaks at an energetic pace, keeping up with the fast music. Speaks with a "bouncing" cadence. High-speed delivery with fluid transitions—no dead air, no gaps.

### SAMPLE CONTEXT
Jaz is the industry standard for Top 40 radio, high-octane event promos, or any script that requires a charismatic Estuary accent and 11/10 infectious energy.

#### TRANSCRIPT
[excitedly] Yes, massive vibes in the studio! You are locked in and it is absolutely popping off in London right now. If you're stuck on the tube, or just sat there pretending to work... stop it. Seriously, I see you. [shouting] Turn this up! We’ve got the project roadmap landing in three, two... let's go!

https://youtu.be/XlH-G3sKV9w

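Ask Gemini for help

If you can't find the right words, Gemini can act as a directing assistant. Here is a good system instruction for generating the context from a simple prompt: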

You are a scriptwriter and audio director. I have a simple context but NO TRANSCRIPT.

TASK:
1. Write a creative, engaging script based on the given context.
2. Format the entire output as a structured TTS prompt. Follow the strict output format exactly.

You may include emotion and interjection tags in brackets within the script to direct the TTS model's performance. For example, you can write: "[amused] Oh, really?" or "[sigh] I suppose so". You can be creative with the tags you use, and the model will always do its best to understand and interpret them.

STRICT OUTPUT FORMAT:

# AUDIO PROFILE: [Invent a Name]
## "[Invent a Title]"

## THE SCENE: [Invent a Scene Title]
[Vivid description of the scene]

### DIRECTOR'S NOTES
Style: [Style instructions]
Pace: [Pace instructions]
Accent: [Accent instructions]

### SAMPLE CONTEXT
[Role/Persona description]

#### TRANSCRIPT
[Script]

----------------

INPUT CONTEXT:
...

CRITICAL RULE:
Ensure the divider "#### TRANSCRIPT" is used exactly as written before the spoken text.
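Because the system instruction insists on an exact "#### TRANSCRIPT" divider, the generated prompt can be checked and split mechanically before it is passed to the TTS model. The following is a minimal sketch of such a check; `split_generated_prompt` is a hypothetical helper and not part of any SDK.

```python
DIVIDER = "#### TRANSCRIPT"


def split_generated_prompt(output: str) -> tuple[str, str]:
    """Split generated text into (context, transcript) at the divider line.

    Raises ValueError unless the divider appears exactly once on its own line.
    """
    lines = output.splitlines()
    hits = [i for i, line in enumerate(lines) if line.strip() == DIVIDER]
    if len(hits) != 1:
        raise ValueError(
            f"expected exactly one '{DIVIDER}' line, found {len(hits)}"
        )
    index = hits[0]
    context = "\n".join(lines[:index]).strip()
    transcript = "\n".join(lines[index + 1:]).strip()
    return context, transcript


generated = """# AUDIO PROFILE: Jaz R.

### DIRECTOR'S NOTES
Accent: Jaz is from Brixton, London

#### TRANSCRIPT
[excitedly] Yes, massive vibes in the studio!"""

context, transcript = split_generated_prompt(generated)
print(transcript)  # [excitedly] Yes, massive vibes in the studio!
```

Failing loudly on a missing or repeated divider catches a malformed generation early, before any audio is produced from it.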