
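As developers, we are used to chaining APIs to get the output we need. A similar pattern has emerged in generative AI: model chaining.

Producing high-quality AI video usually takes a carefully designed workflow, not just typing some text and hitting a "Generate!" button. Today I'll walk through one specific stack: Gemini 2.5 Pro (for reasoning and prompting), NanoBanana (for base image generation), and Veo 3.1 (for image-to-video). The goal is to simulate hyper-realistic doorbell security camera footage of a very cute fennec fox playing with LEGO bricks.

Below is a breakdown of how we went from nothing to a finished video, the prompts I used in Google AI Studio, and a critique of the generated video output. Let's get started! :smile: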

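The Model Chain

  1. Gemini. Used to analyze the visual aesthetic and write the detailed prompts that the image and video generation models need.

  2. NanoBanana. Used to generate the initial still image asset in portrait mode (9:16).

  3. Veo 3.1 Fast. Used to apply physics and motion to the still image asset; it also supports portrait mode (9:16).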
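
The three stages above can be sketched as a small pipeline. This is a minimal sketch, not the code used for this post: `write_prompt`, `make_image`, and `animate` are hypothetical callables standing in for calls to Gemini, NanoBanana, and Veo 3.1 Fast, and the `ChainResult` type exists only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

ASPECT_RATIO = "9:16"  # portrait mode, used for both the image and the video


@dataclass
class ChainResult:
    image_prompt: str
    image: bytes
    video_prompt: str
    video: bytes


def run_chain(
    concept: str,
    # Hypothetical wrapper around the LLM: (instruction, optional image) -> prompt text
    write_prompt: Callable[[str, Optional[bytes]], str],
    # Hypothetical wrapper around the image model: (prompt, aspect ratio) -> image bytes
    make_image: Callable[[str, str], bytes],
    # Hypothetical wrapper around the video model: (prompt, image, aspect ratio) -> video bytes
    animate: Callable[[str, bytes, str], bytes],
) -> ChainResult:
    """Run the chain: LLM -> image model -> LLM (with the image) -> video model."""
    image_prompt = write_prompt(f"Write an image prompt for: {concept}", None)
    image = make_image(image_prompt, ASPECT_RATIO)
    # The still image goes back to the LLM so the animation prompt matches what was drawn.
    video_prompt = write_prompt("Write an animation prompt for this still image.", image)
    video = animate(video_prompt, image, ASPECT_RATIO)
    return ChainResult(image_prompt, image, video_prompt, video)
```

Passing the stages in as callables keeps the chain itself independent of any particular SDK, so each model can be swapped or stubbed out while iterating on prompts.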


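Phase 1: The Base Image

The hardest part of image generation is nailing the "vibe" while keeping the character consistent. For this example I wanted a specific medium: grainy night-vision security footage shot from a home doorbell's point of view.

Rather than guessing at keywords myself, I had Gemini act as the prompt engineer. I supplied the concept ("fennec fox, LEGO, night, doorbell camera") and asked it to write the prompt for the image model.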
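
As a rough illustration of that hand-off, the concept keywords can be wrapped in an instruction that asks the LLM to write the image prompt. The wording of the instruction and the `build_meta_prompt` helper are assumptions made for this sketch; the exact request sent to Gemini is not shown in this post.

```python
def build_meta_prompt(
    concepts: list[str],
    target_model: str = "an image generation model",
) -> str:
    """Wrap loose concept keywords in a request for one detailed prompt."""
    keywords = ", ".join(c.strip() for c in concepts if c.strip())
    if not keywords:
        raise ValueError("at least one concept keyword is required")
    return (
        "You are a prompt engineer. "
        f"Write one detailed prompt for {target_model}. "
        f"Concept: {keywords}. "
        "Describe the medium, lens, camera angle, lighting, and subject explicitly."
    )


# The concept keywords are the ones quoted above.
meta_prompt = build_meta_prompt(["fennec fox", "LEGO", "night", "doorbell camera"])
print(meta_prompt)
```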

The Gemini-generated prompt:

A grainy, low-quality doorbell camera snapshot at night. Infrared night vision aesthetic with a slight monochromatic green tint. A wide-angle fisheye lens view looking down at a front porch welcome mat. A cute fennec fox with large ears is sitting on the mat, looking up at the camera with glowing reflective eyes. The fox is surrounded by scattered LEGO bricks. The LEGO bricks are arranged on the floor to clearly spell out the word "HI :)" in block letters. Digital overlay text in the corner says "FRONT DOOR - LIVE" and the current timestamp.

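Why this works:

  • Artifact injection: words like "grainy," "low-quality," and "monochromatic green tint" keep the model from rendering the image too clean or too artistic. They reinforce realism through imperfection.

  • Camera specs: specifying a "fisheye lens" and "looking down" produces the perspective distortion characteristic of Ring/Nest cameras.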
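
A quick way to apply both points is to check a prompt for artifact and camera terms before sending it to the image model. The sketch below is illustrative only: the term lists are taken from the prompt above, `missing_terms` is a hypothetical helper, and a plain substring match is a deliberately crude check.

```python
ARTIFACT_TERMS = ("grainy", "low-quality", "monochromatic green")
CAMERA_TERMS = ("fisheye lens", "looking down")

# Opening sentences of the Gemini-generated image prompt quoted above.
IMAGE_PROMPT = (
    "A grainy, low-quality doorbell camera snapshot at night. "
    "Infrared night vision aesthetic with a slight monochromatic green tint. "
    "A wide-angle fisheye lens view looking down at a front porch welcome mat."
)


def missing_terms(prompt: str, terms: tuple[str, ...]) -> list[str]:
    """Return the terms that do not appear in the prompt (case-insensitive)."""
    lowered = prompt.lower()
    return [term for term in terms if term not in lowered]


print(missing_terms(IMAGE_PROMPT, ARTIFACT_TERMS))  # -> []
print(missing_terms(IMAGE_PROMPT, CAMERA_TERMS))    # -> []

# A bare concept with no medium or lens details fails both checks.
bare = "A cute fennec fox with LEGO bricks"
print(missing_terms(bare, ARTIFACT_TERMS + CAMERA_TERMS))
```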

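The result:

The still image from NanoBanana was nearly perfect. The lighting is flat (typical of infrared footage), the eyes glow (retroreflection), and the angle is unmistakably a "doorbell" angle.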



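Phase 2: Animation

If you simply tell a video model to "make it move," models tend to shake the camera at random or morph the subject. You need to give direction. To do that, I fed the still image back into Gemini and asked it to generate animation prompts. After reviewing the candidate prompts, I chose one that focused on interaction and physics.

The video prompt: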

The cute fennec fox looks down from the camera towards the LEGO bricks on the mat. It gently extends one front paw and nudges a loose LEGO brick near the "HI", sliding it slightly across the mat. The fox then looks back up at the camera with a playful, innocent expression. Its ears twitch. The camera remains static.

I fed this prompt and the still image into Veo 3.1 Fast.
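
One way to make that kind of direction repeatable is to assemble the video prompt from ordered subject actions and always end with an explicit camera instruction. The sketch below is illustrative only: `build_video_prompt` is a hypothetical helper, not part of any SDK, and the prompt used here was written by Gemini rather than by a template.

```python
def build_video_prompt(
    actions: list[str],
    camera: str = "The camera remains static.",
) -> str:
    """Join subject actions into one prompt and append the camera instruction."""
    sentences = []
    for action in actions:
        sentence = action.strip()
        if not sentence:
            continue  # skip empty entries
        if sentence[-1] not in ".!?":
            sentence += "."
        sentences.append(sentence)
    if not sentences:
        raise ValueError("describe at least one subject action")
    sentences.append(camera)  # the camera instruction always comes last
    return " ".join(sentences)


video_prompt = build_video_prompt([
    "The cute fennec fox looks down from the camera towards the LEGO bricks on the mat",
    "It gently extends one front paw and nudges a loose LEGO brick",
    "Its ears twitch",
])
print(video_prompt)
```

Keeping the camera instruction as a separate default argument makes it hard to forget, which matters because an unpinned camera is one of the most common sources of unwanted motion.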


Phase 3: Analyzing the Veo Output

Let's look at the generated video and analyze how well it followed the prompt:

{%twitter https://x.com/DynamicWebPaige/status/1989549550046986570 %}

The Wins

  1. Temporal consistency (lighting and texture):
The most impressive aspect is the consistency of the night-vision texture. The "grain" doesn't shimmer uncontrollably, and the monochromatic green remains stable throughout the 7 seconds. The fur texture on the fox changes naturally as it moves, rather than boiling or morphing.
  2. The "fisheye" effect:
Veo 3.1 respected the distortion of the original image. When the fox leans down and back up, it moves *within* the 3D space of that distorted lens. It doesn't flatten out.
  3. Ear dynamics:
The prompt specifically asked for "ears twitch." Veo nailed this. The ears move independently and reactively, which is a critical trait of fennec foxes. This adds a layer of biological realism to the generated movement.
  4. Camera lock:
The prompt specified "The camera remains static." This is crucial. Early video models often added unnecessary pans or zooms. Veo kept the frame locked, reinforcing the "mounted security camera" aesthetic.

The Bugs

  1. Object permanence (the LEGO bricks):
While the prompt asked the fox to "nudge a loose LEGO," the model struggled with rigid body physics. Instead of a clean slide, the LEGOs near the paws tend to morph or "melt" slightly as the fox interacts with them. The "HI" text also loses integrity, shifting into abstract shapes by the end of the clip.
  2. Action interpretation:
The prompt asked for a gentle paw extension. The model interpreted this more as a "pounce" or a head-dive. The fox dips its whole upper body down rather than isolating the paw. While cute, it’s a deviation from the specific articulation requested.
  3. Text overlay (OCR hallucination):
The original image had a crisp timestamp. As soon as motion begins, the text overlay ("FRONT DOOR - LIVE") becomes unstable. Video models still struggle to keep text overlays static while animating the pixels behind them. The timestamp blurs and fails to count up logically.
  4. The "Welcome" mat:
If you look closely at the mat, the text (presumably "WELCOME") is geometrically inconsistent. As the fox moves over it, the letters seem to shift their orientation slightly, revealing that the model treats the mat as a texture rather than a flat plane in 3D space.
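
The observations above can be recorded as a simple adherence checklist, one entry per behavior that was checked. This is only a bookkeeping sketch: the `Check` type and the pass/partial/fail labels are a summary introduced for illustration, not a metric reported for Veo.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    expectation: str
    verdict: str  # "pass", "partial", or "fail" (labels assumed for this sketch)
    note: str


CHECKS = [
    Check("Ears twitch", "pass", "ears move independently and reactively"),
    Check("The camera remains static", "pass", "frame stays locked"),
    Check("Gently extend one front paw", "partial", "whole upper body dips instead"),
    Check("Nudge a loose LEGO brick", "fail", "bricks morph or melt near the paws"),
    Check("Overlay text stays stable", "fail", "timestamp blurs and does not count up"),
]


def summarize(checks: list[Check]) -> Counter:
    """Count how many checks landed on each verdict."""
    return Counter(check.verdict for check in checks)


summary = summarize(CHECKS)
for verdict in ("pass", "partial", "fail"):
    print(f"{verdict}: {summary[verdict]}")
```

Writing the checklist down before generating a second take makes it easier to see whether a prompt change fixed one behavior at the cost of another.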

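TL;DR

Using an LLM like Gemini to write prompts for media models is a big efficiency boost! Veo 3.1 Fast's grasp of lighting, texture, and biological motion (those ears!) is impressive, but like all current video models it still struggles with rigid-body interactions (such as the LEGO bricks) and static text overlays.

Quick tips: in the text-to-image stage, be specific about camera angle and lighting. In the video stage, focus the prompt on the subject's motion, and expect background objects to be somewhat fluid. Gemini 2.5 Pro is a good choice for helping write the prompts.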


Original source: https://dev.to/googleai/chaining-veo-31-and-nanobanana-with-gemini-3ffi

