站長阿川 · Feb 26

Pattern Note #15: A simple demo of the OpenAI Vision API

Output

🤖 LLM response:
{
  "store_name": "STARBUCKS",
  "invoice_period": "115年01-02月",
  "invoice_number": "WH-83513141",
  "date": "2026-02-26",
  "items": null,
  "subtotal": null,
  "tax": null,
  "total": 150,
  "payment_method": null,
  "notes": null
}

✅ Structured data extracted successfully!

Store: STARBUCKS
Invoice period: 115年01-02月
Invoice number: WH-83513141
Date: 2026-02-26
Total: 150

Source code

https://github.com/howtomakeaturn/ai-agent-pattern-notes/tree/main/patterns/15-image-receipt

Key point

An image can be passed as a hosted URL.

A local file works too: encode it as base64 and send it as a data URL.
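Both options go through the same image_url field. A minimal sketch, assuming JPEG by default; the helper name `imagePart` is made up for illustration and is not from the repo:

```php
<?php
// Hypothetical helper: build one image part from a hosted URL or a local file.
function imagePart(string $source, string $mime = 'image/jpeg'): array
{
    // Hosted images are passed through as-is.
    if (preg_match('#^https?://#i', $source)) {
        return ['type' => 'image_url', 'image_url' => ['url' => $source]];
    }

    // Local files are inlined as a base64 data URL.
    $bytes = file_get_contents($source);
    if ($bytes === false) {
        throw new RuntimeException("Cannot read image: {$source}");
    }

    return [
        'type' => 'image_url',
        'image_url' => ['url' => "data:{$mime};base64," . base64_encode($bytes)],
    ];
}
```

Base64 inflates the payload by roughly a third, so a hosted URL is usually the lighter choice when the image is already online.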

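Key point

There was a bug at first. What change did you just make to clearly separate these two fields?

In the prompt, I split the original single invoice_number field into two, invoice_period and invoice_number, and added a description for each.

The key is giving examples so the LLM knows exactly which is which. Otherwise, when it sees the whole string 115年01-02月WH-83513141 (period and number printed back to back), it stuffs all of it into invoice_number.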
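The actual prompt lives in the repo linked above. The snippet below is only a guess at its shape, showing how a description plus an example per field can keep the two values apart; the wording and the `$prompt` variable are assumptions, not the author's text:

```php
<?php
// Hypothetical prompt: each field gets a description AND an example,
// so the model does not merge the period and the number into one value.
$prompt = <<<PROMPT
Extract the receipt into JSON with these fields:
- invoice_period: the two-month invoice period, e.g. "115年01-02月"
- invoice_number: two letters, a hyphen, eight digits, e.g. "WH-83513141"
Use null for anything you cannot read.
PROMPT;
```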

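Key point

What change did you just make to get the JSON to parse?

The LLM's reply was not pure JSON; it was wrapped in a markdown code block.

So I added two regex lines that strip the wrapper before decoding.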
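The two regex lines are not shown in this note. A minimal sketch of that kind of cleanup could look like the following; the patterns and the function name `decodeLlmJson` are assumptions, not the repo's code:

```php
<?php
// Hypothetical helper: strip a markdown code fence, then decode the JSON.
function decodeLlmJson(string $raw): ?array
{
    // Three backticks, built here so the sample contains no literal fence.
    $fence = str_repeat('`', 3);

    // Drop a leading fence with an optional language tag such as "json".
    $clean = preg_replace('/^' . $fence . '[a-zA-Z]*\s*/', '', trim($raw));
    // Drop the trailing fence.
    $clean = preg_replace('/\s*' . $fence . '$/', '', $clean);

    $data = json_decode($clean, true);

    // Return null instead of a scalar or a failed decode.
    return is_array($data) ? $data : null;
}
```

Input that is already plain JSON passes through unchanged, so the helper is safe to keep even if the model stops adding the wrapper.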

Key point

The total amount was read incorrectly.

Both gpt-4o and gpt-4o-mini got it wrong.

Switching to gpt-5-mini read it correctly.

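Key point

By the way, getting ``` in the response is a real pain, right?
Is there a way to enforce a strict structure?

Yes, OpenAI offers two ways:

Method 1: JSON Mode (simple)

Add one parameter and the reply is pure JSON with no ``` wrapper (the prompt itself still has to mention JSON):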

<?php
// Add to the request payload
$payload['response_format'] = ['type' => 'json_object'];

Method 2: Structured Outputs (strict, recommended)

Define a JSON Schema directly. Beyond valid JSON, the fields and their types are locked down:

<?php
// Add to the request payload; in strict mode every property is listed in
// "required", and optional values are expressed as a nullable type.
$payload['response_format'] = [
    'type' => 'json_schema',
    'json_schema' => [
        'name' => 'receipt',
        'strict' => true,
        'schema' => [
            'type' => 'object',
            'properties' => [
                'store_name'     => ['type' => ['string', 'null']],
                'invoice_period' => ['type' => ['string', 'null']],
                'invoice_number' => ['type' => ['string', 'null']],
                'date'           => ['type' => ['string', 'null']],
                'total'          => ['type' => ['number', 'null']],
            ],
            'required' => ['store_name', 'invoice_period', 'invoice_number', 'date', 'total'],
            'additionalProperties' => false,
        ],
    ],
];

The upside of Method 2 is that even the type of $data['total'] is guaranteed: a number (or null, as the schema allows), with no surprises.
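With a schema in place, the message content can be decoded directly, with no fence stripping. A small sketch that uses a hand-written stand-in for the API reply (shaped like a chat completions response, values copied from the output at the top of this note) rather than a live call:

```php
<?php
// Stand-in for the decoded HTTP response body; not a real API reply.
$response = [
    'choices' => [[
        'message' => [
            'content' => '{"store_name":"STARBUCKS","invoice_period":"115年01-02月",'
                . '"invoice_number":"WH-83513141","date":"2026-02-26","total":150}',
        ],
    ]],
];

// The content is itself a JSON string that follows the schema.
$data = json_decode(
    $response['choices'][0]['message']['content'],
    true,
    512,
    JSON_THROW_ON_ERROR
);

// total is declared as number-or-null, so a null check is still needed.
$total = $data['total'] === null ? null : (float) $data['total'];
printf("%s: %s\n", $data['store_name'], $total ?? 'unreadable');
```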

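Key point

So the so-called vision API
just means the old user prompt goes from 'content' => 'xxxx'
to 'content' => [...]
plain text becomes an array, right?

Yes, that's all there is to it.

Everything else stays the same: model and the messages structure do not change. Vision has no special API; content just goes from a string to an array, and each element can be a text part or an image_url part.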
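Side by side, the difference looks like this. A sketch only: the prompt text and the example.com URL are placeholders, and the model name is simply the one that worked earlier in this note:

```php
<?php
// Text-only request: content is a plain string.
$textMessage = [
    'role' => 'user',
    'content' => 'Summarize this receipt.',
];

// Vision request: same message shape, content becomes a list of parts.
$visionMessage = [
    'role' => 'user',
    'content' => [
        ['type' => 'text', 'text' => 'Summarize this receipt.'],
        ['type' => 'image_url', 'image_url' => ['url' => 'https://example.com/receipt.jpg']],
    ],
];

// Everything around the message stays the same.
$payload = [
    'model' => 'gpt-5-mini',
    'messages' => [$visionMessage],
];

echo json_encode($payload, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES), "\n";
```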

So multiple images work too?

Yes, just put several image_url entries in the array:

<?php
$message = ['role' => 'user', 'content' => [
    ['type' => 'image_url', 'image_url' => ['url' => '...image 1...']],
    ['type' => 'image_url', 'image_url' => ['url' => '...image 2...']],
    ['type' => 'image_url', 'image_url' => ['url' => '...image 3...']],
    ['type' => 'text', 'text' => 'Compare the totals on these three receipts'],
]];

A practical use: upload a whole month of receipt photos in one request and have the LLM parse them all and compile a report.
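For a batch like that, the content array can be generated from a list of URLs. A minimal sketch; `receiptBatchContent` is a hypothetical helper, not part of the repo:

```php
<?php
// Hypothetical helper: one image part per URL, followed by a single instruction.
function receiptBatchContent(array $imageUrls, string $instruction): array
{
    $parts = array_map(
        fn (string $url): array => ['type' => 'image_url', 'image_url' => ['url' => $url]],
        array_values($imageUrls)
    );
    $parts[] = ['type' => 'text', 'text' => $instruction];

    return $parts;
}
```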

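What about multiple text parts? Or is that pointless?

Technically it works, but it rarely adds anything: the LLM reads them as one continuous text, the same as if you had merged them into a single text part.

The one somewhat useful case is interleaving text and images to hint at which goes with which: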

<?php
$message = ['role' => 'user', 'content' => [
    ['type' => 'text',      'text' => 'This is the January receipt:'],
    ['type' => 'image_url', 'image_url' => ['url' => '...January...']],
    ['type' => 'text',      'text' => 'This is the February receipt:'],
    ['type' => 'image_url', 'image_url' => ['url' => '...February...']],
    ['type' => 'text',      'text' => 'Compare the spending between the two months'],
]];

This makes it easier for the LLM to tell which caption belongs to which image. In most cases, though, a single text part is enough.
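The interleaved layout can also be built from label and URL pairs. A sketch, with `labeledImageContent` as a made-up helper name:

```php
<?php
// Hypothetical helper: label => image URL pairs become alternating text/image parts.
function labeledImageContent(array $labeledUrls, string $question): array
{
    $parts = [];
    foreach ($labeledUrls as $label => $url) {
        // The caption goes right before the image it describes.
        $parts[] = ['type' => 'text', 'text' => "{$label}:"];
        $parts[] = ['type' => 'image_url', 'image_url' => ['url' => $url]];
    }
    // The actual question comes last, after every captioned image.
    $parts[] = ['type' => 'text', 'text' => $question];

    return $parts;
}
```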

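Key point

Real Image RAG is more involved

You embed the images themselves (multimodal embedding), so retrieval can go from text to image or from image to image. For example: "find the product that looks most like this picture."

As of this writing, OpenAI's API does not expose multimodal embeddings; for that you would need something like CLIP or Google Vertex AI.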
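Once embeddings exist, the retrieval step is a nearest-neighbor search. A toy sketch using cosine similarity; the three-dimensional vectors and product names are invented stand-ins, since real embeddings would come from a model such as CLIP and have hundreds of dimensions:

```php
<?php
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

// Toy vectors standing in for real image embeddings.
$catalog = [
    'red-mug'   => [0.9, 0.1, 0.0],
    'blue-mug'  => [0.1, 0.9, 0.0],
    'red-shirt' => [0.7, 0.0, 0.7],
];
$query = [1.0, 0.0, 0.1]; // embedding of the query image (or query text)

// Score every catalog item against the query; keys are preserved.
$scores = array_map(
    fn (array $vec): float => cosineSimilarity($query, $vec),
    $catalog
);
arsort($scores); // highest similarity first
$best = array_key_first($scores);
```

Because text and images share one vector space in a multimodal model, the same search serves both "text to image" and "image to image" queries.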
