Image Understanding Guide: One Setup for Chat API, Responses API, and Claude API

Posted March 24, 2026 by The XAI Tech Team · 6 min read

If you already know how to call a model with text, adding image understanding to your system usually does not require switching to a different model. What changes most often is the request shape. In the official OpenAI docs, Chat Completions and Responses use different field names for image input. In the official Anthropic docs, Claude Messages uses yet another content-block structure. The good news is that on XAI Router, all three request styles can be routed to the same class of vision-capable models, such as gpt-5.4 or gpt-5.4-mini.

This guide does one thing: it rewrites the official OpenAI and Claude image-input examples into versions you can send directly to https://api.xairouter.com. For consistency, the main examples below use gpt-5.4 by default. If you want lower cost, you can replace the model value with gpt-5.4-mini.


One-line takeaway

  • If you already use an OpenAI-compatible chat client, use /v1/chat/completions
  • If you want the newer OpenAI interface for multimodal workflows, use /v1/responses
  • If your existing system is Claude / Anthropic compatible, use /v1/messages
  • All three styles can target a vision-capable model such as gpt-5.4 or gpt-5.4-mini

What is actually different between the three APIs?

API             Endpoint               Text block           Image block              Best fit
Chat API        /v1/chat/completions   type: "text"         type: "image_url"        Existing OpenAI Chat-compatible clients
Responses API   /v1/responses          type: "input_text"   type: "input_image"      New projects and unified multimodal input
Claude API      /v1/messages           type: "text"         type: "image" + source   Anthropic / Claude-compatible clients

All examples in this guide use public image URLs. If you later switch to Base64, keep following each API's official field structure instead of copying image fields from one API shape into another.
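To make the contrast concrete, here is a minimal Python sketch that builds the same image block in all three shapes. The URL is just a placeholder; only the field structure matters.

```python
# One URL, three image-block shapes. These mirror the table above.
url = "https://filelist.cn/disk/0/1.jpg"

chat_block = {"type": "image_url", "image_url": {"url": url}}            # Chat API
responses_block = {"type": "input_image", "image_url": url}              # Responses API
claude_block = {"type": "image", "source": {"type": "url", "url": url}}  # Claude API
```

Note that the same URL sits at a different depth in each shape, which is exactly why copying an image field from one dialect into another fails.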


Set one environment variable first

export XAI_API_KEY="sk-..."

Replace the image URL in the examples below with your own. For consistency, all single-image examples use this image:

https://filelist.cn/disk/0/1.jpg

Option A: Chat API

If you already have an OpenAI-compatible chat integration, this is the lowest-change path. Your current example is already on the right track. Usually the main improvement is to make the prompt more explicit.

curl https://api.xairouter.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -d '{
    "model": "gpt-5.4",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Analyze this image and return 3 points: 1. main subject; 2. important details; 3. likely context."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://filelist.cn/disk/0/1.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }'

If you want to keep your current style, that is also fine. You can change the prompt back to something short like "Analyze this" and it will still work. If you want a lighter model, replace gpt-5.4 with gpt-5.4-mini.
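If you build the request body in code rather than by hand, a small helper keeps the block structure in one place. This is a sketch; the function name is ours, not part of any SDK.

```python
import json

def chat_image_request(model, prompt, image_urls, max_tokens=300):
    """Serialize a Chat Completions body: one text block plus N image_url blocks."""
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    })

payload = chat_image_request(
    "gpt-5.4",
    "Analyze this image and return 3 points: 1. main subject; 2. important details; 3. likely context.",
    ["https://filelist.cn/disk/0/1.jpg"],
)
```

The resulting JSON string can be POSTed to /v1/chat/completions exactly like the curl body above.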


Option B: Responses API

If you want to unify image input, tool use, and future multimodal workflows behind the newer OpenAI interface, use /v1/responses. The two main differences are:

  1. messages becomes input
  2. The text block and image block become input_text and input_image

curl https://api.xairouter.com/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -d '{
    "model": "gpt-5.4",
    "input": [
      {
        "role": "user",
        "content": [
          {
            "type": "input_text",
            "text": "Analyze this image and return 3 points: 1. main subject; 2. important details; 3. likely context."
          },
          {
            "type": "input_image",
            "image_url": "https://filelist.cn/disk/0/1.jpg"
          }
        ]
      }
    ],
    "max_output_tokens": 300
  }'

If your application will later use tools, structured outputs, or more complex multimodal chains, Responses is usually the better primary entry point. If cost matters more, switch the model to gpt-5.4-mini.
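The mapping from the Chat shape to the Responses shape is mechanical enough to express in code. Here is a sketch of that translation; the helper is hypothetical, but the field renames follow the two differences listed above.

```python
def to_responses_body(chat_body):
    """Rewrite a Chat Completions image request into the Responses shape:
    messages -> input, text -> input_text, image_url object -> input_image
    string, max_tokens -> max_output_tokens."""
    def convert(block):
        if block["type"] == "text":
            return {"type": "input_text", "text": block["text"]}
        if block["type"] == "image_url":
            return {"type": "input_image", "image_url": block["image_url"]["url"]}
        return block  # pass through anything we do not recognize

    return {
        "model": chat_body["model"],
        "input": [
            {"role": m["role"], "content": [convert(b) for b in m["content"]]}
            for m in chat_body["messages"]
        ],
        "max_output_tokens": chat_body.get("max_tokens", 300),
    }
```

A translator like this is handy if you want one internal request format while exposing both OpenAI dialects.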


Option C: Claude API

If your existing client is built around Anthropic SDKs, Claude Code-style clients, or a Claude-compatible gateway, the image block changes again. This is not image_url. It uses type: "image" plus source.

curl https://api.xairouter.com/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: $XAI_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "gpt-5.4",
    "max_tokens": 300,
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image",
            "source": {
              "type": "url",
              "url": "https://filelist.cn/disk/0/1.jpg"
            }
          },
          {
            "type": "text",
            "text": "Analyze this image and return 3 points: 1. main subject; 2. important details; 3. likely context."
          }
        ]
      }
    ]
  }'

If you use an official Anthropic SDK, you usually only need to point baseURL or base_url to https://api.xairouter.com and change the model name to the target model you want, such as gpt-5.4 or gpt-5.4-mini.
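The same call can also be made from Python with no SDK at all. This standard-library sketch mirrors the curl example above; the response field path assumes the standard Claude Messages response shape, and the request is only sent when XAI_API_KEY is actually set.

```python
import json
import os
import urllib.request

# Python equivalent of the curl call above, using only the standard library.
body = {
    "model": "gpt-5.4",
    "max_tokens": 300,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "url", "url": "https://filelist.cn/disk/0/1.jpg"}},
            {"type": "text", "text": "Analyze this image and return 3 points: 1. main subject; 2. important details; 3. likely context."},
        ],
    }],
}
headers = {
    "Content-Type": "application/json",
    "x-api-key": os.environ.get("XAI_API_KEY", ""),
    "anthropic-version": "2023-06-01",
}

if headers["x-api-key"]:  # only send the request when a key is configured
    req = urllib.request.Request(
        "https://api.xairouter.com/v1/messages",
        data=json.dumps(body).encode(),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["content"][0]["text"])
```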


Does it support multiple images?

Yes. The key difference is not whether multiple images are supported, but how each API appends more image blocks into the array:

  • Chat API: add more image_url blocks inside content
  • Responses API: add more input_image blocks inside content
  • Claude API: add more image + source blocks inside content

The most useful multi-image patterns are usually not "show the model more images" in the abstract. They are more concrete tasks such as:

  • comparing two screenshots
  • checking consistency across product photos
  • summarizing content across multiple photographed pages
  • describing each image first, then giving a combined conclusion

Here are three minimal working examples.

Multi-image example A: Chat API

curl https://api.xairouter.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -d '{
    "model": "gpt-5.4",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe these two images separately, then summarize the most obvious differences between them."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://filelist.cn/disk/0/1.jpg"
            }
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://filelist.cn/disk/0/2.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 400
  }'

Multi-image example B: Responses API

curl https://api.xairouter.com/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -d '{
    "model": "gpt-5.4",
    "input": [
      {
        "role": "user",
        "content": [
          {
            "type": "input_text",
            "text": "Describe these two images separately, then summarize the most obvious differences between them."
          },
          {
            "type": "input_image",
            "image_url": "https://filelist.cn/disk/0/1.jpg"
          },
          {
            "type": "input_image",
            "image_url": "https://filelist.cn/disk/0/2.jpg"
          }
        ]
      }
    ],
    "max_output_tokens": 400
  }'

Multi-image example C: Claude API

curl https://api.xairouter.com/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: $XAI_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "gpt-5.4",
    "max_tokens": 400,
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image",
            "source": {
              "type": "url",
              "url": "https://filelist.cn/disk/0/1.jpg"
            }
          },
          {
            "type": "image",
            "source": {
              "type": "url",
              "url": "https://filelist.cn/disk/0/2.jpg"
            }
          },
          {
            "type": "text",
            "text": "Describe these two images separately, then summarize the most obvious differences between them."
          }
        ]
      }
    ]
  }'

Three practical tips for multi-image input

  1. Do not just say "look at these images". Ask for a structure such as "describe each image first, then give one combined conclusion".
  2. As you add more images, input tokens, latency, and cost all go up.
  3. If you use the Claude-compatible request shape, put image blocks first and the text instruction after them.
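Tips 1 and 3 can be folded into one small helper for the Claude-compatible shape. This is a sketch with a hypothetical function name, not part of any SDK.

```python
def multi_image_content(image_urls, task):
    """Claude-style content: image blocks first (tip 3), then one structured
    instruction (tip 1) asking for per-image descriptions plus a conclusion."""
    instruction = (
        f"{task}\n"
        "Describe each image in order (Image 1, Image 2, ...), "
        "then give one combined conclusion."
    )
    blocks = [{"type": "image", "source": {"type": "url", "url": u}} for u in image_urls]
    blocks.append({"type": "text", "text": instruction})
    return blocks

content = multi_image_content(
    ["https://filelist.cn/disk/0/1.jpg", "https://filelist.cn/disk/0/2.jpg"],
    "Compare these two screenshots.",
)
```

The returned list drops straight into the "content" field of a /v1/messages request.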

Which one should you choose?

  • Choose Chat API if you want the smallest possible change for an existing OpenAI-compatible client
  • Choose Responses API if you are building a newer multimodal or tool-using workflow
  • Choose Claude API if you already have Anthropic / Claude-compatible clients or gateways

The important question is not which one is more "advanced". The important question is which request dialect your client already speaks. Let XAI Router absorb as much compatibility work as possible instead of forcing your application layer to rewrite everything.


Common pitfalls

1) Image field names are not interchangeable

  • Chat API uses an image_url object, not input_image
  • Responses API uses an input_image block with a string image_url, not an image_url object
  • Claude API uses image plus source, not image_url

That is the most common reason people get parameter errors even though they think they are "just sending an image".

2) Prefer image URLs the model can access directly

If your image already lives in object storage, a CDN, or another public URL, using the URL is the simplest path and the closest to the official examples. Once the end-to-end path works, you can add Base64, uploads, or file reuse later.
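When you do move to Base64, the field structure diverges per API once again. Here is a sketch of the three shapes; the bytes are placeholders, and the field names follow the official OpenAI and Anthropic docs (data URLs for the OpenAI dialects, a base64 source block for Claude).

```python
import base64

raw = b"...jpeg bytes..."  # in practice: open("photo.jpg", "rb").read()
b64 = base64.b64encode(raw).decode()
data_url = f"data:image/jpeg;base64,{b64}"

chat_block = {"type": "image_url", "image_url": {"url": data_url}}  # Chat API: data URL
responses_block = {"type": "input_image", "image_url": data_url}    # Responses API: data URL
claude_block = {                                                    # Claude API: base64 source block
    "type": "image",
    "source": {"type": "base64", "media_type": "image/jpeg", "data": b64},
}
```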

3) Be more specific than "analyze this"

For image understanding, the model changes its level of detail based on the output format you ask for. If you want more stable results, specify a structure such as "subject / details / context / risks / OCR text".


Your current example is already valid

If you already have something like this:

curl https://api.xairouter.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -d '{
    "model": "gpt-5.4-mini",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Analyze this"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://filelist.cn/disk/0/1.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 300
  }'

then you are already on the right path. Keeping gpt-5.4-mini is fine. If you want a stronger model, changing the model value to gpt-5.4 works as well. The next decision is not about locking yourself to one model name. It is about whether you want to stay on Chat API, standardize on Responses API, or expose /v1/messages for Anthropic / Claude-compatible clients.
