XAI Router Helps You Stop Worrying About Running Out of Codex Tokens
Posted March 29, 2026 by XAI Technical Team · 11 min read
If you have been using Codex recently, you probably understand one very real kind of anxiety:
Why does it look like I have to burn another huge chunk of prompt tokens every round, even when I am just continuing the same coding task?
That is not your imagination.
In a real Codex workflow, the request prefix often contains a lot of repeated content:
- Long system / developer instructions
- Tool definitions and tool schemas
- Repository context and session state
- Constraints, style rules, and task rules that keep repeating within the same job
If those repeated prefixes can hit cache consistently, the user experience stops feeling like "pay full price again every round" and starts feeling like a lower-latency, lower-cost, more sustainable coding workflow.
That is exactly what OpenAI's official Prompt Caching is meant to solve, and it is also what XAI Router + Codex-Cloud have been investing in heavily.
After production rollout and validation, real traffic has shown that, whether the physical transport is WebSocket or HTTP, Codex-like requests can sustain a cache-hit share above 90% as long as the session and prefix stay stable. The most direct user impact is simple: it dramatically reduces the anxiety of "running out of tokens."
Start With the Bottom Line: We Are Not "Faking Cache", We Are Making Official Cache Actually Hit
The most important sentence comes first:
XAI Router did not invent a new cache rule set that differs from OpenAI.
What we did is remove, as much as possible, the engineering factors that normally break official Prompt Caching in a multi-key gateway + compatibility-layer + long-context coding setup, so that the official cache can do its job.
That is why our core principles are very explicit:
- Stay aligned with OpenAI's official technical semantics as much as possible
- Avoid breaking the pass-through nature of native Codex requests
- Only add stabilization where it truly affects cache hit behavior
What OpenAI Official Prompt Caching Is Actually Doing
According to OpenAI's official Prompt Caching docs, the core rules are not complicated, but they are strict:
- Only an exactly identical prompt prefix can qualify for a cache hit
- Static content should come first, dynamic content should come later
- Images, tool definitions, and structured output schemas are all part of the cacheable prefix
- The request only enters the meaningful cache window once it reaches 1024+ tokens
- `prompt_cache_key` can help route requests with shared prefixes to machines that are more likely to already hold the cache
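To make these rules concrete, here is a minimal sketch of a cache-friendly request payload: static content first, the dynamic turn last, plus a per-session `prompt_cache_key`. The field names follow the OpenAI Responses API as described above; the instruction text, session id, and helper function are invented for illustration.

```python
import json

def build_payload(session_id: str, static_instructions: str, tools: list, user_turn: str) -> dict:
    """Order the prompt so the stable parts form an identical prefix every round."""
    return {
        "model": "gpt-5.4",                       # model name from the configs later in this post
        "instructions": static_instructions,       # static: identical every round -> cacheable prefix
        "tools": tools,                            # static: tool schemas are part of the cacheable prefix
        "input": user_turn,                        # dynamic: changes every round, so it comes last
        "prompt_cache_key": f"codex-{session_id}", # stable per session: helps route to a warm machine
    }

# Two rounds of the same session share everything except the dynamic tail.
tools = [{"type": "function", "name": "read_file", "parameters": {"type": "object"}}]
r1 = build_payload("s-42", "You are a careful coding agent.", tools, "Add a unit test.")
r2 = build_payload("s-42", "You are a careful coding agent.", tools, "Now fix the bug.")

def prefix(p: dict) -> str:
    """Serialize only the static, prefix-forming fields."""
    return json.dumps({k: p[k] for k in ("model", "instructions", "tools", "prompt_cache_key")}, sort_keys=True)

assert prefix(r1) == prefix(r2)  # byte-identical static prefix across rounds
```

The point of the sketch is the ordering discipline, not the specific fields: anything that varies per round must come after everything that does not.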
In other words, Prompt Caching is not "some model happens to be cheaper", and it is not "the platform guesses semantically similar requests."
It is closer to this:
Give me a stable prefix, and I will give you a stable cache hit.
OpenAI's docs also make several key points very clear:
- Prompt Caching can reduce latency by up to 80% and input token cost by up to 90%
- Cache hits are driven by prefix stability, routing stability, and request bucketing stability
- `prompt_cache_key` and prefix hashing both influence routing
- Cache does not change the final model output; it caches prompt prefilling, not "reusing the last answer"
- Prompt cache does not cross organizations
Those details matter because they explain exactly why many proxy setups that "look like OpenAI-compatible access" still end up with underwhelming cache rates.
Why Many Ordinary Proxy Setups Do Not Achieve Great Cache Rates
In the ideal case, official single-key direct access usually delivers the best experience:
- The same session naturally stays within the same account scope
- Prefix changes are easier to control
- Session continuity is naturally stronger
But real production products often do not look like that.
Many users actually sit behind something like:
Codex / Chat API / compatibility client -> multi-key gateway -> upstream model service
If the middle layer is not handled carefully, several common problems appear.
1. The Same Session Drifts Across Different Keys
If a multi-key gateway routes only by "which user is this" rather than "which session is this", then different turns of the same coding task may fall onto different downstream keys.
That causes:
- The upstream no longer sees a continuous session
- Prompt Caching becomes much harder to accumulate
- The same long prefix gets prefilling work repeated again and again
2. The Compatibility Layer Disturbs the Prompt Prefix
Some compatibility layers inject long bridge prompts, unrelated developer prompts, or unstable tool-remapping text.
That may make the protocol "work", but it directly breaks the thing Prompt Caching cares about most:
prefix stability
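A tiny sketch makes the damage visible. If a compatibility layer injects even a short per-request bridge prompt at the front (the bridge text and request ids below are invented), the shared prefix between two rounds collapses to almost nothing:

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two serialized prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

SYSTEM = "You are a coding agent. " * 50  # stands in for a long, stable system prefix

def serialize(turn: str, bridge: str = "") -> str:
    # A compatibility layer that prepends per-request text (timestamps,
    # request ids, remapping notes) pollutes the very front of the prompt.
    return bridge + SYSTEM + turn

clean = shared_prefix_len(serialize("turn 1"), serialize("turn 2"))
polluted = shared_prefix_len(serialize("turn 1", "[req 0001] "),
                             serialize("turn 2", "[req 0002] "))

assert clean == len(SYSTEM) + len("turn ")  # the whole stable prefix survives
assert polluted < 20                        # divergence happens almost immediately
```

Because cache eligibility requires an exactly identical prefix, the polluted case loses the entire 1,200-character stable region, not just the injected bytes.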
3. The Proxy Over-Rewrites Request Semantics
Some solutions also aggressively rewrite fields such as:
- `store` / `stream` / `instructions`
- session / conversation semantics
That kind of setup can sometimes make cache metrics look nicer, but the tradeoff is:
what you get is no longer a proxy that stays as close as possible to official native semantics
What XAI Router + Codex-Cloud Actually Changed
Our goal was not to rewrite everything.
It was this:
keep official request semantics as intact as possible, while fixing the specific engineering problems that actually block cache hits
1. XAI Router: Upgrade From User-Level Stickiness to Session-Level Stickiness
At the routing layer, XAI Router no longer only cares about "who is this user". It also tries to identify:
- `session_id`
- `conversation_id`
- `prompt_cache_key`
Then it applies session-aware stickiness on OpenAI-style paths such as /v1/responses, /responses, /v1/chat/completions, and /chat/completions.
That means:
- Different sessions from the same user can still spread reasonably across different keys
- But requests inside the same session try to stay on the same key as consistently as possible
That is an important shift.
Many older gateways were effectively doing this:
"keep one user on one key as long as possible"
We are much closer to this:
"keep one session on one key as long as possible"
For Codex workflows, the second model matters much more.
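The shift is easy to express in code. Here is a minimal sketch of session-level stickiness via a deterministic hash (the key names and anchor strings are invented; the real router's selection logic is not specified in this post):

```python
import hashlib

KEYS = ["key-a", "key-b", "key-c"]  # hypothetical downstream API keys

def pick_key(anchor: str) -> str:
    """Deterministically map a stable anchor string to one downstream key."""
    digest = hashlib.sha256(anchor.encode()).digest()
    return KEYS[int.from_bytes(digest[:8], "big") % len(KEYS)]

# Session-level stickiness: every turn of the same session lands on the same
# key, while different sessions from the same user may spread across keys.
turns = {pick_key("session-1") for _ in range(100)}
assert len(turns) == 1                       # one session, one key, every turn
assert pick_key("session-1") in KEYS
```

The anchor is the session, not the user: that single change is what keeps one coding task's long prefix accumulating cache on one upstream key.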
2. YAI-Sticky-Session: An Internal Session Anchor That Does Not Pollute Upstream Protocols
To make the HTTP path behave more like the continuity you naturally get from WebSocket, we introduced one internal-only header between XAI Router and Codex-Cloud:
YAI-Sticky-Session
Its job is deliberately narrow:
- It acts only as an internal stable session anchor
- It is not exposed to end users
- It is not forwarded to the real upstream
That gives us several benefits:
- The HTTP path can still pass a stable session signal to the next layer
- Codex-Cloud does not need to rewrite the body just to reconstruct session continuity
- It keeps a good balance between "stronger stability" and "clean protocol boundaries"
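The contract is small enough to sketch directly: the routing layer reads the internal anchor, then strips it before anything is forwarded upstream. The header name comes from this post; the helper function and sample headers are invented for illustration.

```python
INTERNAL_HEADER = "YAI-Sticky-Session"

def prepare_upstream_headers(incoming: dict) -> dict:
    """Consume the internal session anchor without leaking it upstream."""
    return {k: v for k, v in incoming.items() if k.lower() != INTERNAL_HEADER.lower()}

incoming = {
    "Authorization": "Bearer sk-...",
    "Content-Type": "application/json",
    "YAI-Sticky-Session": "session-1",  # internal anchor set by XAI Router
}
anchor = incoming.get(INTERNAL_HEADER)         # the routing layer reads it...
upstream = prepare_upstream_headers(incoming)  # ...but the upstream never sees it

assert anchor == "session-1"
assert INTERNAL_HEADER not in upstream
assert "Authorization" in upstream             # everything else passes through untouched
```

Keeping the anchor in a header, never in the body, is what lets the request body stay byte-for-byte passthrough.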
3. Codex-Cloud: Native /responses Stays Close to Zero-Rewrite, Compatibility Entrypoints Get the Minimum Necessary Stabilization
This is one of the design choices we care about the most.
For the native Responses API, Codex-Cloud still tries to preserve:
- body passthrough
- stream forwarding
- no proactive rewriting of `store` / `stream` / `instructions`
In other words, native /responses should still look as close as possible to the original official request.
At the same time, we do add reinforcement on the parts that genuinely affect cache behavior:
- header-level identity hardening
- consuming `YAI-Sticky-Session` when `session_id` / `conversation_id` is missing
- deriving a stable `prompt_cache_key` for compatibility entrypoints
- reducing prefix pollution from bridge prompts in tool compatibility layers
- passing through `prompt_cache_retention`
The one-sentence summary is:
native paths pass through as much as possible; compatibility paths are made as stable as possible
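For the "deriving a stable `prompt_cache_key`" step, any deterministic function of stable, per-session inputs has the desired effect. The post does not specify Codex-Cloud's actual derivation, so the sketch below is an assumption: a hash of the session anchor plus the model name.

```python
import hashlib

def derive_prompt_cache_key(session_anchor: str, model: str) -> str:
    """Derive a stable prompt_cache_key from stable identity components.

    Hypothetical derivation: the real one inside Codex-Cloud is not
    specified in this post. What matters is that the output is identical
    every round of the same session.
    """
    raw = f"{session_anchor}:{model}".encode()
    return "pck-" + hashlib.sha256(raw).hexdigest()[:16]

# Same session anchor + model -> the same key every round.
assert derive_prompt_cache_key("session-1", "gpt-5.4") == derive_prompt_cache_key("session-1", "gpt-5.4")
# Different sessions get different keys, so they can warm different machines.
assert derive_prompt_cache_key("session-1", "gpt-5.4") != derive_prompt_cache_key("session-2", "gpt-5.4")
```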
Why Both WebSocket and HTTP Can Now Reach Very High Cache Rates
When people first see our validation data, the first reaction is often:
Does WebSocket just hit cache more easily than HTTP by nature?
The answer is:
not because the protocol itself is magical, but because stable sessions are easier to maintain
WebSocket has natural advantages because:
- A long-lived connection naturally preserves session continuity
- A single connection is easier to keep pinned to the same key
- `previous_response_id` and incremental input feel more natural inside one continuous flow
But that does not mean HTTP cannot do it.
If you also get the following parts right:
- routing stays stable per session
- upstream session / conversation semantics stay stable
- `prompt_cache_key` stays stable
- the prefix itself stays stable
then HTTP can achieve very high cache rates too.
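Those conditions can be self-checked between any two consecutive rounds. The sketch below compares the prefix-relevant fields of two request payloads (field names per the Responses API; the sample payloads are invented):

```python
def stability_report(prev: dict, curr: dict) -> dict:
    """Report which cache-relevant stability conditions hold between two rounds."""
    return {
        "same_cache_key": prev.get("prompt_cache_key") == curr.get("prompt_cache_key"),
        "same_instructions": prev.get("instructions") == curr.get("instructions"),
        "same_tools": prev.get("tools") == curr.get("tools"),
    }

prev = {"prompt_cache_key": "pck-1", "instructions": "rules...", "tools": [], "input": "turn 1"}
curr = {"prompt_cache_key": "pck-1", "instructions": "rules...", "tools": [], "input": "turn 2"}

report = stability_report(prev, curr)
assert all(report.values())  # everything prefix-relevant is stable; only the input changed
```

If any entry in the report flips to `False` between rounds, that is the first place to look when cache-hit rates disappoint.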
That is exactly what we have observed after launch:
both WS and HTTP can push cache-hit share above 90% in long-context Codex workflows
That is why we prefer to describe XAI Router this way:
inside a multi-key architecture, it tries to approximate the experience of official single-key direct access
What Makes XAI Router Better Than Other Approaches
This table summarizes the core differences:
| Approach | Cache Friendliness | Native Semantics Fidelity | Multi-Key Stability | HTTP / WS Consistency |
|---|---|---|---|---|
| Official single-key direct access | High | Highest | No multi-key elasticity | High |
| Ordinary multi-key proxy | Often mediocre | Medium | Sessions drift easily | Often unstable |
| Aggressive rewrite proxy | Potentially high | Lower | Medium to high | High, but intrusive |
| XAI Router + Codex-Cloud | High | High | High | High |
Our advantage is not that we rewrite requests into something even more "optimized" than the official protocol.
It is that we:
- avoid breaking the native Codex experience
- let OpenAI official Prompt Caching actually do its job
- support both HTTP and WebSocket cleanly
That matters if you are using Codex seriously over long periods.
Because what you need is not one beautiful cache number on one lucky request.
You need this:
the whole workflow stays stable
https://api.xairouter.com: Stay Aligned With Official Technical Semantics While Supporting Both WS and HTTP
Today, through https://api.xairouter.com, you can use flows that stay very close to the official OpenAI style:
- `POST /v1/responses`
- `POST /v1/chat/completions`
- `wss://api.xairouter.com/v1/responses`
For WebSocket mode, we keep the key OpenAI WebSocket Mode semantics aligned:
- multiple `response.create` calls are allowed on a single connection
- execution is ordered within the same connection
- only one response can be in flight at a time
- no multiplexing
For users, that means:
- You can keep using the HTTP workflow you already know
- You can switch to WebSocket when you want stronger continuity in longer sessions
- Both paths can still deliver very high cache-hit behavior
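Those WebSocket-mode rules can be modeled without any network I/O. The toy class below (entirely illustrative, no real transport) enforces the semantics listed above: many `response.create` calls per connection, strictly in order, with only one response in flight:

```python
class ResponsesConnection:
    """Toy model of the WebSocket-mode rules described above: multiple
    response.create calls per connection, ordered execution, only one
    response in flight at a time, and no multiplexing."""

    def __init__(self) -> None:
        self.in_flight = False
        self.completed: list[str] = []

    def response_create(self, payload: dict) -> None:
        if self.in_flight:
            raise RuntimeError("only one response may be in flight at a time")
        self.in_flight = True
        # ... in a real client, stream events here until response.completed ...
        self.in_flight = False
        self.completed.append(payload["input"])

conn = ResponsesConnection()
for turn in ["turn 1", "turn 2", "turn 3"]:
    conn.response_create({"input": turn})  # sequential calls on one connection

assert conn.completed == ["turn 1", "turn 2", "turn 3"]  # strict ordering preserved
```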
Two Codex CLI Configs You Can Use Right Now
First, prepare your environment variable:
```shell
export XAI_API_KEY="your XAI Router API Key"
```
Option 1: WebSocket-Oriented Config
If you prefer stronger long-session continuity, more tool-call rounds, and a workflow that stays closer to the official WebSocket pattern, use this:
```toml
model_provider = "xai"
model = "gpt-5.4"
model_reasoning_effort = "xhigh"
plan_mode_reasoning_effort = "xhigh"
model_reasoning_summary = "none"
model_verbosity = "medium"
model_context_window = 1050000
model_auto_compact_token_limit = 945000
tool_output_token_limit = 6000
approval_policy = "never"
sandbox_mode = "danger-full-access"
suppress_unstable_features_warning = true

[model_providers.xai]
name = "OpenAI"
base_url = "https://api.xairouter.com"
wire_api = "responses"
requires_openai_auth = false
env_key = "XAI_API_KEY"
supports_websockets = true

[features]
responses_websockets_v2 = true
```
Option 2: HTTP-Oriented Config
If you prefer the simplest possible setup and wider environment compatibility, use this:
```toml
model_provider = "xai"
model = "gpt-5.4"
model_reasoning_effort = "xhigh"
plan_mode_reasoning_effort = "xhigh"
model_reasoning_summary = "none"
model_verbosity = "medium"
model_context_window = 1050000
model_auto_compact_token_limit = 945000
tool_output_token_limit = 6000
approval_policy = "never"
sandbox_mode = "danger-full-access"

[model_providers.xai]
name = "OpenAI"
base_url = "https://api.xairouter.com"
wire_api = "responses"
requires_openai_auth = false
env_key = "XAI_API_KEY"
```
You can save either one as `~/.codex/config.toml` and start using Codex after restarting it. If your config uses `env_key = "XAI_API_KEY"`, you still need to define that environment variable in the OS: on Linux, `~/.bashrc` is usually the right place; on macOS, prefer `~/.zshrc`; on Windows, use a user environment variable and reopen the terminal. On some older macOS setups, legacy terminals, or IDE sessions that still inherit a bash login environment, also mirror the variable into `~/.bash_profile`, and into `~/.bashrc` if needed, so Codex can actually read it at launch.
What This Means Most Directly for Users: Less Token Anxiety, More Sustained Coding
What many people are actually anxious about is not "can the model answer this?"
It is:
- long-context sessions repeatedly charging for the same prefix
- prompt overhead exploding as tool chains get longer
- worrying more and more about remaining token budget as the session grows
Once cache starts hitting consistently, the experience changes in ways users feel immediately:
- latency drops noticeably from the second round onward
- repeated prefixes stop getting billed at full cost over and over
- long tasks feel safer to keep going because you are no longer constantly worrying about "how many tokens are left?"
That is what the title means:
"stop worrying about running out of Codex tokens."
That is not marketing language.
It is a very concrete engineering outcome.
Start Using It Now
If you want to:
- stay highly aligned with OpenAI's official technical semantics
- keep the stability and elasticity of a multi-key gateway
- capture as much Prompt Caching benefit as possible in both WS and HTTP Codex workflows
then you can point Codex directly at:
https://api.xairouter.com
If you want to register or start a monthly subscription, you can go directly here:
XAI Router Monthly Subscription / Signup