XAI Router Helps You Stop Worrying About Running Out of Codex Tokens
Posted March 29, 2026 by XAI Technical Team · 11 min read
If you have been using Codex recently, you probably understand one very real kind of anxiety:
Why does it look like I have to burn another huge chunk of prompt tokens every round, even when I am just continuing the same coding task?
That is not your imagination.
In a real Codex workflow, the request prefix often contains a lot of repeated content:
- Long system / developer instructions
- Tool definitions and tool schemas
- Repository context and session state
- Constraints, style rules, and task rules that keep repeating within the same job
If those repeated prefixes can hit cache consistently, the user experience stops feeling like "pay full price again every round" and starts feeling like a lower-latency, lower-cost, more sustainable coding workflow.
That is exactly what OpenAI's official Prompt Caching is meant to solve, and it is also what XAI Router + Codex-Cloud have been investing in heavily.
After production rollout and validation, real traffic has shown that, whether the physical transport is WebSocket or HTTP, Codex-like requests can sustain a cache-hit share above 90% as long as the session and prefix stay stable. The most direct user impact is simple: it dramatically reduces the anxiety of "running out of tokens."
Start With the Bottom Line: We Are Not "Faking Cache", We Are Making Official Cache Actually Hit
The most important sentence comes first:
XAI Router did not invent a new cache rule set that differs from OpenAI.
What we did is remove, as much as possible, the engineering factors that normally break official Prompt Caching in a multi-key gateway + compatibility-layer + long-context coding setup, so that the official cache can do its job.
That is why our core principles are very explicit:
- Stay aligned with OpenAI's official technical semantics as much as possible
- Avoid breaking the pass-through nature of native Codex requests
- Only add stabilization where it truly affects cache hit behavior
What OpenAI Official Prompt Caching Is Actually Doing
According to OpenAI's official Prompt Caching docs, the core rules are not complicated, but they are strict:
- Only an exactly identical prompt prefix can qualify for a cache hit
- Static content should come first, dynamic content should come later
- Images, tool definitions, and structured output schemas are all part of the cacheable prefix
- The request only enters the meaningful cache window once it reaches 1024+ tokens
- `prompt_cache_key` can help route requests with shared prefixes to machines that are more likely to already hold the cache
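To make these rules concrete, here is a minimal sketch of a cache-friendly request payload: static content first, the dynamic turn last, plus a per-session `prompt_cache_key`. The field names follow the OpenAI Responses API as described above; the instruction text, session id, and helper function are invented for illustration.

```python
import json

def build_payload(session_id: str, static_instructions: str, tools: list, user_turn: str) -> dict:
    """Order the prompt so the stable parts form an identical prefix every round."""
    return {
        "model": "gpt-5.4",                       # model name from the configs later in this post
        "instructions": static_instructions,       # static: identical every round -> cacheable prefix
        "tools": tools,                            # static: tool schemas are part of the cacheable prefix
        "input": user_turn,                        # dynamic: changes every round, so it comes last
        "prompt_cache_key": f"codex-{session_id}", # stable per session: helps route to a warm machine
    }

# Two rounds of the same session share everything except the dynamic tail.
tools = [{"type": "function", "name": "read_file", "parameters": {"type": "object"}}]
r1 = build_payload("s-42", "You are a careful coding agent.", tools, "Add a unit test.")
r2 = build_payload("s-42", "You are a careful coding agent.", tools, "Now fix the bug.")

def prefix(p: dict) -> str:
    """Serialize only the static, prefix-forming fields."""
    return json.dumps({k: p[k] for k in ("model", "instructions", "tools", "prompt_cache_key")}, sort_keys=True)

assert prefix(r1) == prefix(r2)  # byte-identical static prefix across rounds
```

The point of the sketch is the ordering discipline, not the specific fields: anything that varies per round must come after everything that does not.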
In other words, Prompt Caching is not "some model happens to be cheaper", and it is not "the platform guesses semantically similar requests."
It is closer to this:
Give me a stable prefix, and I will give you a stable cache hit.
OpenAI's docs also make several key points very clear:
- Prompt Caching can reduce latency by up to 80% and input token cost by up to 90%
- Cache hits are driven by prefix stability, routing stability, and request bucketing stability
- `prompt_cache_key` and prefix hashing both influence routing
- Cache does not change the final model output; it caches prompt prefilling, not "reusing the last answer"
- Prompt cache does not cross organizations
Those details matter because they explain exactly why many proxy setups that "look like OpenAI-compatible access" still end up with underwhelming cache rates.
Why Many Ordinary Proxy Setups Do Not Achieve Great Cache Rates
In the ideal case, official single-key direct access usually delivers the best experience:
- The same session naturally stays within the same account scope
- Prefix changes are easier to control
- Session continuity is naturally stronger
But real production products often do not look like that.
Many users actually sit behind something like:
Codex / Chat API / compatibility client -> multi-key gateway -> upstream model service
If the middle layer is not handled carefully, several common problems appear.
1. The Same Session Drifts Across Different Keys
If a multi-key gateway routes only by "which user is this" rather than "which session is this", then different turns of the same coding task may fall onto different downstream keys.
That causes:
- The upstream no longer sees a continuous session
- Prompt Caching becomes much harder to accumulate
- The same long prefix gets prefilling work repeated again and again
2. The Compatibility Layer Disturbs the Prompt Prefix
Some compatibility layers inject long bridge prompts, unrelated developer prompts, or unstable tool-remapping text.
That may make the protocol "work", but it directly breaks the thing Prompt Caching cares about most:
prefix stability
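A tiny sketch makes the damage visible. If a compatibility layer injects even a short per-request bridge prompt at the front (the bridge text and request ids below are invented), the shared prefix between two rounds collapses to almost nothing:

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two serialized prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

SYSTEM = "You are a coding agent. " * 50  # stands in for a long, stable system prefix

def serialize(turn: str, bridge: str = "") -> str:
    # A compatibility layer that prepends per-request text (timestamps,
    # request ids, remapping notes) pollutes the very front of the prompt.
    return bridge + SYSTEM + turn

clean = shared_prefix_len(serialize("turn 1"), serialize("turn 2"))
polluted = shared_prefix_len(serialize("turn 1", "[req 0001] "),
                             serialize("turn 2", "[req 0002] "))

assert clean == len(SYSTEM) + len("turn ")  # the whole stable prefix survives
assert polluted < 20                        # divergence happens almost immediately
```

Because cache eligibility requires an exactly identical prefix, the polluted case loses the entire 1,200-character stable region, not just the injected bytes.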
3. The Proxy Over-Rewrites Request Semantics
Some solutions also aggressively rewrite fields such as:
- `store` / `stream` / `instructions`
- session / conversation semantics
That kind of setup can sometimes make cache metrics look nicer, but the tradeoff is:
what you get is no longer a proxy that stays as close as possible to official native semantics
What XAI Router + Codex-Cloud Actually Changed
Our goal was not to rewrite everything.
It was this:
keep official request semantics as intact as possible, while fixing the specific engineering problems that actually block cache hits
1. XAI Router: Upgrade From User-Level Stickiness to Session-Level Stickiness
At the routing layer, XAI Router no longer only cares about "who is this user". It also tries to identify:
- `session_id`
- `conversation_id`
- `prompt_cache_key`
Then it applies session-aware stickiness on OpenAI-style paths such as /v1/responses, /responses, /v1/chat/completions, and /chat/completions.
That means:
- Different sessions from the same user can still spread reasonably across different keys
- But requests inside the same session try to stay on the same key as consistently as possible
That is an important shift.
Many older gateways were effectively doing this:
"keep one user on one key as long as possible"
We are much closer to this:
"keep one session on one key as long as possible"
For Codex workflows, the second model matters much more.
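The shift is easy to express in code. Here is a minimal sketch of session-level stickiness via a deterministic hash (the key names and anchor strings are invented; the real router's selection logic is not specified in this post):

```python
import hashlib

KEYS = ["key-a", "key-b", "key-c"]  # hypothetical downstream API keys

def pick_key(anchor: str) -> str:
    """Deterministically map a stable anchor string to one downstream key."""
    digest = hashlib.sha256(anchor.encode()).digest()
    return KEYS[int.from_bytes(digest[:8], "big") % len(KEYS)]

# Session-level stickiness: every turn of the same session lands on the same
# key, while different sessions from the same user may spread across keys.
turns = {pick_key("session-1") for _ in range(100)}
assert len(turns) == 1                       # one session, one key, every turn
assert pick_key("session-1") in KEYS
```

The anchor is the session, not the user: that single change is what keeps one coding task's long prefix accumulating cache on one upstream key.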
2. YAI-Sticky-Session: An Internal Session Anchor That Does Not Pollute Upstream Protocols
To make the HTTP path behave more like the continuity you naturally get from WebSocket, we introduced one internal-only header between XAI Router and Codex-Cloud:
YAI-Sticky-Session
Its job is deliberately narrow:
- It acts only as an internal stable session anchor
- It is not exposed to end users
- It is not forwarded to the real upstream
That gives us several benefits:
- The HTTP path can still pass a stable session signal to the next layer
- Codex-Cloud does not need to rewrite the body just to reconstruct session continuity
- It keeps a good balance between "stronger stability" and "clean protocol boundaries"
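The contract is small enough to sketch directly: the routing layer reads the internal anchor, then strips it before anything is forwarded upstream. The header name comes from this post; the helper function and sample headers are invented for illustration.

```python
INTERNAL_HEADER = "YAI-Sticky-Session"

def prepare_upstream_headers(incoming: dict) -> dict:
    """Consume the internal session anchor without leaking it upstream."""
    return {k: v for k, v in incoming.items() if k.lower() != INTERNAL_HEADER.lower()}

incoming = {
    "Authorization": "Bearer sk-...",
    "Content-Type": "application/json",
    "YAI-Sticky-Session": "session-1",  # internal anchor set by XAI Router
}
anchor = incoming.get(INTERNAL_HEADER)         # the routing layer reads it...
upstream = prepare_upstream_headers(incoming)  # ...but the upstream never sees it

assert anchor == "session-1"
assert INTERNAL_HEADER not in upstream
assert "Authorization" in upstream             # everything else passes through untouched
```

Keeping the anchor in a header, never in the body, is what lets the request body stay byte-for-byte passthrough.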
3. Codex-Cloud: Native /responses Stays Close to Zero-Rewrite, Compatibility Entrypoints Get the Minimum Necessary Stabilization
This is one of the design choices we care about the most.
For the native Responses API, Codex-Cloud still tries to preserve:
- body passthrough
- stream forwarding
- no proactive rewriting of `store` / `stream` / `instructions`
In other words, native /responses should still look as close as possible to the original official request.
At the same time, we do add reinforcement on the parts that genuinely affect cache behavior:
- header-level identity hardening
- consuming `YAI-Sticky-Session` when `session_id` / `conversation_id` is missing
- deriving a stable `prompt_cache_key` for compatibility entrypoints
- reducing prefix pollution from bridge prompts in tool compatibility layers
- passing through `prompt_cache_retention`
The one-sentence summary is:
native paths pass through as much as possible; compatibility paths are made as stable as possible
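For the "deriving a stable `prompt_cache_key`" step, any deterministic function of stable, per-session inputs has the desired effect. The post does not specify Codex-Cloud's actual derivation, so the sketch below is an assumption: a hash of the session anchor plus the model name.

```python
import hashlib

def derive_prompt_cache_key(session_anchor: str, model: str) -> str:
    """Derive a stable prompt_cache_key from stable identity components.

    Hypothetical derivation: the real one inside Codex-Cloud is not
    specified in this post. What matters is that the output is identical
    every round of the same session.
    """
    raw = f"{session_anchor}:{model}".encode()
    return "pck-" + hashlib.sha256(raw).hexdigest()[:16]

# Same session anchor + model -> the same key every round.
assert derive_prompt_cache_key("session-1", "gpt-5.4") == derive_prompt_cache_key("session-1", "gpt-5.4")
# Different sessions get different keys, so they can warm different machines.
assert derive_prompt_cache_key("session-1", "gpt-5.4") != derive_prompt_cache_key("session-2", "gpt-5.4")
```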
Why Both WebSocket and HTTP Can Now Reach Very High Cache Rates
When people first see our validation data, the first reaction is often:
Does WebSocket just hit cache more easily than HTTP by nature?
The answer is:
not because the protocol itself is magical, but because stable sessions are easier to maintain
WebSocket has natural advantages because:
- A long-lived connection naturally preserves session continuity
- A single connection is easier to keep pinned to the same key
- `previous_response_id` and incremental input feel more natural inside one continuous flow
But that does not mean HTTP cannot do it.
If you also get the following parts right:
- routing stays stable per session
- upstream session / conversation semantics stay stable
- `prompt_cache_key` stays stable
- the prefix itself stays stable
then HTTP can achieve very high cache rates too.
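Those conditions can be self-checked between any two consecutive rounds. The sketch below compares the prefix-relevant fields of two request payloads (field names per the Responses API; the sample payloads are invented):

```python
def stability_report(prev: dict, curr: dict) -> dict:
    """Report which cache-relevant stability conditions hold between two rounds."""
    return {
        "same_cache_key": prev.get("prompt_cache_key") == curr.get("prompt_cache_key"),
        "same_instructions": prev.get("instructions") == curr.get("instructions"),
        "same_tools": prev.get("tools") == curr.get("tools"),
    }

prev = {"prompt_cache_key": "pck-1", "instructions": "rules...", "tools": [], "input": "turn 1"}
curr = {"prompt_cache_key": "pck-1", "instructions": "rules...", "tools": [], "input": "turn 2"}

report = stability_report(prev, curr)
assert all(report.values())  # everything prefix-relevant is stable; only the input changed
```

If any entry in the report flips to `False` between rounds, that is the first place to look when cache-hit rates disappoint.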
That is exactly what we have observed after launch:
both WS and HTTP can push cache-hit share above 90% in long-context Codex workflows
That is why we prefer to describe XAI Router this way:
inside a multi-key architecture, it tries to approximate the experience of official single-key direct access
What Makes XAI Router Better Than Other Approaches
This table summarizes the core differences:
| Approach | Cache Friendliness | Native Semantics Fidelity | Multi-Key Stability | HTTP / WS Consistency |
|---|---|---|---|---|
| Official single-key direct access | High | Highest | No multi-key elasticity | High |
| Ordinary multi-key proxy | Often mediocre | Medium | Sessions drift easily | Often unstable |
| Aggressive rewrite proxy | Potentially high | Lower | Medium to high | High, but intrusive |
| XAI Router + Codex-Cloud | High | High | High | High |
Our advantage is not that we rewrite requests into something even more "optimized" than the official protocol.
It is that we:
- avoid breaking the native Codex experience
- let OpenAI official Prompt Caching actually do its job
- support both HTTP and WebSocket cleanly
That matters if you are using Codex seriously over long periods.
Because what you need is not one beautiful cache number on one lucky request.
You need this:
the whole workflow stays stable
https://api.xairouter.com: Stay Aligned With Official Technical Semantics While Supporting Both WS and HTTP
Today, through https://api.xairouter.com, you can use flows that stay very close to the official OpenAI style:
- `POST /v1/responses`
- `POST /v1/chat/completions`
- `wss://api.xairouter.com/v1/responses`
For WebSocket mode, we keep the key OpenAI WebSocket Mode semantics aligned:
- multiple `response.create` calls are allowed on a single connection
- execution is ordered within the same connection
- only one response can be in flight at a time
- no multiplexing
For users, that means:
- You can keep using the HTTP workflow you already know
- You can switch to WebSocket when you want stronger continuity in longer sessions
- Both paths can still deliver very high cache-hit behavior
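Those WebSocket-mode rules can be modeled without any network I/O. The toy class below (entirely illustrative, no real transport) enforces the semantics listed above: many `response.create` calls per connection, strictly in order, with only one response in flight:

```python
class ResponsesConnection:
    """Toy model of the WebSocket-mode rules described above: multiple
    response.create calls per connection, ordered execution, only one
    response in flight at a time, and no multiplexing."""

    def __init__(self) -> None:
        self.in_flight = False
        self.completed: list[str] = []

    def response_create(self, payload: dict) -> None:
        if self.in_flight:
            raise RuntimeError("only one response may be in flight at a time")
        self.in_flight = True
        # ... in a real client, stream events here until response.completed ...
        self.in_flight = False
        self.completed.append(payload["input"])

conn = ResponsesConnection()
for turn in ["turn 1", "turn 2", "turn 3"]:
    conn.response_create({"input": turn})  # sequential calls on one connection

assert conn.completed == ["turn 1", "turn 2", "turn 3"]  # strict ordering preserved
```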
Two Codex CLI Configs You Can Use Right Now
First, prepare your environment variable:
```shell
export XAI_API_KEY="your XAI Router API Key"
```
Option 1: WebSocket-Oriented Config
If you prefer stronger long-session continuity, more tool-call rounds, and a workflow that stays closer to the official WebSocket pattern, use this:
```toml
model_provider = "xai"
model = "gpt-5.4"
model_reasoning_effort = "xhigh"
plan_mode_reasoning_effort = "xhigh"
model_reasoning_summary = "none"
model_verbosity = "medium"
model_context_window = 1050000
model_auto_compact_token_limit = 945000
tool_output_token_limit = 6000
approval_policy = "never"
sandbox_mode = "danger-full-access"
suppress_unstable_features_warning = true

[model_providers.xai]
name = "OpenAI"
base_url = "https://api.xairouter.com"
wire_api = "responses"
requires_openai_auth = false
env_key = "XAI_API_KEY"
supports_websockets = true

[features]
responses_websockets_v2 = true
```
Option 2: HTTP-Oriented Config
If you prefer the simplest possible setup and wider environment compatibility, use this:
```toml
model_provider = "xai"
model = "gpt-5.4"
model_reasoning_effort = "xhigh"
plan_mode_reasoning_effort = "xhigh"
model_reasoning_summary = "none"
model_verbosity = "medium"
model_context_window = 1050000
model_auto_compact_token_limit = 945000
tool_output_token_limit = 6000
approval_policy = "never"
sandbox_mode = "danger-full-access"

[model_providers.xai]
name = "OpenAI"
base_url = "https://api.xairouter.com"
wire_api = "responses"
requires_openai_auth = false
env_key = "XAI_API_KEY"
```
You can save either one as `~/.codex/config.toml` and start using Codex after restarting it. If your config uses `env_key = "XAI_API_KEY"`, you still need to define that environment variable in the OS: on Linux, `~/.bashrc` is usually the right place; on macOS, prefer `~/.zshrc`; on Windows, use a user environment variable and reopen the terminal. On some older macOS setups, legacy terminals, or IDE sessions that still inherit a bash login environment, also mirror the variable into `~/.bash_profile`, and into `~/.bashrc` if needed, so Codex can actually read it at launch.
What This Means Most Directly for Users: Less Token Anxiety, More Sustained Coding
What many people are actually anxious about is not "can the model answer this?"
It is:
- long-context sessions repeatedly charging for the same prefix
- prompt overhead exploding as tool chains get longer
- worrying more and more about remaining token budget as the session grows
Once cache starts hitting consistently, the experience changes in ways users feel immediately:
- latency drops noticeably from the second round onward
- repeated prefixes stop getting billed at full cost over and over
- long tasks feel safer to keep going because you are no longer constantly worrying about "how many tokens are left?"
That is what the title means:
"stop worrying about running out of Codex tokens."
That is not marketing language.
It is a very concrete engineering outcome.
Start Using It Now
If you want to:
- stay highly aligned with OpenAI's official technical semantics
- keep the stability and elasticity of a multi-key gateway
- capture as much Prompt Caching benefit as possible in both WS and HTTP Codex workflows
then you can point Codex directly at:
https://api.xairouter.com
If you want to register or start a monthly subscription, you can go directly here:
XAI Router Monthly Subscription / Signup