<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>CheapestInference | Blog</title><description/><link>https://docs.cheapestinference.com/</link><language>en</language><item><title>Qwen 3.5 vs GPT-5.4 vs Claude Opus 4.6 — same quality, fraction of the price</title><link>https://docs.cheapestinference.com/blog/qwen-3-5-vs-gpt-claude/</link><guid isPermaLink="true">https://docs.cheapestinference.com/blog/qwen-3-5-vs-gpt-claude/</guid><pubDate>Thu, 26 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;You asked for this. After our &lt;a href=&quot;https://docs.cheapestinference.com/blog/open-source-models-are-production-ready/&quot;&gt;first benchmark post&lt;/a&gt;, the most requested model was Qwen 3.5. Here it is — &lt;strong&gt;4 models across 5 metrics&lt;/strong&gt;, same models in every chart:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Open-source:&lt;/strong&gt; Qwen3.5-397B-A17B (flagship), Qwen3.5-35B-A3B (efficient)
&lt;strong&gt;Proprietary:&lt;/strong&gt; GPT-5.4, Claude Opus 4.6&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;knowledge-mmlu-pro&quot;&gt;Knowledge: MMLU-Pro (%)&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;GPT-5.4&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;88.5%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Qwen3.5 397B&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;87.8%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Qwen3.5 35B&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;85.3%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Claude Opus 4.6&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;82.0%&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;GPT-5.4 leads at 88.5%, but Qwen3.5-397B is only 0.7 points behind — statistical noise. The 35B, with only 3B active parameters, scores 85.3%, beating Opus by 3.3 points. The total spread across all four models is just 6.5 points.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Qwen3.5-397B matches GPT-5.4 at roughly a fifth of the cost. The 35B beats Opus at 1/23rd.&lt;/strong&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;reasoning-gpqa-diamond&quot;&gt;Reasoning: GPQA Diamond (%)&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;GPT-5.4&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;92.0%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Claude Opus 4.6&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;91.3%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Qwen3.5 397B&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;88.4%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Qwen3.5 35B&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;84.2%&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Proprietary models lead on graduate-level reasoning. GPT-5.4 at 92% and Opus at 91.3% are strong. But Qwen3.5-397B at 88.4% is within 4 points — and costs $0.54/M vs $2.50 and $5.00. The 35B at 84.2% is still PhD-level performance for $0.22/M input.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;code-livecodebench-v6&quot;&gt;Code: LiveCodeBench v6 (%)&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;GPT-5.4&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;84.0%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Qwen3.5 397B&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;83.6%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Claude Opus 4.6&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;76.0%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Qwen3.5 35B&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;74.6%&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The 397B essentially ties GPT-5.4 on competitive coding — 0.4 points apart. Both beat Opus by 8+ points. The 35B at 74.6% is within 2 points of Opus, at 1/23rd the price.&lt;/p&gt;
&lt;p&gt;For dedicated coding workloads, we also serve &lt;a href=&quot;https://docs.cheapestinference.com/pricing&quot;&gt;Qwen3-Coder-480B&lt;/a&gt; (SWE-bench Verified: 69.6%, comparable to Claude Sonnet 4).&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;speed-output-tokens-per-second&quot;&gt;Speed: output tokens per second&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Qwen3.5 35B&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;178 t/s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Qwen3.5 397B&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;84 t/s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;GPT-5.4&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~78 t/s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Claude Opus 4.6&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;46 t/s&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The 35B’s MoE architecture pays off — 178 tok/s is 2.3x faster than GPT-5.4 and 3.9x faster than Opus. Even the 397B flagship at 84 tok/s outpaces both proprietary models. This is what happens when only 3B or 17B parameters activate per token instead of the full model.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Speed data from &lt;a href=&quot;https://artificialanalysis.ai/leaderboards/models&quot;&gt;Artificial Analysis&lt;/a&gt;. Actual speeds on our infrastructure may differ.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;price-input-cost-per-million-tokens&quot;&gt;Price: input cost per million tokens&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Qwen3.5 35B&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;$0.22&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Qwen3.5 397B&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;$0.54&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;GPT-5.4&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;$2.50&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Claude Opus 4.6&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;$5.00&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This is the chart that matters. Opus costs &lt;strong&gt;23x more&lt;/strong&gt; than the 35B and &lt;strong&gt;9x more&lt;/strong&gt; than the 397B. GPT-5.4 costs &lt;strong&gt;5x more&lt;/strong&gt; than the 397B. The quality difference? Single-digit percentage points on every benchmark.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-full-picture&quot;&gt;The full picture&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
&lt;svg viewBox=&quot;-40 0 480 400&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot;&gt;
  &lt;!-- Grid rings --&gt;
  &lt;polygon points=&quot;200,120 270,190 200,260 130,190&quot; fill=&quot;none&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/polygon&gt;
  &lt;polygon points=&quot;200,50 340,190 200,330 60,190&quot; fill=&quot;none&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/polygon&gt;
  &lt;!-- Axes --&gt;
  &lt;line x1=&quot;200&quot; y1=&quot;190&quot; x2=&quot;200&quot; y2=&quot;50&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/line&gt;
  &lt;line x1=&quot;200&quot; y1=&quot;190&quot; x2=&quot;340&quot; y2=&quot;190&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/line&gt;
  &lt;line x1=&quot;200&quot; y1=&quot;190&quot; x2=&quot;200&quot; y2=&quot;330&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/line&gt;
  &lt;line x1=&quot;200&quot; y1=&quot;190&quot; x2=&quot;60&quot; y2=&quot;190&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/line&gt;
  &lt;!-- GPT-5.4 — gray fill for reference --&gt;
  &lt;polygon points=&quot;200,57 312,190 200,312.5 145.4,190&quot; fill=&quot;#9A9490&quot; fill-opacity=&quot;0.08&quot; stroke=&quot;#9A9490&quot; stroke-width=&quot;2&quot;&gt;&lt;/polygon&gt;
  &lt;!-- Claude Opus 4.6 — dashed --&gt;
  &lt;polygon points=&quot;200,113 305.4,190 200,236.6 167.8,190&quot; fill=&quot;none&quot; stroke=&quot;#6B6560&quot; stroke-width=&quot;1.5&quot; stroke-dasharray=&quot;6 3&quot;&gt;&lt;/polygon&gt;
  &lt;!-- Qwen3.5-397B — indigo --&gt;
  &lt;polygon points=&quot;200,59.8 278.4,190 200,304.4 141.2,190&quot; fill=&quot;#6366F1&quot; fill-opacity=&quot;0.12&quot; stroke=&quot;#6366F1&quot; stroke-width=&quot;2.5&quot;&gt;&lt;/polygon&gt;
  &lt;!-- Qwen3.5-35B — teal --&gt;
  &lt;polygon points=&quot;200,122.8 239.2,190 200,275.1 75.4,190&quot; fill=&quot;#14B8A6&quot; fill-opacity=&quot;0.1&quot; stroke=&quot;#14B8A6&quot; stroke-width=&quot;2&quot;&gt;&lt;/polygon&gt;
  &lt;!-- Data points - 397B --&gt;
  &lt;circle cx=&quot;200&quot; cy=&quot;59.8&quot; r=&quot;3.5&quot; fill=&quot;#6366F1&quot;&gt;&lt;/circle&gt;
  &lt;circle cx=&quot;278.4&quot; cy=&quot;190&quot; r=&quot;3.5&quot; fill=&quot;#6366F1&quot;&gt;&lt;/circle&gt;
  &lt;circle cx=&quot;200&quot; cy=&quot;304.4&quot; r=&quot;3.5&quot; fill=&quot;#6366F1&quot;&gt;&lt;/circle&gt;
  &lt;circle cx=&quot;141.2&quot; cy=&quot;190&quot; r=&quot;3.5&quot; fill=&quot;#6366F1&quot;&gt;&lt;/circle&gt;
  &lt;!-- Labels --&gt;
  &lt;text x=&quot;200&quot; y=&quot;30&quot; text-anchor=&quot;middle&quot; font-size=&quot;13&quot; font-weight=&quot;600&quot; fill=&quot;#1A1A1A&quot;&gt;Code&lt;/text&gt;
  &lt;text x=&quot;355&quot; y=&quot;194&quot; text-anchor=&quot;start&quot; font-size=&quot;13&quot; font-weight=&quot;600&quot; fill=&quot;#1A1A1A&quot;&gt;Reasoning&lt;/text&gt;
  &lt;text x=&quot;200&quot; y=&quot;355&quot; text-anchor=&quot;middle&quot; font-size=&quot;13&quot; font-weight=&quot;600&quot; fill=&quot;#1A1A1A&quot;&gt;Knowledge&lt;/text&gt;
  &lt;text x=&quot;45&quot; y=&quot;194&quot; text-anchor=&quot;end&quot; font-size=&quot;13&quot; font-weight=&quot;600&quot; fill=&quot;#1A1A1A&quot;&gt;Speed&lt;/text&gt;
  &lt;!-- Legend --&gt;
  &lt;rect x=&quot;40&quot; y=&quot;370&quot; width=&quot;14&quot; height=&quot;3&quot; rx=&quot;1&quot; fill=&quot;#6366F1&quot;&gt;&lt;/rect&gt;
  &lt;text x=&quot;58&quot; y=&quot;374&quot; font-size=&quot;9&quot; fill=&quot;#6B6560&quot;&gt;Qwen3.5 397B&lt;/text&gt;
  &lt;rect x=&quot;145&quot; y=&quot;370&quot; width=&quot;14&quot; height=&quot;3&quot; rx=&quot;1&quot; fill=&quot;#14B8A6&quot;&gt;&lt;/rect&gt;
  &lt;text x=&quot;163&quot; y=&quot;374&quot; font-size=&quot;9&quot; fill=&quot;#6B6560&quot;&gt;Qwen3.5 35B&lt;/text&gt;
  &lt;rect x=&quot;230&quot; y=&quot;370&quot; width=&quot;14&quot; height=&quot;3&quot; rx=&quot;1&quot; fill=&quot;#9A9490&quot;&gt;&lt;/rect&gt;
  &lt;text x=&quot;248&quot; y=&quot;374&quot; font-size=&quot;9&quot; fill=&quot;#6B6560&quot;&gt;GPT-5.4&lt;/text&gt;
  &lt;line x1=&quot;310&quot; y1=&quot;371&quot; x2=&quot;324&quot; y2=&quot;371&quot; stroke=&quot;#6B6560&quot; stroke-width=&quot;1.5&quot; stroke-dasharray=&quot;4 2&quot;&gt;&lt;/line&gt;
  &lt;text x=&quot;328&quot; y=&quot;374&quot; font-size=&quot;9&quot; fill=&quot;#6B6560&quot;&gt;Opus 4.6&lt;/text&gt;
&lt;/svg&gt;
&lt;/div&gt;
&lt;p&gt;Quality only — no price axis. GPT-5.4 (gray) has the largest shape. Opus (dashed) is strong on reasoning and code. The 397B (indigo) nearly overlaps GPT-5.4 on code and knowledge. The 35B (teal) pulls hard left on speed — at 178 tok/s it is more than twice as fast as anything else here. Price tells its own story in the chart above.&lt;/p&gt;
&lt;div&gt;&lt;h2 id=&quot;the-scorecard&quot;&gt;The scorecard&lt;/h2&gt;&lt;/div&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;Winner&lt;/th&gt;&lt;th&gt;Qwen3.5 397B&lt;/th&gt;&lt;th&gt;GPT-5.4&lt;/th&gt;&lt;th&gt;Claude Opus 4.6&lt;/th&gt;&lt;th&gt;Gap (397B vs best)&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Knowledge&lt;/strong&gt; (MMLU-Pro)&lt;/td&gt;&lt;td&gt;GPT-5.4&lt;/td&gt;&lt;td&gt;87.8%&lt;/td&gt;&lt;td&gt;88.5%&lt;/td&gt;&lt;td&gt;82.0%&lt;/td&gt;&lt;td&gt;-0.7 pts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Reasoning&lt;/strong&gt; (GPQA)&lt;/td&gt;&lt;td&gt;GPT-5.4&lt;/td&gt;&lt;td&gt;88.4%&lt;/td&gt;&lt;td&gt;92.0%&lt;/td&gt;&lt;td&gt;91.3%&lt;/td&gt;&lt;td&gt;-3.6 pts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Code&lt;/strong&gt; (LiveCodeBench)&lt;/td&gt;&lt;td&gt;GPT-5.4&lt;/td&gt;&lt;td&gt;83.6%&lt;/td&gt;&lt;td&gt;84.0%&lt;/td&gt;&lt;td&gt;76.0%&lt;/td&gt;&lt;td&gt;-0.4 pts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Speed&lt;/strong&gt; (tok/s)&lt;/td&gt;&lt;td&gt;Qwen3.5 397B&lt;/td&gt;&lt;td&gt;84 t/s&lt;/td&gt;&lt;td&gt;~78 t/s&lt;/td&gt;&lt;td&gt;46 t/s&lt;/td&gt;&lt;td&gt;1.1x faster&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Price&lt;/strong&gt; ($/M input)&lt;/td&gt;&lt;td&gt;Qwen3.5 397B&lt;/td&gt;&lt;td&gt;$0.54&lt;/td&gt;&lt;td&gt;$2.50&lt;/td&gt;&lt;td&gt;$5.00&lt;/td&gt;&lt;td&gt;4.6x cheaper&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Same weight class, different price tag.&lt;/strong&gt; The 397B trades 0.4–3.6 points on quality for 4.6x lower price and faster speed. It beats Opus on 4 out of 5 metrics outright.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: The Qwen3.5-35B-A3B ($0.22/M) scores 85.3% MMLU-Pro, 84.2% GPQA, 74.6% LiveCodeBench at 178 tok/s — beating Opus on knowledge and speed at 23x less cost. A different weight class, but worth considering if speed and price matter more than the last few quality points.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-real-question-what-are-you-paying-for&quot;&gt;The real question: what are you paying for?&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;The quality gap between Qwen3.5-397B and GPT-5.4 is &lt;strong&gt;0.7 points on knowledge, 0.4 points on code&lt;/strong&gt;. The price gap is &lt;strong&gt;4.6x&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Put differently:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Model&lt;/th&gt;&lt;th&gt;MMLU-Pro&lt;/th&gt;&lt;th&gt;Cost per quality point&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Qwen3.5 35B&lt;/td&gt;&lt;td&gt;85.3%&lt;/td&gt;&lt;td&gt;$0.003 per point per M tokens&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Qwen3.5 397B&lt;/td&gt;&lt;td&gt;87.8%&lt;/td&gt;&lt;td&gt;$0.006 per point per M tokens&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GPT-5.4&lt;/td&gt;&lt;td&gt;88.5%&lt;/td&gt;&lt;td&gt;$0.028 per point per M tokens&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Claude Opus 4.6&lt;/td&gt;&lt;td&gt;82.0%&lt;/td&gt;&lt;td&gt;$0.061 per point per M tokens&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;Opus costs &lt;strong&gt;20x more per quality point&lt;/strong&gt; than the 35B — and scores lower. GPT-5.4 leads on quality but costs 4.6–11x more than the Qwen models for single-digit advantages.&lt;/p&gt;
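The cost-per-point column is just the input price divided by the benchmark score. A quick sketch using the figures quoted in this post:

```python
# Cost per quality point: input price ($/M tokens) divided by MMLU-Pro score.
# Prices and scores are the figures from the tables in this post.
models = {
    "Qwen3.5 35B":     (0.22, 85.3),
    "Qwen3.5 397B":    (0.54, 87.8),
    "GPT-5.4":         (2.50, 88.5),
    "Claude Opus 4.6": (5.00, 82.0),
}

for name, (price, score) in models.items():
    print(f"{name}: ${price / score:.3f} per point per M tokens")
```

The printed values reproduce the table: $0.003, $0.006, $0.028, and $0.061 respectively.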
&lt;p&gt;For most workloads, the last 3% of benchmark performance isn’t worth a 5x price increase. And for workloads where it is — the 397B gets you within 1 point of GPT-5.4 on knowledge and code at a fraction of the cost.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;also-available-specialized-qwen-models&quot;&gt;Also available: specialized Qwen models&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Beyond the general-purpose models, we serve two Qwen specialists:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Qwen3-Coder-480B&lt;/strong&gt; — SWE-bench Verified 69.6%, comparable to Claude Sonnet 4. Built for agentic coding.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Qwen3-235B-Thinking&lt;/strong&gt; — Chain-of-thought reasoning specialist. When you need the model to show its work.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Both available through the same API, same flat-rate plans.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;All Qwen 3.5 models are available now on our API. Flat rate from $20/mo, or pay-as-you-go credits. &lt;a href=&quot;https://cheapestinference.com/pricing&quot;&gt;See pricing and try it →&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sources:&lt;/strong&gt; &lt;a href=&quot;https://huggingface.co/Qwen/Qwen3.5-397B-A17B&quot;&gt;Qwen3.5-397B Model Card&lt;/a&gt; · &lt;a href=&quot;https://huggingface.co/Qwen/Qwen3.5-35B-A3B&quot;&gt;Qwen3.5-35B Model Card&lt;/a&gt; · &lt;a href=&quot;https://artificialanalysis.ai/leaderboards/models&quot;&gt;Artificial Analysis Leaderboard&lt;/a&gt; · &lt;a href=&quot;https://artificialanalysis.ai/evaluations/gpqa-diamond&quot;&gt;GPQA Diamond Leaderboard&lt;/a&gt; · &lt;a href=&quot;https://openai.com/api/pricing/&quot;&gt;OpenAI Pricing&lt;/a&gt; · &lt;a href=&quot;https://platform.claude.com/docs/en/about-claude/pricing&quot;&gt;Anthropic Pricing&lt;/a&gt; · &lt;a href=&quot;https://livecodebench.github.io/leaderboard.html&quot;&gt;LiveCodeBench Leaderboard&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>OpenClaw is free. Running it is not.</title><link>https://docs.cheapestinference.com/blog/openclaw-cost-problem/</link><guid isPermaLink="true">https://docs.cheapestinference.com/blog/openclaw-cost-problem/</guid><pubDate>Tue, 24 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;OpenClaw has 247,000 GitHub stars. It’s free, open-source, and runs locally. You install it, point it at an LLM, and it writes code, browses the web, queries databases, and executes files on your behalf.&lt;/p&gt;
&lt;p&gt;The agent is free. The inference is not.&lt;/p&gt;
&lt;p&gt;Every time OpenClaw calls a model, it re-sends the entire conversation history — every tool output, every file it read, every intermediate result. By iteration 20 of a typical task, the input context is 30,000+ tokens. By iteration 40, it’s past 100,000. And it sends this &lt;em&gt;every single request&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;This is not a bug. It’s how agents work. And it’s why running OpenClaw on pay-per-token APIs costs $300–600/month for active users — sometimes more.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;where-the-tokens-go&quot;&gt;Where the tokens go&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;We broke down token consumption for a typical OpenClaw coding task: “add authentication to an Express API.” The agent completed it in 38 tool calls.&lt;/p&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Context accumulation&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~280K tokens&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;System prompt (×38)&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~156K tokens&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Tool outputs (files, etc.)&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~70K tokens&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Agent output&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~19K tokens&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Total: &lt;strong&gt;~525,000 tokens for a single task&lt;/strong&gt;. The agent’s actual output — the code it wrote — was 19K tokens. The other 96% is overhead.&lt;/p&gt;
&lt;p&gt;On Claude Opus at $15/M input + $75/M output, that single task costs &lt;strong&gt;$9.18&lt;/strong&gt;. Run five tasks a day and you’re at &lt;strong&gt;$1,377/month&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;On DeepSeek V3.2 via a pay-per-token provider at $0.27/M input + $1.10/M output, the same task costs &lt;strong&gt;$0.16&lt;/strong&gt;. Better — but 20 tasks a day is still &lt;strong&gt;$96/month&lt;/strong&gt;, and that’s &lt;em&gt;one agent&lt;/em&gt;.&lt;/p&gt;
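The per-task arithmetic can be reproduced from the token breakdown: roughly 506K tokens billed as input and 19K as output. Small differences against the quoted figures come down to how the input/output split is estimated.

```python
# Reproducing the per-task cost from the token breakdown above.
# ~525K total tokens, of which ~19K is output; the rest bills as input.
INPUT_TOKENS = 525_000 - 19_000   # 506,000
OUTPUT_TOKENS = 19_000

def task_cost(input_per_m: float, output_per_m: float) -> float:
    """Dollar cost of one task at the given $/M-token prices."""
    return (INPUT_TOKENS / 1e6) * input_per_m + (OUTPUT_TOKENS / 1e6) * output_per_m

opus = task_cost(15.00, 75.00)   # roughly $9 per task
dsv = task_cost(0.27, 1.10)      # roughly $0.16 per task
print(f"Opus: ${opus:.2f}/task, ${opus * 5 * 30:,.0f}/mo at 5 tasks/day")
print(f"DeepSeek V3.2: ${dsv:.2f}/task, ${dsv * 20 * 30:.0f}/mo at 20 tasks/day")
```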
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-three-cost-traps&quot;&gt;The three cost traps&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;We covered these in depth in &lt;a href=&quot;https://docs.cheapestinference.com/blog/why-your-ai-agent-needs-a-budget/&quot;&gt;Why your AI agent needs a budget&lt;/a&gt;, but here’s the OpenClaw-specific version:&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;1-context-grows-quadratically&quot;&gt;1. Context grows quadratically&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;OpenClaw reads files into context. If it reads a 2,000-token file at step 5, that file gets re-sent at steps 6, 7, 8… all the way to 38. That single file read costs 2,000 × 33 remaining steps = &lt;strong&gt;66,000 tokens&lt;/strong&gt; in re-transmission alone.&lt;/p&gt;
&lt;p&gt;Users report session contexts at 56–58% of the 400K context window during normal use. This isn’t a failure mode — it’s the architecture working as designed.&lt;/p&gt;
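The re-transmission math generalizes: a file read at step k gets re-sent on every remaining step of the loop. A minimal sketch (the function name is ours, not part of OpenClaw):

```python
# Re-transmission cost of a single file read in an N-step agent loop:
# the file's tokens are re-sent on every step after the read.
def retransmission_tokens(file_tokens: int, read_at_step: int, total_steps: int) -> int:
    remaining_steps = total_steps - read_at_step
    return file_tokens * remaining_steps

# The example from this post: a 2,000-token file read at step 5 of 38.
print(retransmission_tokens(2_000, 5, 38))  # → 66000
```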
&lt;div&gt;&lt;h3 id=&quot;2-system-prompt-is-a-fixed-tax&quot;&gt;2. System prompt is a fixed tax&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;OpenClaw’s system prompt is ~4,100 tokens. It gets sent with every request. Over 38 tool calls, that’s ~156K tokens just in system prompts — the second-largest line in the breakdown above. You pay this whether the agent does useful work or not.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;3-wrong-model-for-the-job&quot;&gt;3. Wrong model for the job&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;OpenClaw defaults to a single model for everything. But not every tool call needs the same intelligence:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reading a file and deciding what to edit? &lt;strong&gt;Llama 3.1 8B&lt;/strong&gt; handles this at 200 tokens/sec.&lt;/li&gt;
&lt;li&gt;Writing complex authentication logic? &lt;strong&gt;DeepSeek V3.2&lt;/strong&gt; or &lt;strong&gt;Kimi K2.5&lt;/strong&gt; is the right call.&lt;/li&gt;
&lt;li&gt;Formatting a config file? &lt;strong&gt;Any 8B model&lt;/strong&gt; is overkill but still cheaper than Opus.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We wrote a full guide on this pattern: &lt;a href=&quot;https://docs.cheapestinference.com/blog/multi-model-architecture/&quot;&gt;Building a multi-model architecture&lt;/a&gt;. Routing agent requests to the right model can cut costs by 60–80% without reducing output quality.&lt;/p&gt;
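A complexity router can be as simple as a lookup over tool names. A hypothetical sketch; the model IDs, tool names, and tiers are illustrative, not a fixed API:

```python
# A minimal complexity router: a cheap model for mechanical steps,
# a stronger model for generation-heavy work. All names are illustrative.
CHEAP = "llama-3.1-8b"
STRONG = "deepseek-v3.2"

# Tool calls that don't need frontier-level intelligence.
MECHANICAL = {"read_file", "list_files", "format_config", "classify"}

def pick_model(tool_name: str) -> str:
    """Route mechanical tool calls to the cheap tier, everything else up."""
    return CHEAP if tool_name in MECHANICAL else STRONG

assert pick_model("read_file") == CHEAP          # file read: cheap tier
assert pick_model("write_auth_logic") == STRONG  # codegen: strong tier
```

Real routers often use a small classifier model instead of a static table, but the shape is the same: the decision costs far less than the savings.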
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-math-on-flat-rate-vs-pay-per-token&quot;&gt;The math on flat-rate vs. pay-per-token&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Here’s the comparison for an OpenClaw user running ~20 tasks/day:&lt;/p&gt;
&lt;div&gt;
  &lt;table&gt;
    &lt;tbody&gt;&lt;tr&gt;
      &lt;th&gt;Provider&lt;/th&gt;
      &lt;th&gt;Cost/task&lt;/th&gt;
      &lt;th&gt;20 tasks/day&lt;/th&gt;
      &lt;th&gt;Monthly&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Claude Opus (direct)&lt;/td&gt;
      &lt;td&gt;$9.18&lt;/td&gt;
      &lt;td&gt;$183.60&lt;/td&gt;
      &lt;td&gt;$5,508&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;GPT-5.4 (direct)&lt;/td&gt;
      &lt;td&gt;$4.73&lt;/td&gt;
      &lt;td&gt;$94.60&lt;/td&gt;
      &lt;td&gt;$2,838&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;DeepSeek V3.2 (per-token)&lt;/td&gt;
      &lt;td&gt;$0.16&lt;/td&gt;
      &lt;td&gt;$3.20&lt;/td&gt;
      &lt;td&gt;$96&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;CheapestInference Pro&lt;/td&gt;
      &lt;td&gt;—&lt;/td&gt;
      &lt;td&gt;—&lt;/td&gt;
      &lt;td&gt;$50/mo flat&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;Flat-rate means you don’t care about context accumulation. The 280K tokens of context overhead that makes pay-per-token expensive? Irrelevant. The system prompt tax? Doesn’t matter. Your agent can call models 24/7 and the bill is the same.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;what-wed-actually-recommend&quot;&gt;What we’d actually recommend&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;If you’re running OpenClaw, here’s the setup we see working best:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Use open-source models.&lt;/strong&gt; DeepSeek V3.2 and Kimi K2.5 score within 4 points of proprietary models on coding benchmarks (&lt;a href=&quot;https://docs.cheapestinference.com/blog/open-source-models-are-production-ready/&quot;&gt;the data&lt;/a&gt;). The gap doesn’t justify a 50x cost difference.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Route by complexity.&lt;/strong&gt; Don’t send file reads and simple decisions to the same model as complex code generation. A router model costs fractions of a cent per classification. Full guide: &lt;a href=&quot;https://docs.cheapestinference.com/blog/multi-model-architecture/&quot;&gt;Multi-model architecture&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. Set per-key budgets.&lt;/strong&gt; One API key per agent, each with a dollar-denominated budget that resets every few hours. When the budget runs out, the agent pauses instead of burning through your allocation. We built this into every key: &lt;a href=&quot;https://docs.cheapestinference.com/blog/why-your-ai-agent-needs-a-budget/&quot;&gt;Agent budgets explained&lt;/a&gt;.&lt;/p&gt;
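The per-key budget pattern reduces to a spend counter with a windowed reset. A minimal client-side sketch with illustrative numbers (the hosted version enforces this server-side per key):

```python
import time

# A dollar-denominated budget that resets on a fixed window.
class KeyBudget:
    def __init__(self, dollars: float, window_hours: float):
        self.limit = dollars
        self.window = window_hours * 3600
        self.spent = 0.0
        self.window_start = time.time()

    def charge(self, cost: float) -> bool:
        """Record spend; return False (pause the agent) once the cap is hit."""
        now = time.time()
        if now - self.window_start >= self.window:
            self.spent, self.window_start = 0.0, now  # window rolled over
        if self.spent + cost > self.limit:
            return False
        self.spent += cost
        return True

budget = KeyBudget(dollars=2.00, window_hours=5)
assert budget.charge(0.16)       # a normal task fits
assert not budget.charge(5.00)   # an overrun pauses instead of billing
```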
&lt;p&gt;&lt;strong&gt;4. Handle rate limits automatically.&lt;/strong&gt; Budget caps mean your agent &lt;em&gt;will&lt;/em&gt; hit 429s. That’s the point — the cap is working. But OpenClaw kills the conversation when it gets a 429. The agent stops, and if you close the dashboard, that conversation is gone.&lt;/p&gt;
&lt;p&gt;We built an OpenClaw plugin that fixes this: &lt;a href=&quot;https://github.com/cheapestinference/openclaw-plugin-ratelimit-retry&quot;&gt;&lt;code dir=&quot;auto&quot;&gt;openclaw-ratelimit-retry&lt;/code&gt;&lt;/a&gt;. It hooks into &lt;code dir=&quot;auto&quot;&gt;agent_end&lt;/code&gt;, detects retriable 429s, parks the session on disk, and waits for the budget window to reset. Then it sends &lt;code dir=&quot;auto&quot;&gt;chat.send&lt;/code&gt; to the original session — resuming the conversation with its full transcript, as if you had typed a message.&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;openclaw&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;plugins&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;install&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;@cheapestinference/openclaw-ratelimit-retry&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;span&gt;~/.openclaw/config.yaml&lt;/span&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;plugins&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;  &lt;/span&gt;&lt;span&gt;ratelimit-retry&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;budgetWindowHours&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;5&lt;/span&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;# matches your CheapestInference budget reset&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;maxRetryAttempts&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;3&lt;/span&gt;&lt;span&gt;     &lt;/span&gt;&lt;span&gt;# give up after 3 consecutive 429s&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;checkIntervalMinutes&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;5&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;# check every 5 min for ready retries&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;The plugin is zero-dependency, persists across server restarts, deduplicates by session, and handles edge cases like sub-agents, queue overflow, and corrupted state files. If the retry itself hits a 429, it re-queues automatically. No tokens wasted on re-sending from scratch — the agent picks up exactly where it left off.&lt;/p&gt;
&lt;p&gt;This turns budget caps from “your agent crashes” into “your agent naps and wakes up.” Set it up once and forget about it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5. Consider flat-rate.&lt;/strong&gt; If your agent runs more than a few tasks per day, per-token pricing works against you. Every token of context overhead is money. On flat-rate, context overhead is free — use the full context window, re-send everything, let the agent work without constraint.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-irony&quot;&gt;The irony&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;OpenClaw is free because the code runs on your machine. But the valuable part — the intelligence — runs on someone else’s GPUs. The agent framework is the cheap part. Inference is the expensive part.&lt;/p&gt;
&lt;p&gt;Open-source models on flat-rate infrastructure flip this equation. The models are free. The inference is flat. The only variable cost left is your time.&lt;/p&gt;
&lt;p&gt;Point your OpenClaw &lt;code dir=&quot;auto&quot;&gt;base_url&lt;/code&gt; at &lt;code dir=&quot;auto&quot;&gt;https://api.cheapestinference.com/v1&lt;/code&gt; and find out what unconstrained agents actually cost: nothing more than you already budgeted.&lt;/p&gt;</content:encoded></item><item><title>Why your AI agent needs a budget</title><link>https://docs.cheapestinference.com/blog/why-your-ai-agent-needs-a-budget/</link><guid isPermaLink="true">https://docs.cheapestinference.com/blog/why-your-ai-agent-needs-a-budget/</guid><pubDate>Tue, 24 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;There’s a pattern that plays out every week in AI Discord servers and GitHub issues: someone deploys an agent, goes to bed, and wakes up to a $400 bill from a loop that ran all night.&lt;/p&gt;
&lt;p&gt;Agents are not humans. They don’t get tired. They don’t notice when they’re repeating themselves. And they consume tokens at a rate that makes interactive chat look like a rounding error.&lt;/p&gt;
&lt;p&gt;If you’re running agents in production — or even in development — you need a budget. Here’s why, and how to implement one.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;agents-consume-1050x-more-tokens-than-humans&quot;&gt;Agents consume 10–50x more tokens than humans&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;A human chatting with an LLM sends a message, reads the response, thinks, types another message. Maybe 10 requests per hour, a few hundred tokens each.&lt;/p&gt;
&lt;p&gt;An agent running a tool loop does this:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;1. Read task description (system prompt + context)     → 4,000 tokens input&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;2. Call tool #1                                         → 500 tokens output&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;3. Receive tool result, re-send full context + result   → 5,200 tokens input&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;4. Call tool #2                                         → 500 tokens output&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;5. Receive result, re-send everything                   → 6,800 tokens input&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;6. ... repeat 20-40 times ...&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;Each iteration re-sends the entire conversation history. By step 20, the input context is 30,000+ tokens — and the agent sends it &lt;em&gt;every single time&lt;/em&gt;. A 40-step agent loop can consume 500,000+ tokens in a single task. That’s what a human user consumes in a week.&lt;/p&gt;
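The loop above can be simulated to see why cost grows superlinearly: every step pays for the entire accumulated history again. Step sizes below are toy numbers for illustration, not a fit to any particular trace:

```python
# Simulating context growth in a tool loop: each step re-sends the full
# accumulated history, so total input grows superlinearly with step count.
def loop_tokens(steps: int, base: int = 4_000,
                tool_result: int = 1_000, output: int = 500) -> int:
    total, context = 0, base
    for _ in range(steps):
        total += context + output        # pay for history + this step's output
        context += tool_result + output  # history keeps growing
    return total

print(f"{loop_tokens(10):,} vs {loop_tokens(40):,}")  # 4x the steps, ~12x the tokens
```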
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Agent (40-step loop)&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~500K tokens&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Agent (10-step loop)&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~100K tokens&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Human (1 hour chat)&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~10K tokens&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;This is normal behavior. The agent is doing its job. The problem is when it does its job &lt;em&gt;wrong&lt;/em&gt; — and nobody is watching.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-three-failure-modes-that-drain-budgets&quot;&gt;The three failure modes that drain budgets&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;&lt;h3 id=&quot;1-infinite-tool-loops&quot;&gt;1. Infinite tool loops&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;The agent calls a tool, gets an error, retries the same call, gets the same error, retries again. Without a loop detector or retry cap, this continues until it hits your rate limit or exhausts your budget.&lt;/p&gt;
&lt;p&gt;This is the most common failure mode. It happens when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;An API the agent calls is temporarily down&lt;/li&gt;
&lt;li&gt;The agent’s output doesn’t match the tool’s expected input format&lt;/li&gt;
&lt;li&gt;The agent misinterprets the tool result and keeps “trying harder”&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A single infinite loop can consume millions of tokens in minutes.&lt;/p&gt;
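&lt;p&gt;A loop detector does not need to be sophisticated. The sketch below counts identical (tool, arguments) pairs and refuses the call after a cap; the class and method names are illustrative, not any particular framework’s API:&lt;/p&gt;

```python
from collections import Counter

class LoopDetector:
    """Refuse a tool call once the same (tool, args) pair repeats too often.
    A minimal sketch; names and the default threshold are illustrative."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.seen = Counter()

    def allow(self, tool_name: str, args_json: str) -> bool:
        """Return False once this exact call has hit the repeat cap."""
        self.seen[(tool_name, args_json)] += 1
        return self.seen[(tool_name, args_json)] <= self.max_repeats

detector = LoopDetector(max_repeats=3)
results = [detector.allow("search", '{"q": "pricing"}') for _ in range(5)]
# results == [True, True, True, False, False]: the loop is cut off at call 4.
```

&lt;p&gt;Canonicalize the arguments (e.g. sorted JSON keys) before counting, or near-identical retries will slip past the detector.&lt;/p&gt;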
&lt;div&gt;&lt;h3 id=&quot;2-context-accumulation&quot;&gt;2. Context accumulation&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Every tool result gets appended to the conversation. The agent never summarizes or trims. By step 30, the input payload is 40K+ tokens, and most of it is irrelevant tool outputs from step 3.&lt;/p&gt;
&lt;p&gt;This isn’t a bug — it’s the default behavior of most agent frameworks. The context grows linearly with each step, and each step costs more than the last because the full context is re-sent.&lt;/p&gt;
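&lt;p&gt;Trimming before each call takes a few lines. This sketch keeps the system prompt plus only the most recent messages; production frameworks usually summarize old tool output instead of dropping it, but the effect on cost is the same:&lt;/p&gt;

```python
def trim_context(messages: list, keep_recent: int = 6) -> list:
    """Keep the system prompt plus the keep_recent newest messages.
    A sketch: dropping (rather than summarizing) old tool output is lossy."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_recent:]

history = [{"role": "system", "content": "You are an agent."}]
history += [{"role": "tool", "content": f"result {i}"} for i in range(30)]
trimmed = trim_context(history)
# 31 messages shrink to 7: the system prompt plus the 6 newest tool results.
```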
&lt;div&gt;&lt;h3 id=&quot;3-wrong-model-for-the-job&quot;&gt;3. Wrong model for the job&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;An agent using DeepSeek R1 (a reasoning model at ~30 tokens/second) for tasks that don’t require reasoning — file listing, simple classification, template generation — is burning expensive compute for no quality gain. R1 also produces internal chain-of-thought tokens that you pay for but never see.&lt;/p&gt;
&lt;p&gt;The fix is model routing — covered in our &lt;a href=&quot;https://docs.cheapestinference.com/blog/multi-model-architecture&quot;&gt;multi-model architecture guide&lt;/a&gt;. But even with routing, you need a budget as a backstop.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;what-happens-without-a-budget&quot;&gt;What happens without a budget&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Without a spending cap, any of these failures means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pay-as-you-go API&lt;/strong&gt;: The bill grows until you notice. Stories of $500+ surprise bills are common on forums. The provider has no reason to stop you — they’re selling tokens.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Self-hosted inference&lt;/strong&gt;: The agent consumes your entire GPU allocation, starving other workloads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shared platform&lt;/strong&gt;: One user’s agent consumes capacity that other users need.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In all three cases, the damage scales with time. An agent that runs for 8 hours unattended can do 8 hours of damage.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;how-budget-caps-work&quot;&gt;How budget caps work&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;A budget cap is a dollar ceiling on how much a single key can spend in a time window. When the cap is reached, requests return a &lt;code dir=&quot;auto&quot;&gt;429 Too Many Requests&lt;/code&gt; error. No overage charges. No surprise bills. The agent stops, and you investigate.&lt;/p&gt;
&lt;p&gt;The key properties of a good budget system:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Dollar-denominated, not token-denominated.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Token limits sound intuitive but don’t work across models. 100,000 tokens of Llama 3.1 8B cost $0.002; the same tokens on a large reasoning model cost 100x more. A dollar budget normalizes across all models automatically.&lt;/p&gt;
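&lt;p&gt;The normalization is easy to see with hypothetical prices consistent with the figures above (the 8B at $0.02 per million tokens, the reasoning model at 100x that):&lt;/p&gt;

```python
# Hypothetical per-million-token prices; check your provider's rate card.
PRICE_PER_M_TOKENS = {
    "meta-llama/llama-3.1-8b-instruct": 0.02,  # $0.002 per 100K tokens
    "large-reasoning-model": 2.00,             # 100x the 8B price
}

def tokens_for_budget(budget_usd: float, model: str) -> int:
    """How many tokens a fixed dollar budget buys on a given model."""
    return round(budget_usd / PRICE_PER_M_TOKENS[model] * 1_000_000)

# The same $0.10 budget buys 5,000,000 tokens on the 8B but only 50,000
# on the reasoning model. A single token limit can't express both.
```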
&lt;p&gt;&lt;strong&gt;2. Time-windowed with automatic reset.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A budget that resets every few hours (e.g. every 5 hours) means a failure in one window doesn’t affect the next. The agent recovers automatically. If you set a one-time budget that never resets, you have to manually intervene every time the agent exhausts it.&lt;/p&gt;
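&lt;p&gt;If windows tile back-to-back from a known start time, an agent can compute how long to sleep after hitting its cap. A sketch (the 5-hour window matches the example; the tiling assumption and helper name are ours):&lt;/p&gt;

```python
import time

WINDOW_SECONDS = 5 * 60 * 60  # the 5-hour reset window from the example above

def seconds_until_reset(window_start: float, now: float) -> float:
    """Time remaining in the current budget window, assuming windows tile
    back-to-back from window_start (an assumption, not a platform guarantee)."""
    elapsed = (now - window_start) % WINDOW_SECONDS
    return WINDOW_SECONDS - elapsed

# After a 429, a resumable agent can simply sleep out the window:
# time.sleep(seconds_until_reset(window_start, time.time()))
```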
&lt;p&gt;&lt;strong&gt;3. Per-key, not per-account.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If you run 5 agents, each should have its own key and its own budget. One runaway agent should not starve the other four. Per-key budgets provide isolation — the same way containers isolate processes.&lt;/p&gt;
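&lt;p&gt;The isolation argument is mechanical enough to model in a toy. This is not the platform’s API, just a demonstration of why one exhausted key leaves the others untouched:&lt;/p&gt;

```python
class BudgetedKey:
    """Toy model of a per-key budget (illustrative, not the real platform API)."""

    def __init__(self, name: str, limit_usd: float):
        self.name = name
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> bool:
        """Return False (standing in for a 429) once the cap would be exceeded."""
        if self.spent_usd + cost_usd > self.limit_usd:
            return False
        self.spent_usd += cost_usd
        return True

keys = {name: BudgetedKey(name, limit_usd=0.50)
        for name in ("research", "coding", "monitoring")}

while keys["research"].charge(0.01):  # runaway loop on one key
    pass
print(keys["coding"].charge(0.01))    # True: the other agents keep working
```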
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;designing-agents-that-handle-budget-limits-gracefully&quot;&gt;Designing agents that handle budget limits gracefully&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;A well-built agent treats a budget limit the same way a well-built web app treats a rate limit — as a normal operational condition, not an unexpected error.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;catch-429s-and-degrade&quot;&gt;Catch 429s and degrade&lt;/h3&gt;&lt;/div&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;from&lt;/span&gt;&lt;span&gt; openai &lt;/span&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; OpenAI, RateLimitError&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;client &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;OpenAI&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;base_url&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;https://api.cheapestinference.com/v1&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;api_key&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;sk_your_agent_key&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;def&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;agent_step&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;messages&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;list&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;span&gt; -&gt; &lt;/span&gt;&lt;span&gt;str&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;try&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;response &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; client.chat.completions.&lt;/span&gt;&lt;span&gt;create&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;            &lt;/span&gt;&lt;span&gt;model&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;deepseek/deepseek-chat-v3-0324&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;            &lt;/span&gt;&lt;span&gt;messages&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;messages&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;return&lt;/span&gt;&lt;span&gt; response.choices[&lt;/span&gt;&lt;span&gt;0&lt;/span&gt;&lt;span&gt;].message.content&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;except&lt;/span&gt;&lt;span&gt; RateLimitError:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;# Budget exhausted — save state, wait for reset&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        
&lt;/span&gt;&lt;span&gt;save_agent_state&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;messages&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;return&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;[BUDGET_LIMIT] Agent paused. Will resume on next window.&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;div&gt;&lt;h3 id=&quot;monitor-spend-proactively&quot;&gt;Monitor spend proactively&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Don’t wait for the 429. Check your remaining budget periodically and adjust behavior:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; requests&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;def&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;check_budget&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;api_key&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;str&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;span&gt; -&gt; &lt;/span&gt;&lt;span&gt;dict&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span&gt;Check remaining budget via the usage endpoint.&lt;/span&gt;&lt;span&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;resp &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; requests.&lt;/span&gt;&lt;span&gt;get&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;https://api.cheapestinference.com/v1/usage&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;headers&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;{&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;Authorization&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;f&lt;/span&gt;&lt;span&gt;&quot;Bearer &lt;/span&gt;&lt;span&gt;{api_key}&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;return&lt;/span&gt;&lt;span&gt; resp.&lt;/span&gt;&lt;span&gt;json&lt;/span&gt;&lt;span&gt;()&lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;budget&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;budget &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;check_budget&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;sk_your_agent_key&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;remaining &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; budget[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;limit&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;] &lt;/span&gt;&lt;span&gt;-&lt;/span&gt;&lt;span&gt; budget[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;spent&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;if&lt;/span&gt;&lt;span&gt; remaining &lt;/span&gt;&lt;span&gt;&amp;#x3C;&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;0.01&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;# Less than $0.01 left — switch to cheapest model or pause&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;switch_to_model&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;meta-llama/llama-3.1-8b-instruct&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;div&gt;&lt;h3 id=&quot;set-retry-caps-in-your-agent-framework&quot;&gt;Set retry caps in your agent framework&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;Every agent framework has a way to limit retries. Use it:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# LangChain&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;agent &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;create_react_agent&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;llm&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;llm&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;tools&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;tools&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;max_iterations&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;25&lt;/span&gt;&lt;span&gt;  &lt;/span&gt;&lt;span&gt;# Hard cap on tool loop iterations&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# CrewAI&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;agent &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;Agent&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;role&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;researcher&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;max_iter&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;15&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt;  &lt;/span&gt;&lt;span&gt;# Maximum iterations per task&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;llm&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;llm&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# Custom loop&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;MAX_STEPS&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;30&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;for&lt;/span&gt;&lt;span&gt; step &lt;/span&gt;&lt;span&gt;in&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;range&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;MAX_STEPS&lt;/span&gt;&lt;span&gt;):&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;result &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;agent_step&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;messages&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;if&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;is_done&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;result&lt;/span&gt;&lt;span&gt;):&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;break&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;else&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;log.&lt;/span&gt;&lt;span&gt;warning&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;Agent hit max steps without completing task&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;A max iteration cap is your first line of defense. The budget cap is your second.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;subscriptions-as-a-natural-budget-mechanism&quot;&gt;Subscriptions as a natural budget mechanism&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Pay-per-token pricing gives agents an open-ended credit line. Subscriptions invert this — you decide upfront how much to spend, and the platform enforces it.&lt;/p&gt;
&lt;p&gt;With a subscription plan on cheapestinference:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Each key gets a &lt;strong&gt;dollar budget that resets every 5 hours&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;When budget runs out → &lt;code dir=&quot;auto&quot;&gt;429&lt;/code&gt;, never overage charges&lt;/li&gt;
&lt;li&gt;You create &lt;strong&gt;unlimited keys&lt;/strong&gt; — one per agent, each with its own budget&lt;/li&gt;
&lt;li&gt;When your subscription expires, &lt;strong&gt;all keys are automatically revoked&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This means your worst case is bounded. A runaway agent burns through one 5-hour budget window and stops. It doesn’t burn through your monthly allocation, because the next window starts fresh with a new budget.&lt;/p&gt;
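&lt;p&gt;That bound is easy to put a number on. With hypothetical figures (a $0.50 cap per key per 5-hour window), the absolute worst case is fixed no matter how badly an agent misbehaves:&lt;/p&gt;

```python
# Back-of-envelope worst case under windowed budgets. The per-window cap
# is a hypothetical figure; plug in your own plan's numbers.
WINDOW_HOURS = 5
BUDGET_PER_WINDOW_USD = 0.50

windows_per_day = 24 / WINDOW_HOURS                      # 4.8 windows
worst_case_day_usd = BUDGET_PER_WINDOW_USD * windows_per_day
worst_case_month_usd = worst_case_day_usd * 30
print(f"${worst_case_day_usd:.2f}/day, ${worst_case_month_usd:.2f}/month at worst")
```

&lt;p&gt;Compare that with pay-as-you-go, where the worst case is whatever the runaway loop manages to spend before someone notices.&lt;/p&gt;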
&lt;p&gt;For teams running multiple agents, the per-key isolation matters. Your research agent, your coding agent, and your monitoring agent each have independent budgets. If the research agent enters a loop, the others keep working.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-budget-stack-defense-in-depth&quot;&gt;The budget stack: defense in depth&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;No single mechanism catches every failure. Stack them:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Layer&lt;/th&gt;&lt;th&gt;What it catches&lt;/th&gt;&lt;th&gt;When it triggers&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Max iterations&lt;/strong&gt; (code)&lt;/td&gt;&lt;td&gt;Runaway tool loops&lt;/td&gt;&lt;td&gt;After N steps&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Retry cap&lt;/strong&gt; (code)&lt;/td&gt;&lt;td&gt;Repeated failed calls&lt;/td&gt;&lt;td&gt;After N consecutive errors&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Budget cap&lt;/strong&gt; (platform)&lt;/td&gt;&lt;td&gt;All spending, any cause&lt;/td&gt;&lt;td&gt;When dollar limit is reached&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Subscription expiry&lt;/strong&gt; (platform)&lt;/td&gt;&lt;td&gt;Abandoned agents&lt;/td&gt;&lt;td&gt;When subscription period ends&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;The first two are your responsibility as the developer. The last two are the platform’s. Together, they ensure that even if your code has a bug you haven’t found yet, the damage is capped.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;what-a-budgeted-agent-looks-like-in-practice&quot;&gt;What a budgeted agent looks like in practice&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Here’s a complete pattern for a production agent:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;from&lt;/span&gt;&lt;span&gt; openai &lt;/span&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; OpenAI, RateLimitError&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; requests&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; time&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;client &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;OpenAI&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;base_url&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;https://api.cheapestinference.com/v1&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;api_key&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;sk_agent_research&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;MAX_STEPS&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;30&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;BUDGET_WARN_THRESHOLD&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;0.02&lt;/span&gt;&lt;span&gt;  &lt;/span&gt;&lt;span&gt;# Switch models when &amp;#x3C; $0.02 left&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;RETRY_LIMIT&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;3&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;def&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;run_agent&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;task&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;str&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;messages &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;{&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;role&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;system&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;content&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;You are a research agent. ...&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;},&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;{&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;role&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;user&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;content&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: task}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;model &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;deepseek/deepseek-chat-v3-0324&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;consecutive_errors &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;0&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;for&lt;/span&gt;&lt;span&gt; step &lt;/span&gt;&lt;span&gt;in&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;range&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;MAX_STEPS&lt;/span&gt;&lt;span&gt;):&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;# Check budget every 5 steps&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;if&lt;/span&gt;&lt;span&gt; step &lt;/span&gt;&lt;span&gt;%&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;5&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;==&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;0&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;and&lt;/span&gt;&lt;span&gt; step &lt;/span&gt;&lt;span&gt;&gt;&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;0&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;budget &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;check_budget&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;sk_agent_research&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;remaining &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; budget[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;limit&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;] &lt;/span&gt;&lt;span&gt;-&lt;/span&gt;&lt;span&gt; budget[&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;spent&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;            &lt;/span&gt;&lt;span&gt;if&lt;/span&gt;&lt;span&gt; remaining &lt;/span&gt;&lt;span&gt;&amp;#x3C;&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;BUDGET_WARN_THRESHOLD&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;                &lt;/span&gt;&lt;/span&gt;&lt;span&gt;model &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;meta-llama/llama-3.1-8b-instruct&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;try&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;response &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; client.chat.completions.&lt;/span&gt;&lt;span&gt;create&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                &lt;/span&gt;&lt;span&gt;model&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;model&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                &lt;/span&gt;&lt;span&gt;messages&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;messages&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;            &lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;consecutive_errors &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;0&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;content &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; response.choices[&lt;/span&gt;&lt;span&gt;0&lt;/span&gt;&lt;span&gt;].message.content&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;messages.&lt;/span&gt;&lt;span&gt;append&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;{&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;role&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;assistant&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;content&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: content}&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;            &lt;/span&gt;&lt;span&gt;if&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;is_task_complete&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;content&lt;/span&gt;&lt;span&gt;):&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                &lt;/span&gt;&lt;span&gt;return&lt;/span&gt;&lt;span&gt; content&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;except&lt;/span&gt;&lt;span&gt; RateLimitError:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;            &lt;/span&gt;&lt;span&gt;save_agent_state&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;messages&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt; step&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;            &lt;/span&gt;&lt;span&gt;return&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;f&lt;/span&gt;&lt;span&gt;&quot;Budget limit reached at step &lt;/span&gt;&lt;span&gt;{&lt;/span&gt;&lt;span&gt;step&lt;/span&gt;&lt;span&gt;}&lt;/span&gt;&lt;span&gt;. State saved.&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;except&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;Exception&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;as&lt;/span&gt;&lt;span&gt; e:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;consecutive_errors &lt;/span&gt;&lt;span&gt;+=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;1&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;            &lt;/span&gt;&lt;span&gt;if&lt;/span&gt;&lt;span&gt; consecutive_errors &lt;/span&gt;&lt;span&gt;&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;RETRY_LIMIT&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                &lt;/span&gt;&lt;span&gt;return&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;f&lt;/span&gt;&lt;span&gt;&quot;Aborting after &lt;/span&gt;&lt;span&gt;{RETRY_LIMIT}&lt;/span&gt;&lt;span&gt; consecutive errors: &lt;/span&gt;&lt;span&gt;{&lt;/span&gt;&lt;span&gt;e&lt;/span&gt;&lt;span&gt;}&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;return&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;Max steps reached. Partial results saved.&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;Three layers of protection:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Max 30 steps&lt;/strong&gt; — prevents infinite loops&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;3 consecutive error retry cap&lt;/strong&gt; — prevents retry storms&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Budget check every 5 steps&lt;/strong&gt; — degrades to cheaper model before hitting the hard cap&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If all three fail, the platform’s budget cap catches it anyway.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-bottom-line&quot;&gt;The bottom line&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Running an AI agent without a budget is like running a process without memory limits — it works fine until it doesn’t, and then the damage is proportional to how long nobody noticed.&lt;/p&gt;
&lt;p&gt;Budget caps don’t limit what your agent can do. They limit what it can do &lt;em&gt;wrong&lt;/em&gt;. A properly budgeted agent completes the same tasks — it just can’t bankrupt you in the process.&lt;/p&gt;
&lt;p&gt;Set a budget. Set a retry cap. Set a max iteration count. Then let your agent run.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;We serve 70+ open-source models with per-key budget caps that reset every 5 hours. One subscription, unlimited keys, and the guarantee that a bad loop never turns into a bad bill. &lt;a href=&quot;https://cheapestinference.com/register&quot;&gt;Get started&lt;/a&gt; or &lt;a href=&quot;https://cheapestinference.com/platforms&quot;&gt;see how per-key plans work&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</content:encoded></item><item><title>Building a multi-model architecture: route requests to the right LLM</title><link>https://docs.cheapestinference.com/blog/multi-model-architecture/</link><guid isPermaLink="true">https://docs.cheapestinference.com/blog/multi-model-architecture/</guid><pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Using one model for everything is the simplest architecture. It’s also the most wasteful. A 685B-parameter reasoning model answering “what’s the weather?” is like hiring a PhD to sort mail.&lt;/p&gt;
&lt;p&gt;This guide covers how to use a small, fast model to classify incoming requests and route them to the right specialist. The result: lower latency, lower cost, and often better quality — because each model handles what it’s actually good at.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-problem-with-single-model-architectures&quot;&gt;The problem with single-model architectures&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Most applications start with one model:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;User request --&gt; Large Model --&gt; Response&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;This works, but every request — simple or complex — pays the same latency and cost penalty. When 60% of your traffic is simple classification, FAQ, or extraction, you’re burning expensive compute on tasks a small model handles equally well.&lt;/p&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Llama 3.1 8B&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~200 t/s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek V3.2&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~60 t/s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek R1&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~30 t/s&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The gap between Llama 8B and R1 is nearly 7x in throughput. Routing simple requests to the small model saves that difference on every request.&lt;/p&gt;
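&lt;p&gt;To make the gap concrete, here is the generation time for a 200-token reply at the throughputs above (TTFT excluded; the 200-token reply length is illustrative):&lt;/p&gt;

```python
# Generation time for a 200-token reply at the quoted throughputs.
# TTFT is excluded; 200 tokens is an illustrative reply length.
OUTPUT_TOKENS = 200

for model, tokens_per_s in [('Llama 3.1 8B', 200), ('DeepSeek V3.2', 60), ('DeepSeek R1', 30)]:
    print(f'{model}: {OUTPUT_TOKENS / tokens_per_s:.1f}s')
# Llama 3.1 8B: 1.0s / DeepSeek V3.2: 3.3s / DeepSeek R1: 6.7s
```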
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-multi-model-architecture&quot;&gt;The multi-model architecture&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;User request --&gt; Router (Llama 8B) --&gt; classify intent&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                                          |&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                  +-----------+-----------+-----------+&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                  |           |           |           |&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;               simple      general    reasoning     code&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                  |           |           |           |&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;            Llama 3.1 8B  DeepSeek   DeepSeek R1   Qwen3&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                            V3.2                    Coder&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                  |           |           |           |&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                  +-----------+-----+-----+-----------+&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                                    |&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                                 Response&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;Two stages:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Classify&lt;/strong&gt; — The router model reads the user’s message and outputs a category. This takes ~0.2 seconds with Llama 8B.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Route&lt;/strong&gt; — Based on the category, forward the request to the appropriate specialist model.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The router adds minimal overhead (~200ms) but saves significant compute by keeping simple requests away from expensive models.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;step-1-classify-with-llama-31-8b&quot;&gt;Step 1: Classify with Llama 3.1 8B&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Llama 3.1 8B is the router. At ~200 t/s output speed, ~0.2s TTFT, and $0.02/M input tokens, the classification step costs almost nothing and completes before the user notices.&lt;/p&gt;
&lt;p&gt;The classification prompt is simple — you want a single-word category, not a conversation:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;from&lt;/span&gt;&lt;span&gt; openai &lt;/span&gt;&lt;span&gt;import&lt;/span&gt;&lt;span&gt; OpenAI&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;client &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;OpenAI&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;base_url&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;https://api.cheapestinference.com/v1&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;api_key&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;your-api-key&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;def&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;classify_request&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;user_message&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;str&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;span&gt; -&gt; &lt;/span&gt;&lt;span&gt;str&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span&gt;Classify a user message into a routing category.&lt;/span&gt;&lt;span&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;response &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; client.chat.completions.&lt;/span&gt;&lt;span&gt;create&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;model&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;meta-llama/llama-3.1-8b-instruct&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;messages&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;{&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;role&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;system&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;content&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: 
(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;Classify the user&apos;s message into exactly one category. &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;Respond with only the category name, nothing else.&lt;/span&gt;&lt;span&gt;\n\n&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;Categories:&lt;/span&gt;&lt;span&gt;\n&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;- simple: greetings, FAQ, simple factual questions&lt;/span&gt;&lt;span&gt;\n&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;- general: complex questions, analysis, writing, summarization&lt;/span&gt;&lt;span&gt;\n&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;- reasoning: math, logic, multi-step problems, science&lt;/span&gt;&lt;span&gt;\n&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;- code: code generation, debugging, refactoring, technical implementation&lt;/span&gt;&lt;span&gt;\n&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;                    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;- agent: tasks requiring tool use, web search, or multi-step 
execution&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;                &lt;/span&gt;&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;},&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;{&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;role&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;user&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;content&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: user_message}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;],&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;max_tokens&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;10&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;temperature&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;0&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;category &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; response.choices[&lt;/span&gt;&lt;span&gt;0&lt;/span&gt;&lt;span&gt;].message.content.&lt;/span&gt;&lt;span&gt;strip&lt;/span&gt;&lt;span&gt;().&lt;/span&gt;&lt;span&gt;lower&lt;/span&gt;&lt;span&gt;()&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;# Default to general if classification is 
unclear&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;valid &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; {&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;simple&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;general&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;reasoning&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;code&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;agent&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;return&lt;/span&gt;&lt;span&gt; category &lt;/span&gt;&lt;span&gt;if&lt;/span&gt;&lt;span&gt; category &lt;/span&gt;&lt;span&gt;in&lt;/span&gt;&lt;span&gt; valid &lt;/span&gt;&lt;span&gt;else&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;general&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;The key details: &lt;code dir=&quot;auto&quot;&gt;max_tokens=10&lt;/code&gt; because we only need one word. &lt;code dir=&quot;auto&quot;&gt;temperature=0&lt;/code&gt; for deterministic routing. The system prompt is explicit about format — no preamble, just the category.&lt;/p&gt;
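&lt;p&gt;Even with an explicit prompt, small routers occasionally reply with extra words (“Category: code.”). A slightly more defensive normalizer (a sketch with hypothetical helper names, extending the validation in &lt;code&gt;classify_request&lt;/code&gt;) scans the reply for the first valid category instead of requiring an exact match:&lt;/p&gt;

```python
# Defensive normalization of the router's reply. VALID_CATEGORIES and
# normalize_category are hypothetical helper names for this sketch.
VALID_CATEGORIES = {'simple', 'general', 'reasoning', 'code', 'agent'}

def normalize_category(raw_reply):
    '''Return the first valid category mentioned in the reply, else general.'''
    cleaned = raw_reply.lower().replace(':', ' ').replace('.', ' ')
    for word in cleaned.split():
        if word in VALID_CATEGORIES:
            return word
    return 'general'
```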
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;step-2-route-to-the-specialist&quot;&gt;Step 2: Route to the specialist&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Each category maps to a model optimized for that task:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;# Model routing table&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;ROUTE_TABLE&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; {&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;simple&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;:    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;meta-llama/llama-3.1-8b-instruct&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;general&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;:   &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;deepseek/deepseek-chat-v3-0324&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;reasoning&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;deepseek/deepseek-reasoner&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;code&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;:      &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;qwen/qwen3-coder&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;agent&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;:     
&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;moonshotai/kimi-k2-5&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;def&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;route_request&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;user_message&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;str&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;conversation_history&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;list&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;span&gt; -&gt; &lt;/span&gt;&lt;span&gt;str&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span&gt;Classify and route a request to the appropriate model.&lt;/span&gt;&lt;span&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;category &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;classify_request&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;user_message&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;model &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;ROUTE_TABLE&lt;/span&gt;&lt;span&gt;[category]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;response &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; client.chat.completions.&lt;/span&gt;&lt;span&gt;create&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;model&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;model&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;messages&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;conversation_history &lt;/span&gt;&lt;span&gt;+&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;{&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;role&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;user&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;content&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: user_message}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;],&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;stream&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;True&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;# Stream the response back&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;full_response &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;&quot;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;for&lt;/span&gt;&lt;span&gt; chunk &lt;/span&gt;&lt;span&gt;in&lt;/span&gt;&lt;span&gt; response:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;if&lt;/span&gt;&lt;span&gt; chunk.choices[&lt;/span&gt;&lt;span&gt;0&lt;/span&gt;&lt;span&gt;].delta.content:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;content &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; chunk.choices[&lt;/span&gt;&lt;span&gt;0&lt;/span&gt;&lt;span&gt;].delta.content&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;            &lt;/span&gt;&lt;/span&gt;&lt;span&gt;full_response &lt;/span&gt;&lt;span&gt;+=&lt;/span&gt;&lt;span&gt; content&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;            &lt;/span&gt;&lt;span&gt;print&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;content&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;end&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;&quot;&quot;&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;flush&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;True&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;return&lt;/span&gt;&lt;span&gt; full_response&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;Notice that simple requests route back to Llama 8B — the same model that did the classification. For simple queries the overhead stays small: the answer is just a second short call to the same fast, cheap model, often over the same warm connection, rather than a hop to an expensive specialist.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;step-3-handle-edge-cases&quot;&gt;Step 3: Handle edge cases&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;The basic router works for most traffic, but production systems need a few refinements:&lt;/p&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;def&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;route_request_production&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;user_message&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;str&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;conversation_history&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;list&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;force_model&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;str&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;None&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;)&lt;/span&gt;&lt;span&gt; -&gt; tuple[&lt;/span&gt;&lt;span&gt;str&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;str&lt;/span&gt;&lt;span&gt;]:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span&gt;Production router with overrides and fallback.&lt;/span&gt;&lt;span&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;# Allow explicit model override (for power users or testing)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;if&lt;/span&gt;&lt;span&gt; force_model:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;model &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; force_model&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;category &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;override&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;else&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;category &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;classify_request&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;span&gt;user_message&lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;model &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;ROUTE_TABLE&lt;/span&gt;&lt;span&gt;[category]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;try&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;response &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; client.chat.completions.&lt;/span&gt;&lt;span&gt;create&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;            &lt;/span&gt;&lt;span&gt;model&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;model&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;            &lt;/span&gt;&lt;span&gt;messages&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;conversation_history &lt;/span&gt;&lt;span&gt;+&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;                &lt;/span&gt;&lt;/span&gt;&lt;span&gt;{&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;role&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;user&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;content&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: user_message}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;            &lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;return&lt;/span&gt;&lt;span&gt; response.choices[&lt;/span&gt;&lt;span&gt;0&lt;/span&gt;&lt;span&gt;].message.content, category&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;except&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;Exception&lt;/span&gt;&lt;span&gt;:&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;# Fallback to V3.2 if the specialist is unavailable&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;fallback &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;deepseek/deepseek-chat-v3-0324&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span&gt;response &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; client.chat.completions.&lt;/span&gt;&lt;span&gt;create&lt;/span&gt;&lt;span&gt;(&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;            &lt;/span&gt;&lt;span&gt;model&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;fallback&lt;/span&gt;&lt;span&gt;,&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;            &lt;/span&gt;&lt;span&gt;messages&lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt;conversation_history &lt;/span&gt;&lt;span&gt;+&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;[&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;                &lt;/span&gt;&lt;/span&gt;&lt;span&gt;{&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;role&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;user&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;, &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;content&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: user_message}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;            
&lt;/span&gt;&lt;span&gt;]&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;        &lt;/span&gt;&lt;span&gt;return&lt;/span&gt;&lt;span&gt; response.choices[&lt;/span&gt;&lt;span&gt;0&lt;/span&gt;&lt;span&gt;].message.content, &lt;/span&gt;&lt;span&gt;f&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;{&lt;/span&gt;&lt;span&gt;category&lt;/span&gt;&lt;span&gt;}&lt;/span&gt;&lt;span&gt;-&gt;fallback&quot;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;Three patterns worth noting:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Force model&lt;/strong&gt; — Let callers bypass routing when they know what they need.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fallback&lt;/strong&gt; — If a specialist model is down, fall back to V3.2. It handles everything reasonably well.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Return the category&lt;/strong&gt; — Log which route each request takes. You’ll need this data to tune the system.&lt;/li&gt;
&lt;/ol&gt;
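&lt;p&gt;Pattern 3 pays off once you aggregate the categories. A minimal sketch of route logging with &lt;code&gt;collections.Counter&lt;/code&gt; (the helper names are hypothetical):&lt;/p&gt;

```python
# Tally routing decisions so the table can be tuned against real traffic.
# record_route and route_share are hypothetical helper names.
from collections import Counter

route_counts = Counter()

def record_route(category):
    '''Call once per request with the category the router returned.'''
    route_counts[category] += 1

def route_share(category):
    '''Fraction of traffic a route has received so far.'''
    total = sum(route_counts.values())
    return route_counts[category] / total if total else 0.0
```

&lt;p&gt;After a day of traffic, compare the observed shares against the estimates you planned around: if “general” dominates, your router prompt may be under-classifying “simple”.&lt;/p&gt;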
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;cost-and-latency-comparison&quot;&gt;Cost and latency comparison&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Consider a workload of 1,000 requests with this distribution: 600 simple, 300 general, 70 reasoning, 30 code. Average 500 input tokens, 200 output tokens per request.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;single-model-approach-everything-on-v32&quot;&gt;Single-model approach (everything on V3.2)&lt;/h3&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Avg latency&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~4.5s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;All 1000 reqs&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;V3.2 only&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Every request waits for V3.2’s ~1.2s TTFT plus generation time at ~60 t/s. Simple questions get the same treatment as complex analysis.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;multi-model-approach-routed&quot;&gt;Multi-model approach (routed)&lt;/h3&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Simple (600)&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~1.2s (8B)&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;General (300)&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~4.7s (V3.2)&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Reasoning (70)&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~9.0s (R1)&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Code (30)&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~3.5s (Coder)&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The weighted average latency drops to approximately &lt;strong&gt;2.9s&lt;/strong&gt; — a roughly 36% reduction. The 600 simple requests finish in ~1.2s instead of ~4.5s. That’s a 3.7x improvement for the majority of your traffic.&lt;/p&gt;
&lt;p&gt;The 70 reasoning requests are &lt;em&gt;slower&lt;/em&gt; individually (~9s vs ~4.5s) because R1 generates chain-of-thought tokens. But the quality on those specific requests is significantly better — R1 scores 50.2% on HLE versus V3.2’s 39.3%.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;You get faster averages &lt;em&gt;and&lt;/em&gt; better quality on the hard tail.&lt;/strong&gt;&lt;/p&gt;
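&lt;p&gt;The weighted average is easy to verify. A minimal sketch using the per-route counts and approximate latencies from the tables above:&lt;/p&gt;

```python
# Per-route request counts and approximate latencies from the tables above.
routes = {
    "simple":    (600, 1.2),  # Llama 8B
    "general":   (300, 4.7),  # DeepSeek V3.2
    "reasoning": (70, 9.0),   # DeepSeek R1
    "code":      (30, 3.5),   # Qwen3 Coder
}

total = sum(count for count, _ in routes.values())
weighted_avg = sum(count * latency for count, latency in routes.values()) / total
print(round(weighted_avg, 2), "s")  # average across all 1,000 requests
```

&lt;p&gt;Plugging in your own traffic distribution tells you whether routing is worth building before you write any routing code.&lt;/p&gt;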
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;real-example-a-support-chatbot&quot;&gt;Real example: a support chatbot&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;A customer support chatbot receives three types of requests:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;FAQ&lt;/strong&gt; (60%) — “What are your business hours?” / “How do I reset my password?”&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Complex support&lt;/strong&gt; (30%) — “I was charged twice for order #12345, can you investigate?”&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Technical issues&lt;/strong&gt; (10%) — “Your API returns 500 when I send multipart form data with UTF-8 filenames”&lt;/li&gt;
&lt;/ol&gt;
&lt;div&gt;&lt;h3 id=&quot;without-routing&quot;&gt;Without routing&lt;/h3&gt;&lt;/div&gt;
&lt;p&gt;All requests go to DeepSeek V3.2. FAQs get correct answers but with unnecessary latency. Technical issues get decent answers but miss edge cases that a code-specialized model would catch.&lt;/p&gt;
&lt;div&gt;&lt;h3 id=&quot;with-routing&quot;&gt;With routing&lt;/h3&gt;&lt;/div&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;SUPPORT_ROUTES&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; {&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;simple&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;:    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;meta-llama/llama-3.1-8b-instruct&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,  &lt;/span&gt;&lt;span&gt;# FAQ, greetings&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;general&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;:   &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;deepseek/deepseek-chat-v3-0324&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,     &lt;/span&gt;&lt;span&gt;# Complex support&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;reasoning&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;: &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;deepseek/deepseek-chat-v3-0324&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,     &lt;/span&gt;&lt;span&gt;# Investigations&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;code&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;:      &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;qwen/qwen3-coder&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,                   &lt;/span&gt;&lt;span&gt;# Technical issues&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;    &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;agent&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;:     &lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;moonshotai/kimi-k2-5&lt;/span&gt;&lt;span&gt;&quot;&lt;/span&gt;&lt;span&gt;,               &lt;/span&gt;&lt;span&gt;# Multi-step resolution&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;p&gt;FAQs resolve in ~1 second via Llama 8B. Complex support issues get V3.2’s full analytical capability. Technical problems route to Qwen3 Coder, which understands the code context better. If a support issue requires looking up order data via API, it routes to Kimi K2.5 for tool-assisted resolution.&lt;/p&gt;
&lt;p&gt;The classification step adds ~200ms. For the 60% of requests that drop from ~4.5s to ~1.2s, that’s an invisible cost.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;when-not-to-use-multi-model-routing&quot;&gt;When NOT to use multi-model routing&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Routing adds complexity. Skip it when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;All your requests are the same type.&lt;/strong&gt; If you’re building a code editor, just use Qwen3 Coder. No routing needed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;You have fewer than 100 requests/day.&lt;/strong&gt; The cost savings don’t justify the engineering overhead at low volume.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency doesn’t matter.&lt;/strong&gt; For batch processing or async workloads, a single capable model is simpler.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Your classification accuracy is low.&lt;/strong&gt; If the router misclassifies frequently, you get worse results than a single good model. Test the classifier on real traffic before deploying.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The sweet spot is high-volume applications with diverse request types — chatbots, API gateways, developer tools, and customer-facing products where response time directly affects user experience.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;implementation-checklist&quot;&gt;Implementation checklist&lt;/h2&gt;&lt;/div&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Log your traffic.&lt;/strong&gt; Before building a router, understand your request distribution. What percentage is simple? Complex? Code?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Start with two tiers.&lt;/strong&gt; Llama 8B for simple, V3.2 for everything else. Add specialists only when you have data showing they help.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Measure classification accuracy.&lt;/strong&gt; Sample 100 requests, manually label them, compare against the router’s output. Target &gt;90% accuracy.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Add fallback.&lt;/strong&gt; Every specialist route should fall back to V3.2 if the specialist is unavailable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitor per-route metrics.&lt;/strong&gt; Track latency, cost, and quality per category. This tells you where to optimize next.&lt;/li&gt;
&lt;/ol&gt;
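&lt;p&gt;The accuracy check in step 3 takes a few lines. The labeled pairs here are illustrative stand-ins for a real 100-request sample:&lt;/p&gt;

```python
# Hand-labeled (request, expected_category) pairs -- illustrative stand-ins
# for 100 real sampled requests.
labeled = [
    ("How do I reset my password?", "simple"),
    ("I was charged twice for order #12345", "general"),
    ("Your API returns 500 on multipart uploads", "code"),
]

# Router output for the same requests, in the same order.
predicted = ["simple", "general", "code"]

correct = sum(1 for (_, expected), got in zip(labeled, predicted) if expected == got)
accuracy = correct / len(labeled)
print(f"classification accuracy: {accuracy:.0%}")
```

&lt;p&gt;Re-run this on a fresh sample after every change to the router prompt or route definitions.&lt;/p&gt;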
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;All models in this guide are available through a single OpenAI-compatible API with no configuration changes between models. If you’re building a platform that needs LLM access for your users, &lt;a href=&quot;https://cheapestinference.com/platforms&quot;&gt;see how per-key plans work&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sources:&lt;/strong&gt; &lt;a href=&quot;https://artificialanalysis.ai/leaderboards/models&quot;&gt;Artificial Analysis Leaderboard&lt;/a&gt; · &lt;a href=&quot;https://artificialanalysis.ai/models/deepseek-v3-2&quot;&gt;DeepSeek V3.2&lt;/a&gt; · &lt;a href=&quot;https://artificialanalysis.ai/evaluations/humanitys-last-exam&quot;&gt;HLE Leaderboard&lt;/a&gt; · &lt;a href=&quot;https://kimi-k25.com/blog/kimi-k2-5-benchmark&quot;&gt;Kimi K2.5 Benchmarks&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>How to choose the right open-source model for your task</title><link>https://docs.cheapestinference.com/blog/choosing-the-right-open-source-model/</link><guid isPermaLink="true">https://docs.cheapestinference.com/blog/choosing-the-right-open-source-model/</guid><pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Most teams default to the biggest model available and call it a day. That works — until latency spikes, costs climb, and you realize an 8B-parameter model would have handled 60% of your requests just fine.&lt;/p&gt;
&lt;p&gt;This guide maps common use cases to specific models, with real throughput numbers from our infrastructure. No theory — just which model to pick and why.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;quick-decision-table&quot;&gt;Quick decision table&lt;/h2&gt;&lt;/div&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Use case&lt;/th&gt;&lt;th&gt;Model&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;General chat / assistants&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;DeepSeek V3.2&lt;/td&gt;&lt;td&gt;Best all-rounder. 85% MMLU-Pro, 73% SWE-bench, 60 t/s.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Complex reasoning&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;DeepSeek R1&lt;/td&gt;&lt;td&gt;50.2% on Humanity’s Last Exam. Chain-of-thought built in.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Code generation&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Qwen3 Coder&lt;/td&gt;&lt;td&gt;Purpose-built for code. Strong on completions, refactoring, and debugging.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Agentic workflows&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Kimi K2.5&lt;/td&gt;&lt;td&gt;334 t/s output, native tool use, 50.2% HLE with tools. Built for agents.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Vision / multimodal&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Llama 4 Scout&lt;/td&gt;&lt;td&gt;17B active params, 109B total, native image understanding.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Fast classification&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Llama 3.1 8B&lt;/td&gt;&lt;td&gt;~200 t/s, 0.2s TTFT. Small enough for routing, tagging, extraction.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;General (budget)&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;GLM 4.7 Flash&lt;/td&gt;&lt;td&gt;Fast inference, competitive quality. Good when V3.2 is overkill.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Long context chat&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;MiniMax M2.5&lt;/td&gt;&lt;td&gt;Native long-context support. Handles large documents well.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Large general + reasoning&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Qwen3 235B&lt;/td&gt;&lt;td&gt;235B MoE. Strong across benchmarks when you need maximum capability.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Embeddings&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;BGE Large&lt;/td&gt;&lt;td&gt;MTEB-tested. Solid retrieval quality for RAG pipelines.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;general-chat-and-assistants&quot;&gt;General chat and assistants&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Pick: DeepSeek V3.2&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;DeepSeek V3.2 is the default choice for most workloads. It scores 85% on MMLU-Pro (beating Claude Opus 4.6’s 82%), 73% on SWE-bench Verified, and runs at ~60 tokens/second on our infrastructure.&lt;/p&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Kimi K2.5&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;334 t/s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Llama 3.1 8B&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~200 t/s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek V3.2&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~60 t/s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek R1&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~30 t/s&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Good at:&lt;/strong&gt; Broad knowledge, instruction following, multilingual, structured output.
&lt;strong&gt;Not ideal for:&lt;/strong&gt; Tasks that need step-by-step reasoning chains (use R1) or sub-100ms latency (use Llama 8B).
&lt;strong&gt;Pick over alternatives when:&lt;/strong&gt; You need a reliable general-purpose model that handles most tasks without specialization.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;complex-reasoning&quot;&gt;Complex reasoning&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Pick: DeepSeek R1&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;R1 is a reasoning-first model. It produces explicit chain-of-thought tokens before its final answer. On Humanity’s Last Exam — a benchmark built from questions chosen to stump current frontier models — R1 scores 50.2%, beating GPT-5.4 (41.6%) and Claude Opus 4.6 (40%).&lt;/p&gt;
&lt;p&gt;The tradeoff is speed. At ~30 t/s, R1 is the slowest model in our lineup. That’s expected — it spends a large share of its tokens on intermediate reasoning before the final answer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Good at:&lt;/strong&gt; Math, science, logic puzzles, multi-step problems, anything where “thinking” helps.
&lt;strong&gt;Not ideal for:&lt;/strong&gt; Simple Q&amp;#x26;A, classification, or latency-sensitive applications.
&lt;strong&gt;Pick over alternatives when:&lt;/strong&gt; The task requires multi-step deduction. If a human would need to “think through it,” R1 will outperform faster models.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;code-generation&quot;&gt;Code generation&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Pick: Qwen3 Coder&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Qwen3 Coder is purpose-built for software engineering tasks — code completion, refactoring, debugging, and generation across languages. It’s trained specifically on code-heavy data and optimized for developer workflows.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Good at:&lt;/strong&gt; Code completion, bug fixing, refactoring, test generation, multi-file edits.
&lt;strong&gt;Not ideal for:&lt;/strong&gt; General conversation or non-code tasks (use V3.2).
&lt;strong&gt;Pick over alternatives when:&lt;/strong&gt; Code quality matters more than general knowledge. For mixed code-and-chat workflows, V3.2 or Kimi K2.5 may be more versatile.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;agentic-workflows&quot;&gt;Agentic workflows&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Pick: Kimi K2.5&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Kimi K2.5 was designed for agentic use. It has native tool-calling support, runs at 334 t/s (the fastest model we serve), and scores 50.2% on HLE when using tools — matching R1’s reasoning-only score.&lt;/p&gt;
&lt;p&gt;The speed matters for agents. Each tool call is a round trip: the model generates a function call, the tool executes, the result goes back to the model. At 334 t/s and 0.31s TTFT, Kimi completes multi-step agent loops in seconds where slower models take minutes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Good at:&lt;/strong&gt; Tool use, function calling, multi-step task execution, fast iteration loops.
&lt;strong&gt;Not ideal for:&lt;/strong&gt; Pure reasoning without tools (R1 is better). Code-only tasks (Qwen3 Coder is more specialized).
&lt;strong&gt;Pick over alternatives when:&lt;/strong&gt; Your application involves tool calling, API interactions, or multi-step agent orchestration where speed compounds.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;vision-and-multimodal&quot;&gt;Vision and multimodal&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Pick: Llama 4 Scout&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Llama 4 Scout is Meta’s mixture-of-experts multimodal model — 109B total parameters with 17B active per token. It handles text and images natively, making it the pick for tasks that require visual understanding alongside language.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Good at:&lt;/strong&gt; Image description, visual Q&amp;#x26;A, document understanding, chart interpretation.
&lt;strong&gt;Not ideal for:&lt;/strong&gt; Text-only tasks where you’re paying for vision capability you don’t use (use V3.2).
&lt;strong&gt;Pick over alternatives when:&lt;/strong&gt; Your input includes images. For text-only workloads, other models are more efficient.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;fast-classification-and-routing&quot;&gt;Fast classification and routing&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Pick: Llama 3.1 8B&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;At 8 billion parameters, Llama 3.1 8B runs at ~200 t/s with approximately 0.2s time to first token. It’s the right choice for tasks where speed matters more than depth: intent classification, sentiment analysis, entity extraction, content filtering, and request routing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Good at:&lt;/strong&gt; Classification, tagging, extraction, routing decisions, simple Q&amp;#x26;A, content moderation.
&lt;strong&gt;Not ideal for:&lt;/strong&gt; Complex reasoning, long-form generation, or tasks requiring deep world knowledge.
&lt;strong&gt;Pick over alternatives when:&lt;/strong&gt; You need results in under a second and the task is well-defined. Also ideal as the router model in a multi-model architecture.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;budget-general-use&quot;&gt;Budget general use&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Pick: GLM 4.7 Flash&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;GLM 4.7 Flash delivers competitive quality at fast inference speeds. When DeepSeek V3.2 is more capability than you need — simple conversations, basic summarization, FAQ bots — GLM 4.7 Flash gets the job done efficiently.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Good at:&lt;/strong&gt; Simple chat, summarization, translation, basic Q&amp;#x26;A.
&lt;strong&gt;Not ideal for:&lt;/strong&gt; Complex reasoning or tasks where benchmark-leading quality matters.
&lt;strong&gt;Pick over alternatives when:&lt;/strong&gt; You want good-enough quality with better speed and lower cost than the largest models.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;long-context&quot;&gt;Long context&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Pick: MiniMax M2.5&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;MiniMax M2.5 handles long context windows natively. For workloads that involve ingesting large documents, long conversation histories, or extensive codebases, M2.5 maintains coherence across the full context.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Good at:&lt;/strong&gt; Document analysis, long conversations, large-context summarization.
&lt;strong&gt;Not ideal for:&lt;/strong&gt; Short, simple tasks where context length is irrelevant (use Llama 8B or GLM Flash).
&lt;strong&gt;Pick over alternatives when:&lt;/strong&gt; Your input regularly exceeds what smaller-context models handle well.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;maximum-capability&quot;&gt;Maximum capability&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Pick: Qwen3 235B&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Qwen3 235B is a large mixture-of-experts model that competes across the full benchmark spectrum. When you need the highest possible quality and latency is not the primary constraint, Qwen3 235B delivers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Good at:&lt;/strong&gt; Broad capability across reasoning, knowledge, and generation. Strong multilingual support.
&lt;strong&gt;Not ideal for:&lt;/strong&gt; Latency-sensitive applications (large model, slower inference).
&lt;strong&gt;Pick over alternatives when:&lt;/strong&gt; You need top-tier quality and can tolerate higher latency. Good for batch processing and offline tasks.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;embeddings&quot;&gt;Embeddings&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Pick: BGE Large&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;BGE Large (BAAI General Embedding) is a well-tested embedding model for retrieval-augmented generation. It performs well on MTEB benchmarks and produces dense vectors suitable for semantic search, document retrieval, and clustering.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Good at:&lt;/strong&gt; Semantic search, RAG pipelines, document similarity, clustering.
&lt;strong&gt;Not ideal for:&lt;/strong&gt; Generative tasks (it’s an embedding model, not a chat model).
&lt;strong&gt;Pick over alternatives when:&lt;/strong&gt; You need vector embeddings for search or retrieval. Pair it with a generative model for the full RAG pipeline.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-decision-tree&quot;&gt;The decision tree&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;&lt;figure&gt;&lt;figcaption&gt;&lt;/figcaption&gt;&lt;pre&gt;&lt;code&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;What&apos;s your task?&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;+-- Need to understand images?&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|   YES --&gt; Llama 4 Scout&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;+-- Need step-by-step reasoning? (math, logic, science)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|   YES --&gt; DeepSeek R1 (~30 t/s, but highest reasoning quality)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;+-- Need tool calling / agent loops?&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|   YES --&gt; Kimi K2.5 (334 t/s, native tool use)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;+-- Need code generation / editing?&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|   YES --&gt; Qwen3 Coder (purpose-built for code)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;+-- Need embeddings for search/RAG?&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|   YES --&gt; BGE Large&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;+-- Need sub-200ms response?&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|   YES --&gt; Llama 3.1 8B (~200 t/s, 0.2s TTFT)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;+-- Need long context (large documents)?&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|   YES --&gt; MiniMax M2.5&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;+-- Need maximum quality, latency flexible?&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|   YES --&gt; Qwen3 235B&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;|&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;+-- General purpose, good balance?&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span&gt;YES --&gt; DeepSeek V3.2 (default choice)&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/code&gt;&lt;/pre&gt;&lt;div&gt;&lt;div aria-live=&quot;polite&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/figure&gt;&lt;/div&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-8020-rule&quot;&gt;The 80/20 rule&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;You don’t need ten models to cover most workloads.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Llama 3.1 8B handles 60% of requests.&lt;/strong&gt; Classification, routing, simple Q&amp;#x26;A, extraction, content filtering. Fast and cheap.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DeepSeek V3.2 handles 30%.&lt;/strong&gt; General chat, complex instructions, knowledge-intensive tasks. The reliable all-rounder.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Specialized models handle the last 10%.&lt;/strong&gt; R1 for hard reasoning. Kimi K2.5 for agent loops. Qwen3 Coder for code. BGE Large for embeddings.&lt;/p&gt;
&lt;p&gt;Start with Llama 8B + V3.2. Add specialists only when you have evidence that the general models underperform on specific task categories. Measure first, specialize second.&lt;/p&gt;
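&lt;p&gt;That two-tier starting point can be sketched in a few lines. The keyword heuristic below is a deliberate placeholder (in production the fast classifier, e.g. Llama 8B itself, makes this decision), and the model IDs are illustrative OpenAI-compatible route names:&lt;/p&gt;

```python
# Placeholder two-tier router: FAQ-style markers go to the small model,
# everything else falls through to the general-purpose default.
SIMPLE_MARKERS = ("what", "when", "where", "how do i", "hours", "password")

def pick_model(request):
    text = request.lower()
    if any(marker in text for marker in SIMPLE_MARKERS):
        return "meta-llama/llama-3.1-8b-instruct"  # fast tier
    return "deepseek/deepseek-chat-v3-0324"        # capable default
```

&lt;p&gt;&lt;code&gt;pick_model(&quot;What are your business hours?&quot;)&lt;/code&gt; routes to the 8B tier; anything without an FAQ-style marker falls through to V3.2.&lt;/p&gt;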
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;All models are available through a single OpenAI-compatible API. If you’re building a platform that needs LLM access for your users, &lt;a href=&quot;https://cheapestinference.com/platforms&quot;&gt;see how per-key plans work&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sources:&lt;/strong&gt; &lt;a href=&quot;https://artificialanalysis.ai/leaderboards/models&quot;&gt;Artificial Analysis Leaderboard&lt;/a&gt; · &lt;a href=&quot;https://www.swebench.com/&quot;&gt;SWE-bench Leaderboard&lt;/a&gt; · &lt;a href=&quot;https://kimi-k25.com/blog/kimi-k2-5-benchmark&quot;&gt;Kimi K2.5 Benchmarks&lt;/a&gt; · &lt;a href=&quot;https://artificialanalysis.ai/models/deepseek-v3-2&quot;&gt;DeepSeek V3.2&lt;/a&gt; · &lt;a href=&quot;https://artificialanalysis.ai/evaluations/humanitys-last-exam&quot;&gt;HLE Leaderboard&lt;/a&gt; · &lt;a href=&quot;https://artificialanalysis.ai/evaluations/mmlu-pro&quot;&gt;MMLU-Pro Leaderboard&lt;/a&gt; · &lt;a href=&quot;https://huggingface.co/spaces/mteb/leaderboard&quot;&gt;MTEB Leaderboard&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>Open-source models are production-ready. Here&apos;s the proof.</title><link>https://docs.cheapestinference.com/blog/open-source-models-are-production-ready/</link><guid isPermaLink="true">https://docs.cheapestinference.com/blog/open-source-models-are-production-ready/</guid><pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;There’s a persistent assumption in the industry: open-source models are fine for experimentation, but production workloads need GPT-5 or Claude Opus. We run open-source models in production every day. Here’s what the benchmarks actually say.&lt;/p&gt;
&lt;p&gt;We’re comparing &lt;strong&gt;5 models across 5 metrics&lt;/strong&gt; — the same models in every chart, no cherry-picking:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Open-source (available via our API):&lt;/strong&gt; DeepSeek V3.2, DeepSeek R1, Kimi K2.5
&lt;strong&gt;Proprietary (reference):&lt;/strong&gt; Claude Opus 4.6, GPT-5.4&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;code-quality-swe-bench-verified--resolved&quot;&gt;Code quality: SWE-bench Verified (% resolved)&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Claude Opus 4.6&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;80.8%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;GPT-5.4&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~80.0%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Kimi K2.5&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;76.8%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek V3.2&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;73.0%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek R1&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;57.6%&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Proprietary models lead here. Opus 4.6 and GPT-5.4 are within a point of each other at ~80%. Kimi K2.5 is 4 points behind at 76.8% — competitive but not leading. R1 is a reasoning model, not optimized for code.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;reasoning-humanitys-last-exam&quot;&gt;Reasoning: Humanity’s Last Exam (%)&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Kimi K2.5 *&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;50.2%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek R1&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;50.2%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;GPT-5.4&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;41.6%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Claude Opus 4.6&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;40.0%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek V3.2&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;39.3%&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Open-source wins decisively. R1 hits 50.2% and Kimi K2.5 matches it with tool-use enabled (*without tools: 31.5%). Both beat Opus 4.6 (40%) and GPT-5.4 (41.6%). V3.2 is roughly at Opus level — it’s a general model, not a reasoning specialist.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;*Kimi K2.5’s HLE score uses its agentic mode with tool access. This is how the model is designed to be used in production.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;knowledge-mmlu-pro&quot;&gt;Knowledge: MMLU-Pro (%)&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;GPT-5.4&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;88.5%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Kimi K2.5&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;87.1%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek V3.2&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;85.0%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek R1&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;84.0%&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Claude Opus 4.6&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;82.0%&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;GPT-5.4 leads narrowly at 88.5%, but Kimi K2.5 is 1.4 points behind and all three open-source models beat Opus 4.6. The gap across all 5 models is only 6.5 points — this benchmark is nearly saturated.&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;speed-output-tokens-per-second&quot;&gt;Speed: output tokens per second&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Kimi K2.5&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;334 t/s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;GPT-5.4&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~78 t/s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek V3.2&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~60 t/s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Claude Opus 4.6&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;46 t/s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek R1&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~30 t/s&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Kimi K2.5 at 334 tok/s is in a different league — 4x faster than GPT-5.4, 7x faster than Opus 4.6. R1 is the slowest (expected — it’s a reasoning model producing chain-of-thought tokens).&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;latency-time-to-first-token-seconds&quot;&gt;Latency: time to first token (seconds)&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
  &lt;div&gt;
    &lt;span&gt;Kimi K2.5&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;0.31s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;GPT-5.4&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~0.95s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek V3.2&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;1.18s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;DeepSeek R1&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;~2.0s&lt;/span&gt;
  &lt;/div&gt;
  &lt;div&gt;
    &lt;span&gt;Claude Opus 4.6&lt;/span&gt;
    &lt;div&gt;&lt;/div&gt;
    &lt;span&gt;2.48s&lt;/span&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Lower is better. Kimi K2.5 responds 8x faster than Opus 4.6 and 3x faster than GPT-5.4. Even V3.2 beats both proprietary models. Opus 4.6 is the slowest model in this comparison.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Speed and TTFT measured on our production infrastructure. Claude and GPT-5.4 data from Artificial Analysis.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-full-picture&quot;&gt;The full picture&lt;/h2&gt;&lt;/div&gt;
&lt;div&gt;
&lt;svg viewBox=&quot;-80 0 560 410&quot; xmlns=&quot;http://www.w3.org/2000/svg&quot;&gt;
  &lt;!-- Grid lines --&gt;
  &lt;polygon points=&quot;200,120 266.6,168.4 241.1,246.6 158.9,246.6 133.4,168.4&quot; fill=&quot;none&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/polygon&gt;
  &lt;polygon points=&quot;200,50 333.1,146.7 282.3,303.3 117.7,303.3 66.9,146.7&quot; fill=&quot;none&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/polygon&gt;
  &lt;!-- Axes --&gt;
  &lt;line x1=&quot;200&quot; y1=&quot;190&quot; x2=&quot;200&quot; y2=&quot;50&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/line&gt;
  &lt;line x1=&quot;200&quot; y1=&quot;190&quot; x2=&quot;333.1&quot; y2=&quot;146.7&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/line&gt;
  &lt;line x1=&quot;200&quot; y1=&quot;190&quot; x2=&quot;282.3&quot; y2=&quot;303.3&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/line&gt;
  &lt;line x1=&quot;200&quot; y1=&quot;190&quot; x2=&quot;117.7&quot; y2=&quot;303.3&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/line&gt;
  &lt;line x1=&quot;200&quot; y1=&quot;190&quot; x2=&quot;66.9&quot; y2=&quot;146.7&quot; stroke=&quot;#E8E5DF&quot; stroke-width=&quot;1&quot;&gt;&lt;/line&gt;
  &lt;!-- Kimi K2.5 — indigo --&gt;
  &lt;polygon points=&quot;200,57 333.1,146.7 280.6,301 117.7,303.3 66.9,146.7&quot; fill=&quot;#6366F1&quot; fill-opacity=&quot;0.12&quot; stroke=&quot;#6366F1&quot; stroke-width=&quot;2.5&quot;&gt;&lt;/polygon&gt;
  &lt;!-- DeepSeek V3.2 — teal --&gt;
  &lt;polygon points=&quot;200,64 303.9,156.3 279,298.7 185.2,210.4 120.1,164&quot; fill=&quot;#14B8A6&quot; fill-opacity=&quot;0.08&quot; stroke=&quot;#14B8A6&quot; stroke-width=&quot;2&quot;&gt;&lt;/polygon&gt;
  &lt;!-- DeepSeek R1 — amber --&gt;
  &lt;polygon points=&quot;200,90.6 333.1,146.7 278.2,297.6 192.6,200.2 170.7,180.5&quot; fill=&quot;#F59E0B&quot; fill-opacity=&quot;0.08&quot; stroke=&quot;#F59E0B&quot; stroke-width=&quot;2&quot;&gt;&lt;/polygon&gt;
  &lt;!-- Claude Opus 4.6 — gray --&gt;
  &lt;polygon points=&quot;200,50 306.5,155.4 276.5,295.3 188.5,205.9 200,190&quot; fill=&quot;none&quot; stroke=&quot;#9A9490&quot; stroke-width=&quot;2&quot;&gt;&lt;/polygon&gt;
  &lt;!-- GPT-5.4 — dark gray dashed --&gt;
  &lt;polygon points=&quot;200,51.4 310.5,154.1 282.3,303.3 181.1,216.1 105.5,159.3&quot; fill=&quot;none&quot; stroke=&quot;#6B6560&quot; stroke-width=&quot;1.5&quot; stroke-dasharray=&quot;6 3&quot;&gt;&lt;/polygon&gt;
  &lt;!-- Data points --&gt;
  &lt;circle cx=&quot;200&quot; cy=&quot;57&quot; r=&quot;3.5&quot; fill=&quot;#6366F1&quot;&gt;&lt;/circle&gt;
  &lt;circle cx=&quot;333.1&quot; cy=&quot;146.7&quot; r=&quot;3.5&quot; fill=&quot;#6366F1&quot;&gt;&lt;/circle&gt;
  &lt;circle cx=&quot;280.6&quot; cy=&quot;301&quot; r=&quot;3.5&quot; fill=&quot;#6366F1&quot;&gt;&lt;/circle&gt;
  &lt;circle cx=&quot;117.7&quot; cy=&quot;303.3&quot; r=&quot;3.5&quot; fill=&quot;#6366F1&quot;&gt;&lt;/circle&gt;
  &lt;circle cx=&quot;66.9&quot; cy=&quot;146.7&quot; r=&quot;3.5&quot; fill=&quot;#6366F1&quot;&gt;&lt;/circle&gt;
  &lt;!-- Labels --&gt;
  &lt;text x=&quot;200&quot; y=&quot;30&quot; text-anchor=&quot;middle&quot; font-size=&quot;13&quot; font-weight=&quot;600&quot; fill=&quot;#1A1A1A&quot;&gt;Code&lt;/text&gt;
  &lt;text x=&quot;345&quot; y=&quot;142&quot; text-anchor=&quot;start&quot; font-size=&quot;13&quot; font-weight=&quot;600&quot; fill=&quot;#1A1A1A&quot;&gt;Reasoning&lt;/text&gt;
  &lt;text x=&quot;290&quot; y=&quot;325&quot; text-anchor=&quot;start&quot; font-size=&quot;13&quot; font-weight=&quot;600&quot; fill=&quot;#1A1A1A&quot;&gt;Knowledge&lt;/text&gt;
  &lt;text x=&quot;110&quot; y=&quot;325&quot; text-anchor=&quot;end&quot; font-size=&quot;13&quot; font-weight=&quot;600&quot; fill=&quot;#1A1A1A&quot;&gt;Speed&lt;/text&gt;
  &lt;text x=&quot;55&quot; y=&quot;142&quot; text-anchor=&quot;end&quot; font-size=&quot;13&quot; font-weight=&quot;600&quot; fill=&quot;#1A1A1A&quot;&gt;Latency&lt;/text&gt;
  &lt;!-- Legend row 1 --&gt;
  &lt;rect x=&quot;-10&quot; y=&quot;370&quot; width=&quot;14&quot; height=&quot;3&quot; rx=&quot;1&quot; fill=&quot;#6366F1&quot;&gt;&lt;/rect&gt;
  &lt;text x=&quot;8&quot; y=&quot;374&quot; font-size=&quot;9&quot; fill=&quot;#6B6560&quot;&gt;Kimi K2.5&lt;/text&gt;
  &lt;rect x=&quot;75&quot; y=&quot;370&quot; width=&quot;14&quot; height=&quot;3&quot; rx=&quot;1&quot; fill=&quot;#14B8A6&quot;&gt;&lt;/rect&gt;
  &lt;text x=&quot;93&quot; y=&quot;374&quot; font-size=&quot;9&quot; fill=&quot;#6B6560&quot;&gt;DeepSeek V3.2&lt;/text&gt;
  &lt;rect x=&quot;185&quot; y=&quot;370&quot; width=&quot;14&quot; height=&quot;3&quot; rx=&quot;1&quot; fill=&quot;#F59E0B&quot;&gt;&lt;/rect&gt;
  &lt;text x=&quot;203&quot; y=&quot;374&quot; font-size=&quot;9&quot; fill=&quot;#6B6560&quot;&gt;DeepSeek R1&lt;/text&gt;
  &lt;!-- Legend row 2 --&gt;
  &lt;rect x=&quot;-10&quot; y=&quot;386&quot; width=&quot;14&quot; height=&quot;3&quot; rx=&quot;1&quot; fill=&quot;#9A9490&quot;&gt;&lt;/rect&gt;
  &lt;text x=&quot;8&quot; y=&quot;390&quot; font-size=&quot;9&quot; fill=&quot;#6B6560&quot;&gt;Claude Opus 4.6&lt;/text&gt;
  &lt;line x1=&quot;110&quot; y1=&quot;387&quot; x2=&quot;124&quot; y2=&quot;387&quot; stroke=&quot;#6B6560&quot; stroke-width=&quot;1.5&quot; stroke-dasharray=&quot;4 2&quot;&gt;&lt;/line&gt;
  &lt;text x=&quot;128&quot; y=&quot;390&quot; font-size=&quot;9&quot; fill=&quot;#6B6560&quot;&gt;GPT-5.4&lt;/text&gt;
&lt;/svg&gt;
&lt;/div&gt;
&lt;div&gt;&lt;h2 id=&quot;the-scorecard&quot;&gt;The scorecard&lt;/h2&gt;&lt;/div&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;Winner&lt;/th&gt;&lt;th&gt;Open-source&lt;/th&gt;&lt;th&gt;Proprietary&lt;/th&gt;&lt;th&gt;Gap&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Code&lt;/strong&gt; (SWE-bench)&lt;/td&gt;&lt;td&gt;Opus 4.6&lt;/td&gt;&lt;td&gt;Kimi 76.8%&lt;/td&gt;&lt;td&gt;Opus 80.8%&lt;/td&gt;&lt;td&gt;-4 pts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Reasoning&lt;/strong&gt; (HLE)&lt;/td&gt;&lt;td&gt;R1&lt;/td&gt;&lt;td&gt;R1 50.2%&lt;/td&gt;&lt;td&gt;GPT-5.4 41.6%&lt;/td&gt;&lt;td&gt;+8.6 pts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Knowledge&lt;/strong&gt; (MMLU-Pro)&lt;/td&gt;&lt;td&gt;GPT-5.4&lt;/td&gt;&lt;td&gt;Kimi 87.1%&lt;/td&gt;&lt;td&gt;GPT-5.4 88.5%&lt;/td&gt;&lt;td&gt;-1.4 pts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Speed&lt;/strong&gt; (tok/s)&lt;/td&gt;&lt;td&gt;Kimi K2.5&lt;/td&gt;&lt;td&gt;334 t/s&lt;/td&gt;&lt;td&gt;GPT-5.4 78 t/s&lt;/td&gt;&lt;td&gt;4.3x faster&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt; (TTFT)&lt;/td&gt;&lt;td&gt;Kimi K2.5&lt;/td&gt;&lt;td&gt;0.31s&lt;/td&gt;&lt;td&gt;GPT-5.4 0.95s&lt;/td&gt;&lt;td&gt;3x faster&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Open-source wins 3 out of 5.&lt;/strong&gt; Proprietary models lead on Code (by 4 points) and Knowledge (by 1.4 points). Open-source leads on Reasoning (by 8.6 points), Speed (by 4.3x), and Latency (by 3x).&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: Kimi K2.5’s HLE score (50.2%) uses tool-augmented mode. Without tools it scores 31.5%. DeepSeek R1’s 50.2% is pure chain-of-thought reasoning without tools.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;what-production-ready-actually-means&quot;&gt;What “production-ready” actually means&lt;/h2&gt;&lt;/div&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Reliable enough.&lt;/strong&gt; Consistent quality across thousands of requests.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fast enough.&lt;/strong&gt; Kimi K2.5 at 334 tok/s and 0.31s TTFT. That’s real-time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Capable enough.&lt;/strong&gt; Within 4 points of the best proprietary model on code, ahead on reasoning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Predictable.&lt;/strong&gt; Versioned models that don’t change without warning.&lt;/li&gt;
&lt;/ol&gt;
&lt;div&gt;&lt;h2 id=&quot;the-real-advantage-control&quot;&gt;The real advantage: control&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Proprietary models change under you. Fine one day, different behavior the next. No changelog, no warning. Open-source models are versioned — DeepSeek V3.2 behaves the same tomorrow as today. You choose when to upgrade.&lt;/p&gt;
&lt;p&gt;For production workloads, that predictability is worth more than a marginal quality edge on any single benchmark.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;We serve 70+ open-source models through a single API. If you’re building a platform that needs LLM access for your users, &lt;a href=&quot;https://cheapestinference.com/platforms&quot;&gt;see how per-key plans work&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sources:&lt;/strong&gt; &lt;a href=&quot;https://artificialanalysis.ai/leaderboards/models&quot;&gt;Artificial Analysis Leaderboard&lt;/a&gt; · &lt;a href=&quot;https://www.swebench.com/&quot;&gt;SWE-bench Leaderboard&lt;/a&gt; · &lt;a href=&quot;https://kimi-k25.com/blog/kimi-k2-5-benchmark&quot;&gt;Kimi K2.5 Benchmarks&lt;/a&gt; · &lt;a href=&quot;https://artificialanalysis.ai/models/deepseek-v3-2&quot;&gt;DeepSeek V3.2&lt;/a&gt; · &lt;a href=&quot;https://openai.com/api/pricing/&quot;&gt;OpenAI Pricing&lt;/a&gt; · &lt;a href=&quot;https://platform.claude.com/docs/en/about-claude/pricing&quot;&gt;Anthropic Pricing&lt;/a&gt; · &lt;a href=&quot;https://artificialanalysis.ai/evaluations/humanitys-last-exam&quot;&gt;HLE Leaderboard&lt;/a&gt; · &lt;a href=&quot;https://artificialanalysis.ai/evaluations/mmlu-pro&quot;&gt;MMLU-Pro Leaderboard&lt;/a&gt;&lt;/p&gt;</content:encoded></item><item><title>What it takes to build your own LLM inference platform</title><link>https://docs.cheapestinference.com/blog/build-your-own-inference-platform/</link><guid isPermaLink="true">https://docs.cheapestinference.com/blog/build-your-own-inference-platform/</guid><pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;If you’re building a SaaS that needs to give users access to LLMs, you have two options: build the infrastructure yourself, or use a platform that does it for you. Here’s what “build it yourself” actually looks like.&lt;/p&gt;
&lt;p&gt;This isn’t theoretical. We built this. Here’s every component, what it does, and what alternatives exist.&lt;/p&gt;
&lt;div&gt;&lt;h2 id=&quot;0-model-access--the-first-problem&quot;&gt;0. Model access — the first problem&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Before you write a single line of code, you need access to models.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Self-host on your own hardware&lt;/strong&gt;: Buy GPUs, rent datacenter space, run the models yourself. Full control, best unit economics at scale — but massive upfront cost and you’re limited to the models you can afford to deploy. Running DeepSeek V3.2 requires multiple high-end GPUs. Running 70+ models? You’d need a data center.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Rent infrastructure&lt;/strong&gt;: Use GPU clouds like Vast.ai, AWS, Hetzner, CoreWeave, or Lambda. No hardware to buy, but you still manage deployments, scaling, and failover. Costs add up fast — a single H100 runs $2-4/hr.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use an inference provider&lt;/strong&gt;: Sign agreements with providers like DeepInfra, Together.ai, or Fireworks, which already have the models deployed. Pay per token, no GPU management. But you depend on their availability, pricing, and terms. If they change prices or drop a model, you need a plan B.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Mix&lt;/strong&gt;: Most serious platforms end up here. Own hardware for high-volume models where the unit economics justify it, rented GPUs for burst capacity, and provider agreements for the long tail of models nobody runs enough to self-host.&lt;/p&gt;
&lt;p&gt;Self-hosting 70+ models on your own is economically unrealistic. The real question is where to draw the line between own infra, rented compute, and providers.&lt;/p&gt;
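&lt;p&gt;That line is mostly arithmetic. A minimal sketch of the break-even math, where every number is an illustrative assumption rather than a real quote:&lt;/p&gt;

```python
# Back-of-envelope: self-hosted GPU vs. per-token provider pricing.
# All numbers below are illustrative assumptions, not real quotes.

GPU_COST_PER_HOUR = 3.00        # one rented H100, mid-range rate
TOKENS_PER_SECOND = 300         # sustained throughput at healthy batch sizes
PROVIDER_PRICE_PER_M = 0.90     # provider price per million output tokens
UTILIZATION = 0.5               # fraction of the hour the GPU is actually busy

tokens_per_hour = TOKENS_PER_SECOND * 3600 * UTILIZATION
self_host_per_m = GPU_COST_PER_HOUR / tokens_per_hour * 1_000_000

print(f"self-host: ${self_host_per_m:.2f}/M vs provider: ${PROVIDER_PRICE_PER_M:.2f}/M")
```

&lt;p&gt;Under these assumptions the provider wins. Push utilization and throughput up and the ratio flips, which is exactly why high-volume models end up on your own hardware and the long tail stays with providers.&lt;/p&gt;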
&lt;div&gt;&lt;h2 id=&quot;1-serving-engine&quot;&gt;1. Serving engine&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;If you self-host or rent GPUs, you need software to serve the models:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;vLLM&lt;/strong&gt; — most popular, good throughput, active community&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TGI&lt;/strong&gt; (Text Generation Inference) — Hugging Face’s solution, solid for single-model deployments&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TensorRT-LLM&lt;/strong&gt; — NVIDIA’s optimized engine, best raw performance but harder to set up&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SGLang&lt;/strong&gt; — newer, fast, good for structured generation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You’ll also need to handle model weights, quantization, scaling across GPUs, and failover when a node goes down. This is a full-time ops job.&lt;/p&gt;
&lt;div&gt;&lt;h2 id=&quot;2-api-proxy-layer&quot;&gt;2. API proxy layer&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Your users shouldn’t hit the inference backend directly. You need a proxy that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Translates between API formats (OpenAI, Anthropic)&lt;/li&gt;
&lt;li&gt;Routes requests to the right model/provider&lt;/li&gt;
&lt;li&gt;Injects authentication&lt;/li&gt;
&lt;li&gt;Handles retries and failover&lt;/li&gt;
&lt;li&gt;Strips provider headers so users can’t identify which backend you use&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Options:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Build from scratch with Express/Fastify + http-proxy-middleware&lt;/li&gt;
&lt;li&gt;Use an open-source gateway: LiteLLM, Portkey, Kong AI Gateway, MLflow Gateway&lt;/li&gt;
&lt;li&gt;Use a managed gateway: Helicone, Braintrust, Promptlayer&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each has trade-offs. Open-source gateways give you control but you manage the deployment. Managed gateways are easier but add latency and cost.&lt;/p&gt;
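&lt;p&gt;Whichever gateway you pick, the per-request transform has the same shape. A framework-agnostic sketch; the route table, header names, and URLs are all made up for illustration:&lt;/p&gt;

```python
# Sketch of the proxy's per-request transform. UPSTREAM_KEY, STRIP_HEADERS,
# and the example.com-style URLs are illustrative, not real values.

UPSTREAM_KEY = "sk-upstream-secret"
STRIP_HEADERS = {"x-provider", "x-request-region"}  # hide backend identity

def route_for(model: str) -> str:
    # In production this table comes from the model catalog (section 7).
    routes = {"deepseek-v3.2": "https://backend-a.example/v1/chat/completions"}
    return routes.get(model, "https://backend-default.example/v1/chat/completions")

def build_upstream_request(user_request: dict) -> dict:
    """Route by model and inject our own provider credentials."""
    return {
        "url": route_for(user_request["model"]),
        "headers": {"Authorization": f"Bearer {UPSTREAM_KEY}"},
        "body": user_request,
    }

def sanitize_response_headers(headers: dict) -> dict:
    """Strip provider-identifying headers before the response reaches the user."""
    return {k: v for k, v in headers.items() if k.lower() not in STRIP_HEADERS}
```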
&lt;div&gt;&lt;h2 id=&quot;3-authentication&quot;&gt;3. Authentication&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Two layers:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;User auth (dashboard login)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Firebase Auth, Auth0, Clerk, Supabase Auth, or roll your own&lt;/li&gt;
&lt;li&gt;Supports email, Google, GitHub, wallet signatures&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;API key auth (inference requests)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Generate API keys per user&lt;/li&gt;
&lt;li&gt;Validate on every request before proxying&lt;/li&gt;
&lt;li&gt;Store key metadata (plan, rate limits, owner)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is where it gets interesting for platforms. You need &lt;strong&gt;per-key plans&lt;/strong&gt; — each key with its own rate limits and usage tracking. Most auth solutions don’t do this out of the box. You’ll need a custom key management layer.&lt;/p&gt;
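&lt;p&gt;A minimal sketch of what that layer stores per key, assuming a simple hash-the-secret scheme so a database leak doesn’t leak credentials. Field names are illustrative:&lt;/p&gt;

```python
# Per-key plan record: the raw key is shown to the user once; only its
# hash is persisted. All field names here are assumptions.
import hashlib
import secrets

def create_key(owner: str, plan: str, rpm: int, tpm: int):
    raw = "sk-" + secrets.token_urlsafe(24)
    record = {
        "key_hash": hashlib.sha256(raw.encode()).hexdigest(),
        "owner": owner,
        "plan": plan,
        "limits": {"rpm": rpm, "tpm": tpm},  # per-key, not global
        "revoked": False,
    }
    return raw, record

def validate(record: dict, presented_key: str) -> bool:
    """Check a presented key against the stored record before proxying."""
    digest = hashlib.sha256(presented_key.encode()).hexdigest()
    return (not record["revoked"]) and digest == record["key_hash"]
```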
&lt;div&gt;&lt;h2 id=&quot;4-rate-limiting&quot;&gt;4. Rate limiting&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Per-key rate limiting with at least:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RPM&lt;/strong&gt; (requests per minute)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TPM&lt;/strong&gt; (tokens per minute)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Budget caps&lt;/strong&gt; (dollar amount per time window)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This needs to be enforced at the proxy layer, before the request hits the inference backend. Otherwise a single user can exhaust your GPU allocation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Options:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Redis-based counters (most common)&lt;/li&gt;
&lt;li&gt;Token bucket algorithms&lt;/li&gt;
&lt;li&gt;Proxy-level enforcement (some gateways include this)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you’re using per-key plans, each key needs its own set of limits. Not one global limit — individual limits per key.&lt;/p&gt;
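&lt;p&gt;The token bucket is the usual building block. An in-process sketch with illustrative parameters; a production version keeps this state in Redis so the limits hold across multiple proxy instances:&lt;/p&gt;

```python
# Minimal in-process token bucket. One instance per key per limit type
# (RPM, TPM). Parameters are illustrative.
import time

class TokenBucket:
    def __init__(self, rate_per_min: float, burst: float):
        self.rate = rate_per_min / 60.0   # refill per second
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then try to spend `cost` tokens."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

&lt;p&gt;For TPM the &lt;code&gt;cost&lt;/code&gt; is the request’s token count instead of 1, which is why enforcement has to live in the same layer that counts tokens.&lt;/p&gt;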
&lt;div&gt;&lt;h2 id=&quot;5-usage-tracking-and-billing&quot;&gt;5. Usage tracking and billing&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;You need to know:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How many tokens each key consumed (input + output)&lt;/li&gt;
&lt;li&gt;What model was used&lt;/li&gt;
&lt;li&gt;Cost per request&lt;/li&gt;
&lt;li&gt;Aggregate usage per user, per day, per billing period&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;For subscription billing:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Stripe for card payments&lt;/li&gt;
&lt;li&gt;Budget windows (e.g., $X per 5-hour period)&lt;/li&gt;
&lt;li&gt;Automatic key revocation when subscription expires&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;For pay-as-you-go:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Credit balance per user&lt;/li&gt;
&lt;li&gt;Deduct per request based on token count × model price&lt;/li&gt;
&lt;li&gt;Top-up flow (Stripe, crypto, etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;For crypto payments:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;USDC on a supported chain&lt;/li&gt;
&lt;li&gt;On-chain transaction verification&lt;/li&gt;
&lt;li&gt;Wallet connector in the dashboard (wagmi, viem, etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is a significant amount of code. Usage tracking alone requires intercepting every response to count tokens, calculating cost based on the model’s pricing, and storing it per key.&lt;/p&gt;
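&lt;p&gt;The per-request piece itself is small; the volume of edge cases around it is what makes it significant. A sketch, with assumed per-million-token prices:&lt;/p&gt;

```python
# Per-request cost: token counts come from the upstream response's usage
# block. The model ID and prices here are assumed, in $/million tokens.
PRICING = {"deepseek-v3.2": {"input": 0.28, "output": 0.42}}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

cost = request_cost("deepseek-v3.2", input_tokens=1200, output_tokens=350)
# Deduct `cost` from the key's credit balance, or add it to the current
# budget window and start rejecting once the window's cap is reached.
```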
&lt;div&gt;&lt;h2 id=&quot;6-dashboard&quot;&gt;6. Dashboard&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Your users need a web UI to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Create and manage API keys&lt;/li&gt;
&lt;li&gt;View usage per key (tokens, requests, cost)&lt;/li&gt;
&lt;li&gt;Subscribe to plans or top up credits&lt;/li&gt;
&lt;li&gt;See available models and pricing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Tech stack typically:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;React/Next.js/Vue frontend&lt;/li&gt;
&lt;li&gt;REST API backend&lt;/li&gt;
&lt;li&gt;Real-time usage updates&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For platforms (your users creating keys for their users), you also need a &lt;strong&gt;management API&lt;/strong&gt; — programmatic key creation, plan assignment, usage queries.&lt;/p&gt;
&lt;div&gt;&lt;h2 id=&quot;7-model-catalog-management&quot;&gt;7. Model catalog management&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Models change. New ones come out weekly. You need:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A catalog of which models you serve&lt;/li&gt;
&lt;li&gt;Pricing per model (input/output cost per token)&lt;/li&gt;
&lt;li&gt;Sync mechanism to update prices when providers change them&lt;/li&gt;
&lt;li&gt;Display names, categories, tags for the dashboard&lt;/li&gt;
&lt;li&gt;Cache pricing metadata (some models support prompt caching discounts)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is an ongoing operational burden, not a one-time setup.&lt;/p&gt;
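&lt;p&gt;One possible shape for a catalog entry; every field here is an assumption about what the dashboard and billing code need, not a fixed schema:&lt;/p&gt;

```python
# Hypothetical catalog entry. Prices are illustrative, in $/million tokens;
# the cached-input rate models prompt-caching discounts.
CATALOG_ENTRY = {
    "id": "deepseek-v3.2",
    "display_name": "DeepSeek V3.2",
    "tags": ["chat", "code"],
    "pricing": {"input_per_m": 0.28, "output_per_m": 0.42, "cached_input_per_m": 0.07},
    "context_window": 128_000,
    "deprecated": False,
}

def input_price(entry: dict, cached_tokens: int, fresh_tokens: int) -> float:
    """Cached prompt tokens bill at the discounted rate."""
    p = entry["pricing"]
    return (cached_tokens * p["cached_input_per_m"]
            + fresh_tokens * p["input_per_m"]) / 1_000_000
```

&lt;p&gt;The sync mechanism then becomes a job that diffs provider price lists against these entries and flags changes for review.&lt;/p&gt;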
&lt;div&gt;&lt;h2 id=&quot;8-documentation&quot;&gt;8. Documentation&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Your users need:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;API reference (endpoints, request/response formats)&lt;/li&gt;
&lt;li&gt;SDK examples (Python, Node.js, at minimum)&lt;/li&gt;
&lt;li&gt;Authentication guide&lt;/li&gt;
&lt;li&gt;Billing/usage documentation&lt;/li&gt;
&lt;li&gt;Quick start guide&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is easily 20-30 pages of documentation that needs to stay current.&lt;/p&gt;
&lt;div&gt;&lt;h2 id=&quot;9-monitoring-and-reliability&quot;&gt;9. Monitoring and reliability&lt;/h2&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Health checks on the inference backend&lt;/li&gt;
&lt;li&gt;Status page for users&lt;/li&gt;
&lt;li&gt;Alerting when latency spikes or errors increase&lt;/li&gt;
&lt;li&gt;Logging (but not logging prompt content — privacy)&lt;/li&gt;
&lt;li&gt;Graceful degradation when a model or provider is down&lt;/li&gt;
&lt;/ul&gt;
&lt;div&gt;&lt;h2 id=&quot;10-compliance-and-privacy&quot;&gt;10. Compliance and privacy&lt;/h2&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Privacy policy&lt;/li&gt;
&lt;li&gt;Data handling documentation&lt;/li&gt;
&lt;li&gt;GDPR compliance if you serve EU users&lt;/li&gt;
&lt;li&gt;Decision: do you store prompts? (You shouldn’t)&lt;/li&gt;
&lt;li&gt;SOC 2 / ISO 27001 if targeting enterprise&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;div&gt;&lt;h2 id=&quot;the-full-stack&quot;&gt;The full stack&lt;/h2&gt;&lt;/div&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Component&lt;/th&gt;&lt;th&gt;Ongoing maintenance&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Inference backend&lt;/td&gt;&lt;td&gt;High — scaling, failover, model updates&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;API proxy&lt;/td&gt;&lt;td&gt;Medium — format changes, new providers&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Auth + key management&lt;/td&gt;&lt;td&gt;Low&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Per-key rate limiting&lt;/td&gt;&lt;td&gt;Low&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Usage tracking + billing&lt;/td&gt;&lt;td&gt;Medium — edge cases, reconciliation&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Dashboard&lt;/td&gt;&lt;td&gt;Medium — new features, UX&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Model catalog&lt;/td&gt;&lt;td&gt;High — weekly model updates&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Documentation&lt;/td&gt;&lt;td&gt;Medium — keep current&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Monitoring&lt;/td&gt;&lt;td&gt;Low&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Privacy/compliance&lt;/td&gt;&lt;td&gt;Low&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;div&gt;&lt;h2 id=&quot;what-breaks-in-production&quot;&gt;What breaks in production&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;Building is the easy part. The hard part is what breaks with real users:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A provider changes their API format without warning. Your proxy returns 500s for 2 hours until you notice.&lt;/li&gt;
&lt;li&gt;A model gets deprecated. Your users’ hardcoded model IDs stop working overnight.&lt;/li&gt;
&lt;li&gt;Token counting has an off-by-one bug. You’ve been undercharging for 3 weeks. Your margin is gone.&lt;/li&gt;
&lt;li&gt;A user finds a way to exceed rate limits through concurrent requests. Your inference bill spikes 10x in one afternoon.&lt;/li&gt;
&lt;li&gt;Stripe webhook fails silently. A user’s subscription expired but their API key still works. Free inference for a month.&lt;/li&gt;
&lt;li&gt;You push a billing update and break the usage tracking. Three days of missing data. Users open tickets.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each of these has happened to us. We fixed them. The question is whether you want to fix them yourself, with your users waiting, or use a platform that already has.&lt;/p&gt;
&lt;div&gt;&lt;h2 id=&quot;or&quot;&gt;Or&lt;/h2&gt;&lt;/div&gt;
&lt;p&gt;You use an inference platform that already has all of this, create API keys for your users, and ship your product this week.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;We built all of the above so you don’t have to. &lt;a href=&quot;https://cheapestinference.com/platforms&quot;&gt;See how per-key plans work&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;</content:encoded></item></channel></rss>