The real cost of running AI agents in production

Chatbots are cheap. Agents are not.

A chatbot sends a user message, gets a response, displays it. Maybe 2,000 tokens per exchange. An agent reads files, calls tools, retries on errors, re-sends the entire conversation every step, and does this 20–60 times per task. Same API, completely different economics.

If you’re budgeting for AI agents the same way you budget for a chatbot, you’re underestimating by 10–50x.


We measured token consumption across three workload types, each running for one hour:

Tokens consumed in one hour, by workload:

  • Coding agent (OpenClaw): ~2.1M tokens
  • Research agent (CrewAI): ~1.2M tokens
  • RAG chatbot: ~200K tokens
  • Simple chatbot: ~40K tokens

The coding agent consumed 52x more tokens than a simple chatbot in the same time period. And this is normal — the agent was doing useful work the entire time.


Three architectural properties of agents make them expensive:

Every agent step appends tool outputs to the conversation. The LLM re-processes the entire conversation on each step. If the agent reads a 3,000-token file at step 5, that file gets re-sent at steps 6, 7, 8… all the way to the end.

For a 40-step task, one file read costs: 3,000 tokens × 35 remaining steps = 105,000 tokens in re-transmission.

This is why agent token consumption grows quadratically, not linearly.
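
The growth pattern is easy to check with a small simulation. This is a minimal sketch, not a measurement: the function name and the per-step figures (a 9,600-token system prompt, 1,500 new tokens per step) are assumptions chosen for illustration.

```python
def total_input_tokens(steps: int, system_prompt: int, new_tokens_per_step: int) -> int:
    """Sum the input tokens sent over a whole task.

    Assumes each step re-sends the system prompt plus everything
    accumulated so far, then appends `new_tokens_per_step` tokens
    (tool output, model reply) to the conversation.
    """
    total = 0
    history = 0  # tokens accumulated in the conversation so far
    for _ in range(steps):
        total += system_prompt + history  # the full context is sent again
        history += new_tokens_per_step    # this step's output joins the history
    return total


# Hypothetical numbers: compare how the total grows as the step count doubles.
for steps in (10, 20, 40):
    print(steps, total_input_tokens(steps, system_prompt=9_600, new_tokens_per_step=1_500))
```

The system-prompt share grows linearly with the step count, while the re-sent history grows with the square of it, which is why long tasks dominate the bill.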

Agent frameworks use large system prompts — OpenClaw’s is ~9,600 tokens, CrewAI’s varies by agent configuration. This prompt is sent with every request. Over 40 steps, the system prompt alone costs 384,000 tokens.

When a tool call fails, the agent retries. Each retry sends the full context plus the error message. Three retries on a 30K-token context waste 90K tokens with no productive output.

Without a retry cap, this can run indefinitely — always bound agents with a retry cap and a maximum iteration count.
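
A minimal sketch of both limits, assuming a hypothetical `step` callable that returns True when the task is finished and raises a hypothetical `ToolError` when a tool call fails. Neither name comes from OpenClaw or CrewAI; real frameworks expose their own settings for this.

```python
from typing import Callable


class ToolError(Exception):
    """Raised by a failed tool call (hypothetical error type)."""


def run_bounded(
    step: Callable[[int], bool],
    max_iterations: int = 60,
    max_retries: int = 3,
) -> str:
    """Run `step(i)` until it reports completion, with two hard limits.

    Consecutive failures are capped by `max_retries`; total iterations
    (including failed ones) are capped by `max_iterations`.
    """
    consecutive_failures = 0
    for i in range(max_iterations):
        try:
            if step(i):
                return "done"
            consecutive_failures = 0  # a successful step resets the retry budget
        except ToolError:
            consecutive_failures += 1
            if consecutive_failures > max_retries:
                return "aborted: retry cap reached"
    return "aborted: iteration cap reached"
```

Returning a status string instead of raising keeps the caller in control of what happens to a capped task, for example logging it or handing it to a human.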


Assuming one developer running 15 agent tasks per day, 22 working days per month, ~500K tokens per task:

Model | Cost/task | Daily (×15) | Monthly
Claude Opus 4.6 | $9.18 | $137.70 | $3,029
Claude Sonnet 4.6 | $2.25 | $33.75 | $743
GPT-5.4 | $4.73 | $70.95 | $1,561
DeepSeek V3.2 | $0.16 | $2.40 | $53
Qwen 3.5 35B | $0.04 | $0.60 | $13
CheapestInference (full day) | | | from $39 flat

A team of 5 developers each running 15 tasks/day on Claude Opus spends $15,145/month. The same team on flat-rate via CheapestInference pays a fixed monthly subscription per seat (from $39 for a reserved daily time block) — no matter how many tokens those agents burn. That’s an order-of-magnitude reduction.
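
The per-token rows follow directly from the stated assumptions (15 tasks per day, 22 working days). A small sketch for plugging in your own numbers; the function name is illustrative and the cost-per-task values are the ones quoted above.

```python
def monthly_cost(cost_per_task: float, tasks_per_day: int = 15, working_days: int = 22) -> float:
    """Per-developer monthly spend on per-token pricing."""
    return cost_per_task * tasks_per_day * working_days


# Cost-per-task figures quoted in the table above.
per_task = {
    "Claude Opus 4.6": 9.18,
    "Claude Sonnet 4.6": 2.25,
    "GPT-5.4": 4.73,
    "DeepSeek V3.2": 0.16,
    "Qwen 3.5 35B": 0.04,
}

for model, cost in per_task.items():
    print(f"{model}: ${monthly_cost(cost):,.2f}/month per developer")

# Team figure: five developers, each rounded to the dollar as in the table.
team_of_five = 5 * round(monthly_cost(per_task["Claude Opus 4.6"]))
print(f"Team of 5 on Opus: ${team_of_five:,}/month")
```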


Four strategies to cut agent inference costs

DeepSeek V3.2 and Qwen 3.5 score within 4 points of GPT-5.4 and Opus on most benchmarks. For coding tasks specifically, DeepSeek V3.2 matches Opus on HumanEval and SWE-bench. Full data: Open-source models are production-ready.

Not every agent step needs a frontier model. File reads, simple classifications, and formatting don’t need 685B parameters. Use a small model for easy steps and a large model for hard ones. Full guide: Building a multi-model architecture.

Give each agent its own API key so one runaway agent can’t starve the others. On a time-block subscription each key gets unlimited usage during its reserved hours, so you isolate workloads without juggling per-token allocations.

Per-token pricing penalizes the exact patterns agents use: large contexts, many steps, retries. Flat-rate pricing makes all of that free. During your reserved time blocks your agent can use the full context window and retry freely without increasing the bill — reserve all three blocks for 24/7 coverage.


Here’s the equation most teams miss:

Agent cost = tokens_per_step × steps × cost_per_token

Most optimization focuses on cost_per_token: switching to a cheaper model. But tokens_per_step grows with every step as context accumulates, which makes total tokens grow quadratically, and steps is unpredictable. Optimizing only one variable leaves the other two working against you.
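
One way to make the quadratic term explicit, under simplifying assumptions: every step sends a fixed system prompt of S tokens plus the accumulated history, each step adds t new tokens to that history, and output tokens are ignored. The symbols S, t, n (number of steps), and c (cost per input token) are introduced here for illustration.

```latex
\begin{aligned}
\text{input tokens at step } k &= S + (k-1)\,t, \qquad k = 1, \dots, n \\
\text{total input tokens} &= \sum_{k=1}^{n} \bigl(S + (k-1)\,t\bigr) = nS + \frac{n(n-1)}{2}\,t \\
\text{agent cost} &\approx c \left( nS + \frac{n(n-1)}{2}\,t \right)
\end{aligned}
```

With n = 40 and S = 9,600, the first term is the 384,000-token system prompt cost mentioned earlier; the second term is the part that grows with the square of the step count.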

Flat-rate pricing eliminates all three variables from your bill. The cost is the subscription. Period.


We serve Kimi K2.6, GLM 4.7, and MiniMax M2.5 with flat-rate, unlimited time-block subscriptions — no token counting, no budget caps during your reserved hours. Reserve 1–3 daily 8-hour blocks from $39/month and your agent’s token consumption never becomes your problem. Get started or see plans.

Qwen 3.5 vs GPT-5.4 vs Claude Opus 4.6 — same quality, fraction of the price

You asked for this. After our first benchmark post, the most requested model was Qwen 3.5. Here it is — 4 models across 5 metrics, same models in every chart:

Open-source: Qwen3.5-397B-A17B (flagship), Qwen3.5-35B-A3B (efficient)
Proprietary: GPT-5.4, Claude Opus 4.6


Knowledge (MMLU-Pro):

  • GPT-5.4: 88.5%
  • Qwen3.5 397B: 87.8%
  • Qwen3.5 35B: 85.3%
  • Claude Opus 4.6: 82.0%

GPT-5.4 leads at 88.5%, but Qwen3.5-397B is 0.7 points behind — statistically noise. The 35B with only 3B active parameters scores 85.3%, beating Opus by 3 points. The total spread across all four models is just 6.5 points.

Qwen3.5-397B comes within a point of GPT-5.4 at roughly 4.6x lower input price. The 35B beats Opus at about 23x lower.


Reasoning (GPQA):

  • GPT-5.4: 92.0%
  • Claude Opus 4.6: 91.3%
  • Qwen3.5 397B: 88.4%
  • Qwen3.5 35B: 84.2%

Proprietary models lead on graduate-level reasoning. GPT-5.4 at 92% and Opus at 91.3% are strong. But Qwen3.5-397B at 88.4% is within 4 points — and costs $0.54/M vs $2.50 and $5.00. The 35B at 84.2% is still PhD-level performance for $0.22/M input.


Code (LiveCodeBench):

  • GPT-5.4: 84.0%
  • Qwen3.5 397B: 83.6%
  • Claude Opus 4.6: 76.0%
  • Qwen3.5 35B: 74.6%

The 397B essentially ties GPT-5.4 on competitive coding: they are 0.4 points apart. Both beat Opus by roughly 8 points (8.0 for GPT-5.4, 7.6 for the 397B). The 35B at 74.6% is within 2 points of Opus, at 1/23rd the price.

For dedicated coding workloads, the open ecosystem also offers Qwen3-Coder-480B (SWE-bench Verified: 69.6%, comparable to Claude Sonnet 4).


Speed (tok/s):

  • Qwen3.5 35B: 178 t/s
  • Qwen3.5 397B: 84 t/s
  • GPT-5.4: ~78 t/s
  • Claude Opus 4.6: 46 t/s

The 35B’s MoE architecture pays off — 178 tok/s is 2.3x faster than GPT-5.4 and 3.9x faster than Opus. Even the 397B flagship at 84 tok/s outpaces both proprietary models. This is what happens when only 3-17B parameters activate per token instead of the full model.

Speed data from Artificial Analysis. Actual speeds on our infrastructure may differ.


Price ($/M input tokens):

  • Qwen3.5 35B: $0.22
  • Qwen3.5 397B: $0.54
  • GPT-5.4: $2.50
  • Claude Opus 4.6: $5.00

This is the chart that matters. Opus costs 23x more than the 35B and 9x more than the 397B. GPT-5.4 costs 5x more than the 397B. The quality difference? Single-digit percentage points on every benchmark.


[Radar chart: Code, Reasoning, Knowledge, and Speed for Qwen3.5 397B, Qwen3.5 35B, GPT-5.4, and Opus 4.6]

Quality only, no price axis. GPT-5.4 (gray) has the largest shape. Opus (dashed) is strong on reasoning and code. The 397B (indigo) nearly overlaps GPT-5.4 on code and knowledge. The 35B (teal) pulls hard left on speed: 178 tok/s is more than 2x faster than anything else here. Price tells its own story in the chart above.

Metric | Winner | Qwen3.5 397B | GPT-5.4 | Claude Opus 4.6 | Gap (397B vs best)
Knowledge (MMLU-Pro) | GPT-5.4 | 87.8% | 88.5% | 82.0% | -0.7 pts
Reasoning (GPQA) | GPT-5.4 | 88.4% | 92.0% | 91.3% | -3.6 pts
Code (LiveCodeBench) | GPT-5.4 | 83.6% | 84.0% | 76.0% | -0.4 pts
Speed (tok/s) | Qwen3.5 397B | 84 t/s | ~78 t/s | 46 t/s | 1.1x faster
Price ($/M input) | Qwen3.5 397B | $0.54 | $2.50 | $5.00 | 4.6x cheaper

Same weight class, different price tag. The 397B trades 0.4–3.6 points on quality for 4.6x lower price and faster speed. It beats Opus on 4 out of 5 metrics outright.

Note: The Qwen3.5-35B-A3B ($0.22/M) scores 85.3% MMLU-Pro, 84.2% GPQA, 74.6% LiveCodeBench at 178 tok/s — beating Opus on knowledge and speed at 23x less cost. A different weight class, but worth considering if speed and price matter more than the last few quality points.


The real question: what are you paying for?

The quality gap between Qwen3.5-397B and GPT-5.4 is 0.7 points on knowledge, 0.4 points on code. The price gap is 4.6x.

Put it differently:

Model | MMLU-Pro | Cost per quality point
Qwen3.5 35B | 85.3% | $0.003 per point per M tokens
Qwen3.5 397B | 87.8% | $0.006 per point per M tokens
GPT-5.4 | 88.5% | $0.028 per point per M tokens
Claude Opus 4.6 | 82.0% | $0.061 per point per M tokens

Opus costs about 24x more per quality point than the 35B, and scores lower. GPT-5.4 leads on quality but costs roughly 5x to 11x more than the Qwen models for single-digit advantages.
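
The "cost per quality point" figure is simply the input price divided by the MMLU-Pro score. A short sketch using the prices and scores quoted in this post; the function name is illustrative.

```python
# Input price ($ per million tokens) and MMLU-Pro score, as quoted in this post.
models = {
    "Qwen3.5 35B": (0.22, 85.3),
    "Qwen3.5 397B": (0.54, 87.8),
    "GPT-5.4": (2.50, 88.5),
    "Claude Opus 4.6": (5.00, 82.0),
}


def cost_per_point(price_per_m: float, score: float) -> float:
    """Dollars per million input tokens, per benchmark point."""
    return price_per_m / score


for name, (price, score) in models.items():
    print(f"{name}: ${cost_per_point(price, score):.3f} per point per M tokens")
```

This is a deliberately crude ratio: it treats every benchmark point as equally valuable, which is rarely true near the top of the range.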

For most workloads, the last 3% of benchmark performance isn’t worth a 5x price increase. And for workloads where it is, the 397B gets you within 1 point of GPT-5.4 on knowledge and code at a fraction of the cost.


Also worth knowing: specialized Qwen models

Beyond the general-purpose models, the Qwen family includes two notable specialists:

  • Qwen3-Coder-480B — SWE-bench Verified 69.6%, comparable to Claude Sonnet 4. Built for agentic coding.
  • Qwen3-235B-Thinking — Chain-of-thought reasoning specialist. When you need the model to show its work.

CheapestInference serves frontier open-weight models — Kimi K2.6, GLM 4.7, and MiniMax M2.5 — on unlimited time-block subscriptions from $39/mo, or pay-as-you-go credits from $10. See plans and try it →

Sources: Qwen3.5-397B Model Card · Qwen3.5-35B Model Card · Artificial Analysis Leaderboard · GPQA Diamond Leaderboard · OpenAI Pricing · Anthropic Pricing · LiveCodeBench Leaderboard

OpenClaw is free. Running it is not.

OpenClaw has 247,000 GitHub stars. It’s free, open-source, and runs locally. You install it, point it at an LLM, and it writes code, browses the web, queries databases, and executes files on your behalf.

The agent is free. The inference is not.

Every time OpenClaw calls a model, it re-sends the entire conversation history: every tool output, every file it read, every intermediate result. By iteration 20 of a typical task, the input context is 30,000+ tokens. By iteration 40, it’s past 100,000. And it sends this with every single request.

This is not a bug. It’s how agents work. And it’s why running OpenClaw on pay-per-token APIs costs $300–600/month for active users — sometimes more.


We broke down token consumption for a typical OpenClaw coding task: “add authentication to an Express API.” The agent completed it in 38 tool calls.

Token breakdown for the task:

  • Context accumulation: ~280K tokens
  • System prompt (×38): ~156K tokens
  • Tool outputs (files, etc.): ~70K tokens
  • Agent output: ~19K tokens

Total: ~525,000 tokens for a single task. The agent’s actual output — the code it wrote — was 19K tokens. The other 96% is overhead.

On Claude Opus at $15/M input + $75/M output, that single task costs $9.18. Run five tasks a day and you’re at $1,377 over a 30-day month.

On DeepSeek V3.2 via a pay-per-token provider at $0.27/M input + $1.10/M output, the same task costs $0.16. Better — but 20 tasks a day is still $96/month, and that’s one agent.


Here’s the OpenClaw-specific version:

OpenClaw reads files into context. If it reads a 2,000-token file at step 5, that file gets re-sent at steps 6, 7, 8… all the way to 38. That single file read costs 2,000 × 33 remaining steps = 66,000 tokens in re-transmission alone.
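
The same arithmetic as a reusable helper, useful for estimating what any single tool output costs over the rest of a task. The function name is illustrative.

```python
def retransmission_tokens(file_tokens: int, read_step: int, total_steps: int) -> int:
    """Tokens spent re-sending one tool output after the step that produced it."""
    remaining_steps = total_steps - read_step
    return file_tokens * remaining_steps


print(retransmission_tokens(2_000, read_step=5, total_steps=38))  # 66000
```

The earlier a large file is read, the more it costs, so deferring big reads until they are needed is a cheap optimization on per-token pricing.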

Users report session contexts at 56–58% of the 400K context window during normal use. This isn’t a failure mode — it’s the architecture working as designed.

OpenClaw’s system prompt is ~9,600 tokens. It gets sent with every request. Over 38 tool calls, that’s 365K tokens just in system prompts. You pay this whether the agent does useful work or not.

OpenClaw defaults to a single model for everything. But not every tool call needs the same intelligence:

  • Reading a file and deciding what to edit? Llama 3.1 8B handles this at 200 tokens/sec.
  • Writing complex authentication logic? A frontier open-weight model like Kimi K2.6 is the right call.
  • Formatting a config file? Any 8B model is overkill but still cheaper than Opus.

We wrote a full guide on this pattern: Building a multi-model architecture. Routing agent requests to the right model can cut costs by 60–80% without reducing output quality.


Here’s the comparison for an OpenClaw user running ~20 tasks/day:

Provider | Cost/task | 20 tasks/day | Monthly
Claude Opus (direct) | $9.18 | $183.60 | $5,508
GPT-5.4 (direct) | $4.73 | $94.60 | $2,838
DeepSeek V3.2 (per-token) | $0.16 | $3.20 | $96
CheapestInference | | | from $39/mo

Flat-rate means you don’t care about context accumulation. The 280K tokens of context overhead that makes pay-per-token expensive? Irrelevant. The system prompt tax? Doesn’t matter. Your agent can call models 24/7 and the bill is the same.


If you’re running OpenClaw, here’s the setup we see working best:

1. Use open-weight models. Frontier open-weight models like Kimi K2.6 and GLM 4.7 score within a few points of proprietary models on coding benchmarks (the data). The gap doesn’t justify a 50x cost difference.

2. Route by complexity. Don’t send file reads and simple decisions to the same model as complex code generation. A router model costs fractions of a cent per classification. Full guide: Multi-model architecture.

3. Reserve the hours you work. On CheapestInference you reserve one or more daily 8-hour time blocks (Asia-Pacific, Europe, Americas — pick 1–3, all three is full 24/7). During your reserved hours inference is unlimited with no budget cap. One API key per agent, one concurrent request per key. Outside your window, requests return 429 until your block opens again.

4. Handle rate limits automatically. Time blocks mean your agent will hit 429s outside your reserved window — that’s expected. But OpenClaw kills the conversation when it gets a 429. The agent stops, and if you close the dashboard, that conversation is gone.

We built an OpenClaw plugin that fixes this: openclaw-ratelimit-retry. It hooks into agent_end, detects retriable 429s, parks the session on disk, and waits for the budget window to reset. Then it sends chat.send to the original session — resuming the conversation with its full transcript, as if you had typed a message.

openclaw plugins install @cheapestinference/openclaw-ratelimit-retry

~/.openclaw/config.yaml:

plugins:
  ratelimit-retry:
    budgetWindowHours: 8       # matches your CheapestInference 8-hour time block
    maxRetryAttempts: 3        # give up after 3 consecutive 429s
    checkIntervalMinutes: 5    # check every 5 min for ready retries

The plugin is zero-dependency, persists across server restarts, deduplicates by session, and handles edge cases like sub-agents, queue overflow, and corrupted state files. If the retry itself hits a 429, it re-queues automatically. No tokens wasted on re-sending from scratch — the agent picks up exactly where it left off.

This turns budget caps from “your agent crashes” into “your agent naps and wakes up.” Set it up once and forget about it.

5. Consider unlimited time blocks. If your agent runs more than a few tasks per day, per-token pricing works against you. Every token of context overhead is money. With an unlimited time-block subscription, context overhead is free during your reserved hours — re-send the full window, let the agent work without a budget cap.


OpenClaw is free because the code runs on your machine. But the valuable part — the intelligence — runs on someone else’s GPUs. The agent framework is the cheap part. Inference is the expensive part.

Open-source models on flat-rate infrastructure flip this equation. The models are free. The inference is flat. The only variable cost left is your time.

Point your OpenClaw base_url at https://api.cheapestinference.com/v1 and find out what unconstrained agents actually cost: nothing more than you already budgeted.

Building a multi-model architecture: route requests to the right LLM

Using one model for everything is the simplest architecture. It’s also the most wasteful. A 685B-parameter reasoning model answering “what’s the weather?” is like hiring a PhD to sort mail.

This guide covers how to use a small, fast model to classify incoming requests and route them to the right specialist. The result: lower latency, lower cost, and often better quality — because each model handles what it’s actually good at.


The problem with single-model architectures

Most applications start with one model:

User request --> Large Model --> Response

This works, but every request — simple or complex — pays the same latency and cost penalty. When 60% of your traffic is simple classification, FAQ, or extraction, you’re burning expensive compute on tasks a small model handles equally well.

Throughput (output tokens per second):

  • Llama 3.1 8B: ~200 t/s
  • DeepSeek V3.2: ~60 t/s
  • DeepSeek R1: ~30 t/s

The gap between Llama 8B and R1 is nearly 7x in throughput. Routing simple requests to the small model saves that difference on every request.


User request --> Router (Llama 8B) --> classify intent
                           |
      +-------------+------+------+-------------+
      |             |             |             |
   simple        general      reasoning       code
      |             |             |             |
Llama 3.1 8B    DeepSeek     DeepSeek R1      Qwen3
                  V3.2                        Coder
      |             |             |             |
      +-------------+------+------+-------------+
                           |
                       Response

Two stages:

  1. Classify — The router model reads the user’s message and outputs a category. A fast model returns this in a fraction of a second.
  2. Route — Based on the category, forward the request to the appropriate specialist model.

The router adds minimal overhead (~200ms) but saves significant compute by keeping simple requests away from expensive models.


A fast model like MiniMax M2.5 makes a good router. With low TTFT and a short, single-word output, the classification step costs almost nothing and completes before the user notices.

The classification prompt is simple — you want a single-word category, not a conversation:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.cheapestinference.com/v1",
    api_key="your-api-key"
)

def classify_request(user_message: str) -> str:
    """Classify a user message into a routing category."""
    response = client.chat.completions.create(
        model="MiniMax-M2.5",
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the user's message into exactly one category. "
                    "Respond with only the category name, nothing else.\n\n"
                    "Categories:\n"
                    "- simple: greetings, FAQ, simple factual questions\n"
                    "- general: complex questions, analysis, writing, summarization\n"
                    "- reasoning: math, logic, multi-step problems, science\n"
                    "- code: code generation, debugging, refactoring, technical implementation\n"
                    "- agent: tasks requiring tool use, web search, or multi-step execution"
                )
            },
            {"role": "user", "content": user_message}
        ],
        max_tokens=10,
        temperature=0
    )
    # The content field can be None, so fall back to an empty string
    category = (response.choices[0].message.content or "").strip().lower()
    # Default to general if classification is unclear
    valid = {"simple", "general", "reasoning", "code", "agent"}
    return category if category in valid else "general"

The key details: max_tokens=10 because we only need one word. temperature=0 for deterministic routing. The system prompt is explicit about format — no preamble, just the category.
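
Even with an explicit prompt, small models sometimes reply with "Code." or "Category: code". A hedged sketch of a more forgiving normalizer that could replace the exact-match check; `normalize_category` is a hypothetical helper, not part of any SDK.

```python
import re
from typing import Optional

VALID_CATEGORIES = {"simple", "general", "reasoning", "code", "agent"}


def normalize_category(raw: Optional[str], default: str = "general") -> str:
    """Map a raw model reply to a known category, or fall back to `default`."""
    if not raw:
        return default
    # Scan the reply word by word; the first known category wins.
    for word in re.findall(r"[a-z]+", raw.lower()):
        if word in VALID_CATEGORIES:
            return word
    return default
```

The first-match rule is a simplification: a reply that mentions two categories resolves to whichever appears first, so it is worth logging raw replies while tuning the prompt.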


Each category maps to a model optimized for that task:

# Model routing table
ROUTE_TABLE = {
    "simple": "MiniMax-M2.5",
    "general": "glm-4.7",
    "reasoning": "glm-4.7",
    "code": "glm-4.7",
    "agent": "kimi-k2.6",
}

def route_request(user_message: str, conversation_history: list) -> str:
    """Classify and route a request to the appropriate model."""
    category = classify_request(user_message)
    model = ROUTE_TABLE[category]
    response = client.chat.completions.create(
        model=model,
        messages=conversation_history + [
            {"role": "user", "content": user_message}
        ],
        stream=True
    )
    # Stream the response back
    full_response = ""
    for chunk in response:
        # Some chunks carry no choices or no text; skip those
        if chunk.choices and chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            full_response += content
            print(content, end="", flush=True)
    return full_response

Notice that simple requests route back to MiniMax M2.5 — the same model that did the classification. For simple queries, the router overhead is effectively zero because the specialist is the same model and can reuse the warm connection.


The basic router works for most traffic, but production systems need a few refinements:

from typing import Optional

def route_request_production(
    user_message: str,
    conversation_history: list,
    force_model: Optional[str] = None
) -> tuple[str, str]:
    """Production router with overrides and fallback."""
    # Allow explicit model override (for power users or testing)
    if force_model:
        model = force_model
        category = "override"
    else:
        category = classify_request(user_message)
        model = ROUTE_TABLE[category]
    try:
        response = client.chat.completions.create(
            model=model,
            messages=conversation_history + [
                {"role": "user", "content": user_message}
            ]
        )
        return response.choices[0].message.content, category
    except Exception:
        # Fallback to GLM 4.7 if the specialist is unavailable
        fallback = "glm-4.7"
        response = client.chat.completions.create(
            model=fallback,
            messages=conversation_history + [
                {"role": "user", "content": user_message}
            ]
        )
        return response.choices[0].message.content, f"{category}->fallback"

Three patterns worth noting:

  1. Force model — Let callers bypass routing when they know what they need.
  2. Fallback — If a specialist model is down, fall back to GLM 4.7. It handles everything reasonably well.
  3. Return the category — Log which route each request takes. You’ll need this data to tune the system.
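
For the logging pattern, a minimal in-memory sketch is shown here. The helper names are hypothetical, and a production system would more likely write these counts to its metrics or logging backend.

```python
from collections import Counter

route_counts: Counter = Counter()


def record_route(category: str) -> None:
    """Tally one routed request; `category` is the second value the router returns."""
    route_counts[category] += 1


def route_share() -> dict:
    """Return each route's share of total traffic as a fraction."""
    total = sum(route_counts.values())
    if total == 0:
        return {}
    return {route: count / total for route, count in route_counts.items()}


# Example with made-up traffic, including one fallback.
for category in ["simple", "simple", "general", "code->fallback"]:
    record_route(category)
print(route_share())
```

A rising share of "->fallback" entries is an early sign that a specialist model is unhealthy.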

Consider a workload of 1,000 requests with this distribution: 600 simple, 300 general, 70 reasoning, 30 code. Average 500 input tokens, 200 output tokens per request.

Single-model approach (everything on V3.2)

Average latency: ~4.5s (all 1,000 requests on V3.2 only)

Every request waits for V3.2’s ~1.2s TTFT plus generation time at ~60 t/s. Simple questions get the same treatment as complex analysis.
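
The ~4.5s figure comes from a simple model: time to first token plus generation time for 200 output tokens. A sketch of that estimate follows; it ignores network overhead and queueing, and the function name is illustrative.

```python
def estimated_latency(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Time to first token plus time to generate the output."""
    return ttft_s + output_tokens / tokens_per_s


print(round(estimated_latency(1.2, 200, 60), 1))  # 4.5
```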

Per-route latency with routing:

  • Simple (600 requests): ~1.2s (8B)
  • General (300 requests): ~4.7s (V3.2)
  • Reasoning (70 requests): ~9.0s (R1)
  • Code (30 requests): ~3.5s (Coder)

The weighted average latency drops to approximately 2.9s, a reduction of about 36%. The 600 simple requests finish in ~1.2s instead of ~4.5s. That’s a 3.7x improvement for the majority of your traffic.
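
The weighted average can be recomputed from the per-route request counts and latencies, which is handy when your own traffic mix differs. A sketch using the figures listed for this workload; the function name is illustrative.

```python
# (request count, latency in seconds) per route, from the workload described above.
routes = {
    "simple": (600, 1.2),
    "general": (300, 4.7),
    "reasoning": (70, 9.0),
    "code": (30, 3.5),
}


def weighted_latency(routes: dict) -> float:
    """Average latency across all requests, weighted by request count."""
    total_requests = sum(count for count, _ in routes.values())
    total_seconds = sum(count * seconds for count, seconds in routes.values())
    return total_seconds / total_requests


routed = weighted_latency(routes)
baseline = 4.5  # single-model average latency in seconds
print(f"routed average: {routed:.2f}s, reduction vs baseline: {1 - routed / baseline:.0%}")
```

Because the average is dominated by the largest bucket, shifting even a small share of traffic from "general" to "simple" moves the result noticeably.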

The 70 reasoning requests are slower individually (~9s vs ~4.5s) because R1 generates chain-of-thought tokens. But the quality on those specific requests is significantly better — R1 scores 50.2% on HLE versus V3.2’s 39.3%.

You get faster averages and better quality on the hard tail.


A customer support chatbot receives three types of requests:

  1. FAQ (60%) — “What are your business hours?” / “How do I reset my password?”
  2. Complex support (30%) — “I was charged twice for order #12345, can you investigate?”
  3. Technical issues (10%) — “Your API returns 500 when I send multipart form data with UTF-8 filenames”

All requests go to DeepSeek V3.2. FAQs get correct answers but with unnecessary latency. Technical issues get decent answers but miss edge cases that a code-specialized model would catch.

SUPPORT_ROUTES = {
    "simple": "MiniMax-M2.5",   # FAQ, greetings
    "general": "glm-4.7",       # Complex support
    "reasoning": "glm-4.7",     # Investigations
    "code": "glm-4.7",          # Technical issues
    "agent": "kimi-k2.6",       # Multi-step resolution
}

FAQs resolve quickly via MiniMax M2.5. Complex support issues get GLM 4.7’s full analytical capability. Technical problems also route to GLM 4.7, which understands the code context well. If a support issue requires looking up order data via API, it routes to Kimi K2.6 for tool-assisted resolution.

The classification step adds ~200ms. For the 60% of requests that drop from ~4.5s to ~1.2s, that’s an invisible cost.


Routing adds complexity. Skip it when:

  • All your requests are the same type. If you’re building a code editor, just use a single coding model like GLM 4.7. No routing needed.
  • You have fewer than 100 requests/day. The cost savings don’t justify the engineering overhead at low volume.
  • Latency doesn’t matter. For batch processing or async workloads, a single capable model is simpler.
  • Your classification accuracy is low. If the router misclassifies frequently, you get worse results than a single good model. Test the classifier on real traffic before deploying.

The sweet spot is high-volume applications with diverse request types — chatbots, API gateways, developer tools, and customer-facing products where response time directly affects user experience.


  1. Log your traffic. Before building a router, understand your request distribution. What percentage is simple? Complex? Code?
  2. Start with two tiers. A lighter model like MiniMax M2.5 for simple, GLM 4.7 for everything else. Add specialists only when you have data showing they help.
  3. Measure classification accuracy. Sample 100 requests, manually label them, compare against the router’s output. Target >90% accuracy.
  4. Add fallback. Every specialist route should fall back to GLM 4.7 if the specialist is unavailable.
  5. Monitor per-route metrics. Track latency, cost, and quality per category. This tells you where to optimize next.
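
Step 3 can be done with a few lines once you have a hand-labeled sample. This is a sketch under the assumption that labels and router outputs are parallel lists of category strings; the function name is illustrative.

```python
from collections import defaultdict


def routing_accuracy(labels: list, predictions: list) -> tuple:
    """Return (overall accuracy, per-category accuracy) for a labeled sample."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and the same length")
    seen = defaultdict(int)     # how often each true category appears
    correct = defaultdict(int)  # how often the router got that category right
    for truth, predicted in zip(labels, predictions):
        seen[truth] += 1
        if truth == predicted:
            correct[truth] += 1
    overall = sum(correct.values()) / len(labels)
    per_category = {category: correct[category] / seen[category] for category in seen}
    return overall, per_category


# Tiny made-up sample: one "code" request was misrouted to "general".
labels = ["simple", "simple", "code", "code", "general"]
predictions = ["simple", "simple", "code", "general", "general"]
print(routing_accuracy(labels, predictions))
```

Per-category accuracy matters more than the overall number: a router can clear 90% overall while misrouting most of a small but important category.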

All models in this guide are available through a single OpenAI-compatible API with no configuration changes between models. If you’re building a platform that needs LLM access for your users, see how per-key plans work.

Sources: Artificial Analysis Leaderboard · DeepSeek V3.2 · HLE Leaderboard · Kimi K2.5 Benchmarks

How to choose the right open-source model for your task

Most teams default to the biggest model available and call it a day. That works until latency spikes, costs climb, and you realize an 8B-parameter model would have handled 60% of your requests just fine.

This guide maps common use cases to specific models, with real throughput numbers from our infrastructure. No theory — just which model to pick and why.


Use case | Model | Why
General chat / assistants | DeepSeek V3.2 | Best all-rounder. 85% MMLU-Pro, 73% SWE-bench, 60 t/s.
Complex reasoning | DeepSeek R1 | 50.2% on Humanity’s Last Exam. Chain-of-thought built in.
Code generation | Qwen3 Coder | Purpose-built for code. Strong on completions, refactoring, and debugging.
Agentic workflows | Kimi K2.5 | 334 t/s output, native tool use, 50.2% HLE with tools. Built for agents.
Vision / multimodal | Llama 4 Scout | 17B active parameters (16 experts), 109B total params, native image understanding.
Fast classification | Llama 3.1 8B | ~200 t/s, 0.2s TTFT. Small enough for routing, tagging, extraction.
General (budget) | GLM 4.7 Flash | Fast inference, competitive quality. Good when V3.2 is overkill.
Long context chat | MiniMax M2.5 | Native long-context support. Handles large documents well.
Large general + reasoning | Qwen3 235B | 235B MoE. Strong across benchmarks when you need maximum capability.
Embeddings | BGE Large | MTEB-tested. Solid retrieval quality for RAG pipelines.

Pick: DeepSeek V3.2

DeepSeek V3.2 is the default choice for most workloads. It scores 85% on MMLU-Pro (beating Claude Opus 4.6’s 82%), 73% on SWE-bench Verified, and runs at ~60 tokens/second on our infrastructure.

Throughput (output tokens per second):

  • Kimi K2.5: 334 t/s
  • Llama 3.1 8B: ~200 t/s
  • DeepSeek V3.2: ~60 t/s
  • DeepSeek R1: ~30 t/s

Good at: Broad knowledge, instruction following, multilingual, structured output.
Not ideal for: Tasks that need step-by-step reasoning chains (use R1) or sub-second latency (use Llama 8B).
Pick over alternatives when: You need a reliable general-purpose model that handles most tasks without specialization.


Pick: DeepSeek R1

R1 is a reasoning-first model. It produces explicit chain-of-thought tokens before its final answer. On Humanity’s Last Exam, a benchmark of expert-level questions built to sit at the edge of what current models can solve, R1 scores 50.2%, beating GPT-5.4 (41.6%) and Claude Opus 4.6 (40%).

The tradeoff is speed. At ~30 t/s, R1 is the slowest model in our lineup. That’s expected — it’s generating reasoning tokens that never appear in the final output.

Good at: Math, science, logic puzzles, multi-step problems, anything where “thinking” helps.
Not ideal for: Simple Q&A, classification, or latency-sensitive applications.
Pick over alternatives when: The task requires multi-step deduction. If a human would need to “think through it,” R1 will outperform faster models.


Pick: Qwen3 Coder

Qwen3 Coder is purpose-built for software engineering tasks — code completion, refactoring, debugging, and generation across languages. It’s trained specifically on code-heavy data and optimized for developer workflows.

Good at: Code completion, bug fixing, refactoring, test generation, multi-file edits.
Not ideal for: General conversation or non-code tasks (use V3.2).
Pick over alternatives when: Code quality matters more than general knowledge. For mixed code-and-chat workflows, V3.2 or Kimi K2.5 may be more versatile.


Pick: Kimi K2.5

Kimi K2.5 was designed for agentic use. It has native tool-calling support, runs at 334 t/s (the fastest model we serve), and scores 50.2% on HLE when using tools — matching R1’s reasoning-only score.

The speed matters for agents. Each tool call is a round trip: the model generates a function call, the tool executes, the result goes back to the model. At 334 t/s and 0.31s TTFT, Kimi completes multi-step agent loops in seconds where slower models take minutes.

Good at: Tool use, function calling, multi-step task execution, fast iteration loops.
Not ideal for: Pure reasoning without tools (R1 is better). Code-only tasks (Qwen3 Coder is more specialized).
Pick over alternatives when: Your application involves tool calling, API interactions, or multi-step agent orchestration where speed compounds.


Pick: Llama 4 Scout

Llama 4 Scout is Meta’s mixture-of-experts multimodal model: 109B total parameters, with 17B active per token across 16 experts. It handles text and images natively, making it the pick for tasks that require visual understanding alongside language.

Good at: Image description, visual Q&A, document understanding, chart interpretation.
Not ideal for: Text-only tasks where you’re paying for vision capability you don’t use (use V3.2).
Pick over alternatives when: Your input includes images. For text-only workloads, other models are more efficient.


Pick: Llama 3.1 8B

At 8 billion parameters, Llama 3.1 8B runs at ~200 t/s with approximately 0.2s time to first token. It’s the right choice for tasks where speed matters more than depth: intent classification, sentiment analysis, entity extraction, content filtering, and request routing.

Good at: Classification, tagging, extraction, routing decisions, simple Q&A, content moderation.
Not ideal for: Complex reasoning, long-form generation, or tasks requiring deep world knowledge.
Pick over alternatives when: You need results in under a second and the task is well-defined. Also ideal as the router model in a multi-model architecture.


Pick: GLM 4.7 Flash

GLM 4.7 Flash delivers competitive quality at fast inference speeds. When DeepSeek V3.2 is more capability than you need — simple conversations, basic summarization, FAQ bots — GLM 4.7 Flash gets the job done efficiently.

Good at: Simple chat, summarization, translation, basic Q&A.
Not ideal for: Complex reasoning or tasks where benchmark-leading quality matters.
Pick over alternatives when: You want good-enough quality with better speed and lower cost than the largest models.


Pick: MiniMax M2.5

MiniMax M2.5 handles long context windows natively. For workloads that involve ingesting large documents, long conversation histories, or extensive codebases, M2.5 maintains coherence across the full context.

Good at: Document analysis, long conversations, large-context summarization.
Not ideal for: Short, simple tasks where context length is irrelevant (use Llama 8B or GLM Flash).
Pick over alternatives when: Your input regularly exceeds what smaller-context models handle well.


Pick: Qwen3 235B

Qwen3 235B is a large mixture-of-experts model that competes across the full benchmark spectrum. When you need the highest possible quality and latency is not the primary constraint, Qwen3 235B delivers.

Good at: Broad capability across reasoning, knowledge, and generation. Strong multilingual support.
Not ideal for: Latency-sensitive applications (large model, slower inference).
Pick over alternatives when: You need top-tier quality and can tolerate higher latency. Good for batch processing and offline tasks.


Pick: BGE Large

BGE Large (BAAI General Embedding) is a well-tested embedding model for retrieval-augmented generation. It performs well on MTEB benchmarks and produces dense vectors suitable for semantic search, document retrieval, and clustering.

Good at: Semantic search, RAG pipelines, document similarity, clustering.
Not ideal for: Generative tasks (it’s an embedding model, not a chat model).
Pick over alternatives when: You need vector embeddings for search or retrieval. Pair it with a generative model for the full RAG pipeline.


What's your task?
|
+-- Need to understand images?
|     YES --> Llama 4 Scout
|
+-- Need step-by-step reasoning? (math, logic, science)
|     YES --> DeepSeek R1 (~30 t/s, but highest reasoning quality)
|
+-- Need tool calling / agent loops?
|     YES --> Kimi K2.5 (334 t/s, native tool use)
|
+-- Need code generation / editing?
|     YES --> Qwen3 Coder (purpose-built for code)
|
+-- Need embeddings for search/RAG?
|     YES --> BGE Large
|
+-- Need a sub-second response?
|     YES --> Llama 3.1 8B (~200 t/s, 0.2s TTFT)
|
+-- Need long context (large documents)?
|     YES --> MiniMax M2.5
|
+-- Need maximum quality, latency flexible?
|     YES --> Qwen3 235B
|
+-- General purpose, good balance?
      YES --> DeepSeek V3.2 (default choice)
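
The same tree can be written as a function that checks each need in order and returns the first match. This is a sketch of the decision order only: the parameter names are illustrative, and the returned strings are display names from this guide, not API model identifiers.

```python
def pick_model(
    images: bool = False,
    reasoning: bool = False,
    tools: bool = False,
    code: bool = False,
    embeddings: bool = False,
    fast: bool = False,
    long_context: bool = False,
    max_quality: bool = False,
) -> str:
    """Walk the decision tree top to bottom; the first matching need wins."""
    checks = [
        (images, "Llama 4 Scout"),
        (reasoning, "DeepSeek R1"),
        (tools, "Kimi K2.5"),
        (code, "Qwen3 Coder"),
        (embeddings, "BGE Large"),
        (fast, "Llama 3.1 8B"),
        (long_context, "MiniMax M2.5"),
        (max_quality, "Qwen3 235B"),
    ]
    for needed, model in checks:
        if needed:
            return model
    return "DeepSeek V3.2"  # general-purpose default


print(pick_model(code=True))  # Qwen3 Coder
```

Order matters: a task that needs both tools and code resolves to the agent model because that branch comes first in the tree.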

You don’t need ten models to cover most workloads.

Llama 3.1 8B handles 60% of requests. Classification, routing, simple Q&A, extraction, content filtering. Fast and cheap.

DeepSeek V3.2 handles 30%. General chat, complex instructions, knowledge-intensive tasks. The reliable all-rounder.

Specialized models handle the last 10%. R1 for hard reasoning. Kimi K2.5 for agent loops. Qwen3 Coder for code. BGE Large for embeddings.

Start with Llama 8B + V3.2. Add specialists only when you have evidence that general models aren’t performing on specific task categories. Measure first, specialize second.


This guide is provider-agnostic. CheapestInference serves a focused lineup — Kimi K2.6, GLM 4.7, and MiniMax M2.5 — through a single OpenAI- and Anthropic-compatible API. If you want unlimited inference during your reserved hours, see how time-block pools work.

Sources: Artificial Analysis Leaderboard · SWE-bench Leaderboard · Kimi K2.5 Benchmarks · DeepSeek V3.2 · HLE Leaderboard · MMLU-Pro Leaderboard · MTEB Leaderboard