
The real cost of running AI agents in production

Chatbots are cheap. Agents are not.

A chatbot sends a user message, gets a response, displays it. Maybe 2,000 tokens per exchange. An agent reads files, calls tools, retries on errors, re-sends the entire conversation every step, and does this 20–60 times per task. Same API, completely different economics.

If you’re budgeting for AI agents the same way you budget for a chatbot, you’re underestimating by 10–50x.


We measured token consumption across four workload types, each running for one hour:

Workload                   Tokens per hour
Coding agent (OpenClaw)    ~2.1M
Research agent (CrewAI)    ~1.2M
RAG chatbot                ~200K
Simple chatbot             ~40K

The coding agent consumed 52x more tokens than a simple chatbot in the same time period. And this is normal — the agent was doing useful work the entire time.


Three architectural properties of agents make them expensive:

Context accumulation. Every agent step appends tool outputs to the conversation. The LLM re-processes the entire conversation on each step. If the agent reads a 3,000-token file at step 5, that file gets re-sent at steps 6, 7, 8… all the way to the end.

For a 40-step task, one file read costs: 3,000 tokens × 35 remaining steps = 105,000 tokens in re-transmission.

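Each step re-sends everything that earlier steps added. Every tool output is therefore paid for once for each step that follows it. This is why total agent token consumption grows quadratically with the number of steps, not linearly.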
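A simplified model makes the growth rate explicit. It assumes that every request carries a fixed overhead of s tokens (the system prompt). It also assumes that each step appends a constant c tokens of tool output to the conversation. Both are modelling assumptions, not measurements. Under them, step k sends the overhead plus the output of the k − 1 earlier steps:

```latex
\begin{aligned}
T_k &= s + (k-1)\,c, \qquad k = 1, \dots, n \\
T_{\text{total}} &= \sum_{k=1}^{n} T_k = n\,s + c\,\frac{n(n-1)}{2}
\end{aligned}
```

The first term is linear in n. With s = 9,600 (OpenClaw's system prompt) and n = 40, it alone comes to 384,000 tokens. The second term is quadratic: doubling the step count roughly quadruples it. This term is the file re-send calculation generalized to every step, since the output added at step k is re-sent on the n − k steps after it.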

System prompt overhead. Agent frameworks use large system prompts: OpenClaw's is ~9,600 tokens, while CrewAI's varies by agent configuration. This prompt is sent with every request, so over 40 steps the system prompt alone costs 9,600 × 40 = 384,000 tokens.
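The figures in this section can be checked with a small simulation. It is a minimal sketch that assumes each request sends the system prompt plus everything earlier steps appended, and it ignores output tokens. The function name and inputs are illustrative and not part of any framework's API.

```python
def task_input_tokens(system_prompt: int, added_per_step: list[int]) -> int:
    """Total input tokens processed over a task.

    Each request re-sends the system prompt plus every token that earlier
    steps appended to the conversation. Output tokens are ignored.
    """
    total = 0
    context = 0  # tokens accumulated from earlier steps' tool outputs
    for added in added_per_step:
        total += system_prompt + context  # what this step's request carries
        context += added                  # this step's output joins the history
    return total


if __name__ == "__main__":
    steps = 40
    # System prompt only: 9,600 tokens on each of 40 requests.
    print(task_input_tokens(9_600, [0] * steps))  # 384000
    # One 3,000-token file read at step 5, re-sent on steps 6 through 40.
    file_read = [0] * steps
    file_read[4] = 3_000  # index 4 is step 5
    print(task_input_tokens(0, file_read))  # 105000
    # 1,000 tokens appended per step: doubling the steps roughly quadruples the total.
    print(task_input_tokens(0, [1_000] * 40), task_input_tokens(0, [1_000] * 80))  # 780000 3160000
```

Passing a per-step list instead of a constant lets you replay a real trace. Variable-sized tool outputs (a large file read early, small status messages later) show how much more an early read costs than a late one.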

Retry loops. When a tool call fails, the agent retries. Each retry sends the full context plus the error message, so three retries on a 30K-token context waste 90K tokens with no productive output.

Without a retry cap, this loop can run indefinitely. Always bound agents with both a retry cap and a maximum iteration count.
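Here is a minimal sketch of both limits. It assumes a hypothetical step callback that performs one model call plus tool execution, then reports whether the task is done or the tool call failed. The names (run_bounded, StepResult, AgentLimitExceeded) are illustrative and do not come from OpenClaw, CrewAI, or any other framework. The defaults mirror the numbers above: 40 iterations, three retries.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class AgentLimitExceeded(RuntimeError):
    """Raised when an agent hits its iteration or retry ceiling."""


@dataclass
class StepResult:
    done: bool                   # True when the task is finished
    error: Optional[str] = None  # set when the tool call failed


def run_bounded(step: Callable[[Optional[str]], StepResult],
                max_iterations: int = 40,
                max_retries: int = 3) -> int:
    """Drive an agent loop with a hard iteration cap and a retry cap.

    `step` receives the previous error message (or None) so a retry can
    react to it. Returns the number of iterations used.
    """
    consecutive_failures = 0
    last_error: Optional[str] = None
    for iteration in range(1, max_iterations + 1):
        result = step(last_error)
        if result.error is not None:
            consecutive_failures += 1
            # The first failure plus max_retries retries are allowed; one more aborts.
            if consecutive_failures > max_retries:
                raise AgentLimitExceeded(
                    f"step failed {consecutive_failures} times in a row: {result.error}")
            last_error = result.error
            continue
        consecutive_failures = 0  # a success resets the retry budget
        last_error = None
        if result.done:
            return iteration
    raise AgentLimitExceeded(f"no result after {max_iterations} iterations")
```

Counting consecutive failures, and resetting the count after a success, caps the waste from any single broken tool. The iteration limit caps the task as a whole. With these defaults, a tool that always fails costs at most four full-context requests before the agent stops.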


Assuming one developer running 15 agent tasks per day, 22 working days per month, ~500K tokens per task:

Model                                        Cost/task   Daily (15 tasks)   Monthly (22 days)
Claude Opus 4.6                              $9.18       $137.70            $3,029
Claude Sonnet 4.6                            $2.25       $33.75             $743
GPT-5.4                                      $4.73       $70.95             $1,561
DeepSeek V3.2                                $0.16       $2.40              $53
Qwen 3.5 35B                                 $0.04       $0.60              $13
CheapestInference (one 8-hour daily block)   flat        flat               from $39

A team of 5 developers, each running 15 tasks/day on Claude Opus, spends about $15,147/month. On CheapestInference's flat rate, the same team pays a fixed monthly subscription per seat (from $39 for one reserved daily time block), no matter how many tokens those agents burn. At the entry tier that is $195/month for five seats, roughly 78x less than the Opus bill: well over an order-of-magnitude reduction.
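The table and team figures are plain arithmetic, and this sketch reproduces them from the per-task prices. The per-task costs are copied from the table and already assume ~500K tokens per task. The table rounds monthly values to whole dollars, while the sketch prints cents.

```python
# Per-task costs from the table above (each assumes ~500K tokens per task).
COST_PER_TASK = {
    "Claude Opus 4.6": 9.18,
    "Claude Sonnet 4.6": 2.25,
    "GPT-5.4": 4.73,
    "DeepSeek V3.2": 0.16,
    "Qwen 3.5 35B": 0.04,
}


def spend(cost_per_task: float, tasks_per_day: int = 15,
          working_days: int = 22) -> tuple[float, float]:
    """Return (daily, monthly) per-token spend for one developer."""
    daily = cost_per_task * tasks_per_day
    return daily, daily * working_days


if __name__ == "__main__":
    for model, per_task in COST_PER_TASK.items():
        daily, monthly = spend(per_task)
        print(f"{model}: ${per_task:.2f}/task, ${daily:,.2f}/day, ${monthly:,.2f}/month")

    team = 5
    opus_team = team * spend(COST_PER_TASK["Claude Opus 4.6"])[1]
    flat_team = team * 39  # entry tier, one seat per developer
    print(f"Team on Opus: ${opus_team:,.2f}/month; flat entry tier: ${flat_team}/month")
```

Changing tasks_per_day or working_days shows how quickly per-token spend scales with usage. The flat figure only changes with the number of seats.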


Four strategies to cut agent inference costs

Use open-source models. DeepSeek V3.2 and Qwen 3.5 score within 4 points of GPT-5.4 and Opus on most benchmarks. For coding tasks specifically, DeepSeek V3.2 matches Opus on HumanEval and SWE-bench. Full data: Open-source models are production-ready.

Route steps by difficulty. Not every agent step needs a frontier model. File reads, simple classifications, and formatting don't need 685B parameters. Use a small model for easy steps and a large model for hard ones. Full guide: Building a multi-model architecture.
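Here is a minimal sketch of difficulty-based routing. The step categories, the escalation rule, and the model identifiers are hypothetical placeholders, not names from any provider or framework. Easy steps go to a small model. Hard steps, and any easy step the small model has already failed, go to a large one.

```python
from enum import Enum, auto

# Hypothetical model identifiers: substitute the names your provider exposes.
SMALL_MODEL = "small-model"
LARGE_MODEL = "large-model"


class StepKind(Enum):
    FILE_READ = auto()
    CLASSIFY = auto()
    FORMAT = auto()
    PLAN = auto()
    CODE_EDIT = auto()
    DEBUG = auto()


# The kinds of work described above as not needing a frontier model.
EASY_STEPS = frozenset({StepKind.FILE_READ, StepKind.CLASSIFY, StepKind.FORMAT})


def pick_model(kind: StepKind, prior_failures: int = 0) -> str:
    """Send easy steps to the small model; escalate hard steps, and any
    step the small model has already failed, to the large model."""
    if kind in EASY_STEPS and prior_failures == 0:
        return SMALL_MODEL
    return LARGE_MODEL


if __name__ == "__main__":
    for kind in StepKind:
        print(kind.name, "->", pick_model(kind))
```

Escalating after a failure keeps the cheap model as the default without letting it burn retries on a step it cannot handle.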

Isolate agents with separate API keys. Give each agent its own API key so one runaway agent can't starve the others. On a time-block subscription, each key gets unlimited usage during its reserved hours, so you isolate workloads without juggling per-token allocations.
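One way to enforce per-agent keys is to load them at startup and refuse to run if a key is missing or shared. This is a minimal sketch. The AGENT_<NAME>_API_KEY naming scheme is a hypothetical convention, not something any provider requires.

```python
import os
from dataclasses import dataclass
from typing import Dict, Iterable, Mapping, Optional


@dataclass(frozen=True)
class AgentCredentials:
    agent: str
    api_key: str


def env_var_for(agent: str) -> str:
    # Hypothetical convention: "code-review" -> AGENT_CODE_REVIEW_API_KEY.
    return f"AGENT_{agent.upper().replace('-', '_')}_API_KEY"


def load_agent_keys(agents: Iterable[str],
                    env: Optional[Mapping[str, str]] = None) -> Dict[str, AgentCredentials]:
    """Load one API key per agent, rejecting missing or shared keys."""
    env = os.environ if env is None else env
    creds: Dict[str, AgentCredentials] = {}
    missing = []
    for agent in agents:
        key = env.get(env_var_for(agent))
        if key:
            creds[agent] = AgentCredentials(agent, key)
        else:
            missing.append(env_var_for(agent))
    if missing:
        raise KeyError("missing API keys: " + ", ".join(missing))
    # A shared key would let one runaway agent consume another's capacity.
    if len({c.api_key for c in creds.values()}) != len(creds):
        raise ValueError("each agent needs its own API key")
    return creds
```

Usage: call load_agent_keys(["coder", "researcher"]) with AGENT_CODER_API_KEY and AGENT_RESEARCHER_API_KEY set. Then pass each agent only its own credentials when constructing its client.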

Switch to flat-rate pricing. Per-token pricing penalizes the exact patterns agents use: large contexts, many steps, retries. Flat-rate pricing removes the marginal cost of all three. During your reserved time blocks, your agent can use the full context window and retry freely without increasing the bill. Reserve all three blocks for 24/7 coverage.


Here’s the equation most teams miss:

Agent cost = tokens_per_step × steps × cost_per_token

Most optimization focuses on cost_per_token: switching to a cheaper model. But tokens_per_step is not constant. It grows as context accumulates, which makes total tokens grow quadratically with the number of steps, and steps itself is unpredictable. Optimizing only one variable leaves the other two working against you.

Flat-rate pricing eliminates all three variables from your bill. The cost is the subscription. Period.
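Here is a rough break-even check under this article's own per-task estimates (~500K tokens per task, prices from the cost table). It divides the flat entry price by the per-task cost and rounds up. The sketch is illustrative and assumes one $39 seat replaces one developer's per-token spend.

```python
import math

FLAT_SEAT_MONTHLY = 39.0   # entry tier: one reserved 8-hour block per day
TASKS_PER_MONTH = 15 * 22  # 15 tasks/day over 22 working days = 330


def breakeven_tasks(flat_monthly: float, cost_per_task: float) -> int:
    """Fewest tasks per month at which per-token spend reaches or exceeds
    the flat monthly price."""
    return math.ceil(flat_monthly / cost_per_task)


if __name__ == "__main__":
    for model, per_task in [("Claude Opus 4.6", 9.18),
                            ("Claude Sonnet 4.6", 2.25),
                            ("DeepSeek V3.2", 0.16)]:
        n = breakeven_tasks(FLAT_SEAT_MONTHLY, per_task)
        print(f"{model}: per-token spend passes ${FLAT_SEAT_MONTHLY:.0f} "
              f"at {n} tasks/month (assumed workload: {TASKS_PER_MONTH})")
```

With the table's numbers, Opus crosses the $39 line at 5 tasks, Sonnet at 18, and DeepSeek V3.2 at 244, all below the assumed 330 tasks per month. The per-task figure itself depends on context growth and step count, so treat the result as a rough guide rather than a fixed threshold.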


We serve Kimi K2.6, GLM 4.7, and MiniMax M2.5 with flat-rate, unlimited time-block subscriptions — no token counting, no budget caps during your reserved hours. Reserve 1–3 daily 8-hour blocks from $39/month and your agent’s token consumption never becomes your problem. Get started or see plans.