OpenClaw is free. Running it is not.

Mar 24, 2026

OpenClaw has 247,000 GitHub stars. It’s free, open-source, and runs locally. You install it, point it at an LLM, and it writes code, browses the web, queries databases, and executes files on your behalf.

The agent is free. The inference is not.

Every time OpenClaw calls a model, it re-sends the entire conversation history: every tool output, every file it read, every intermediate result. By iteration 20 of a typical task, the input context is 30,000+ tokens. By iteration 40, it’s past 100,000. And it sends all of that with every single request.

This is not a bug. It’s how agents work. And it’s why running OpenClaw on pay-per-token APIs costs $300–600/month for active users — sometimes more.

Where the tokens go

We broke down token consumption for a typical OpenClaw coding task: “add authentication to an Express API.” The agent completed it in 38 tool calls.

Category	Tokens
Context accumulation	~280K
System prompt (×38)	~156K
Tool outputs (files, etc.)	~70K
Agent output	~19K

Total: ~525,000 tokens for a single task. The agent’s actual output — the code it wrote — was 19K tokens. The other 96% is overhead.

On Claude Opus at $15/M input + $75/M output, that single task costs $9.18. Run five tasks a day and you’re at $1,377/month.

On DeepSeek V3.2 via a pay-per-token provider at $0.27/M input + $1.10/M output, the same task costs $0.16. Better — but 20 tasks a day is still $96/month, and that’s one agent.
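The per-task figures can be roughly reproduced from the breakdown above. This is a minimal sketch, assuming everything except the agent's own output is billed as input, no prompt-caching discount applies, and a month has 30 days. Because the breakdown is rounded, the Opus result comes out near $9.02 rather than the $9.18 quoted above; the DeepSeek result rounds to the same $0.16.

```python
# Token counts from the breakdown above (rounded, so results are approximate).
INPUT_TOKENS = 280_000 + 156_000 + 70_000  # context + system prompt + tool outputs
OUTPUT_TOKENS = 19_000                     # what the agent actually wrote


def task_cost(input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one task at the given per-million-token prices."""
    total = INPUT_TOKENS * input_price_per_m + OUTPUT_TOKENS * output_price_per_m
    return total / 1_000_000


def monthly_cost(cost_per_task: float, tasks_per_day: int, days: int = 30) -> float:
    """Monthly spend, assuming the same cost for every task and a 30-day month."""
    return cost_per_task * tasks_per_day * days


opus = task_cost(15.00, 75.00)     # Claude Opus prices quoted above
deepseek = task_cost(0.27, 1.10)   # DeepSeek V3.2 prices quoted above

print(f"Opus: ${opus:.2f}/task, DeepSeek: ${deepseek:.2f}/task")
print(f"DeepSeek at 20 tasks/day: ${monthly_cost(round(deepseek, 2), 20):.0f}/month")
```

Output tokens are under 4% of the volume but about 16% of the Opus bill, because output is priced at five times the input rate.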

The three cost traps

Three things drive that overhead in OpenClaw:

1. Context grows quadratically

OpenClaw reads files into context. If it reads a 2,000-token file at step 5, that file gets re-sent at steps 6, 7, 8… all the way to 38. That single file read costs 2,000 × 33 remaining steps = 66,000 tokens in re-transmission alone.

Users report session contexts at 56–58% of the 400K context window during normal use. This isn’t a failure mode — it’s the architecture working as designed.
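The re-transmission arithmetic is easy to check. The sketch below is a simplified model, not OpenClaw's actual accounting: it assumes every request re-sends the full history, and the 1,000 tokens added per step in the second function is a hypothetical constant chosen only to show the growth rate.

```python
def retransmission_tokens(item_tokens: int, read_at_step: int, total_steps: int) -> int:
    """Tokens spent re-sending one item on every request after the step that read it."""
    return item_tokens * (total_steps - read_at_step)


def cumulative_input_tokens(tokens_added_per_step: int, total_steps: int) -> int:
    """Total input tokens billed when each request re-sends everything added so far."""
    context = 0
    total = 0
    for _ in range(total_steps):
        context += tokens_added_per_step  # history grows by a constant each step
        total += context                  # and the whole history is billed again
    return total


# The 2,000-token file read at step 5 of a 38-step task.
print(retransmission_tokens(2_000, 5, 38))   # 66000

# Doubling the number of steps roughly quadruples the input bill.
print(cumulative_input_tokens(1_000, 38))    # 741000
print(cumulative_input_tokens(1_000, 76))    # 2926000
```

The per-request context grows linearly, but the billed total is the sum of all those contexts, which is where the quadratic term comes from.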

2. System prompt is a fixed tax

OpenClaw’s system prompt is ~9,600 tokens, and it gets sent with every request. Over 38 tool calls, that’s ~365K tokens in system prompts alone (the task breakdown above came in lower, at ~156K, or about 4,100 tokens per call). You pay this whether the agent does useful work or not.

3. Wrong model for the job

OpenClaw defaults to a single model for everything. But not every tool call needs the same intelligence:

Reading a file and deciding what to edit? Llama 3.1 8B handles this at 200 tokens/sec.
Writing complex authentication logic? A frontier open-weight model like Kimi K2.6 is the right call.
Formatting a config file? Even an 8B model is overkill, but it is still far cheaper than Opus.

We wrote a full guide on this pattern: Building a multi-model architecture. Routing agent requests to the right model can cut costs by 60–80% without reducing output quality.
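As a rough illustration of routing by complexity, here is a minimal sketch that uses a static rule instead of a router model. Everything in it is an assumption made for the example: the tool names, the model identifiers, and the 8,000-token threshold are hypothetical and do not come from OpenClaw or from any provider's catalog.

```python
from dataclasses import dataclass

# Hypothetical model identifiers; real names depend on your provider.
MODEL_BY_TIER = {
    "simple": "llama-3.1-8b",
    "complex": "kimi-k2.6",
}

# Hypothetical tool names that rarely need a frontier model.
SIMPLE_TOOLS = {"read_file", "list_dir", "format_file"}


@dataclass
class ToolCall:
    tool: str
    prompt_tokens: int


def pick_tier(call: ToolCall, complex_threshold: int = 8_000) -> str:
    """Send cheap, small-context steps to the small model and everything else up."""
    if call.tool in SIMPLE_TOOLS and call.prompt_tokens < complex_threshold:
        return "simple"
    return "complex"


def pick_model(call: ToolCall) -> str:
    """Map a tool call to the model identifier for its tier."""
    return MODEL_BY_TIER[pick_tier(call)]


print(pick_model(ToolCall("read_file", 2_000)))    # llama-3.1-8b
print(pick_model(ToolCall("write_code", 2_000)))   # kimi-k2.6
```

A static rule like this is free to evaluate but blunt. A router model that classifies each request, as the guide describes, trades a small per-call cost for better judgment on borderline steps.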

The math on flat-rate vs. pay-per-token

Here’s the comparison for an OpenClaw user running ~20 tasks/day:

Provider	Cost/task	Cost/day (20 tasks)	Cost/month (30 days)
Claude Opus (direct)	$9.18	$183.60	$5,508
GPT-5.4 (direct)	$4.73	$94.60	$2,838
DeepSeek V3.2 (per-token)	$0.16	$3.20	$96
CheapestInference	—	—	from $39/mo

Flat-rate means you don’t care about context accumulation. The 280K tokens of context overhead that make pay-per-token expensive? Irrelevant. The system prompt tax? Doesn’t matter. Your agent can call models 24/7 and the bill is the same.
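One way to read the table is as a break-even question: how many tasks per day before per-token spend passes the flat fee? The sketch below assumes a 30-day month and that the $39 entry price covers your workload. That is the "from" price, so a setup needing more time blocks or more keys would shift the numbers.

```python
import math


def break_even_tasks_per_day(flat_monthly: float, cost_per_task: float, days: int = 30) -> int:
    """Smallest whole number of daily tasks at which per-token spend exceeds the flat fee."""
    return math.floor(flat_monthly / (cost_per_task * days)) + 1


FLAT_MONTHLY = 39.00  # entry price from the table above

for provider, cost_per_task in [
    ("Claude Opus (direct)", 9.18),
    ("GPT-5.4 (direct)", 4.73),
    ("DeepSeek V3.2 (per-token)", 0.16),
]:
    tasks = break_even_tasks_per_day(FLAT_MONTHLY, cost_per_task)
    print(f"{provider}: flat rate is cheaper from {tasks} task(s)/day")
```

Under these assumptions the frontier APIs cross the line at a single task per day, while the cheap per-token option only does so at around nine.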

If you’re running OpenClaw, here’s the setup we see working best:

1. Use open-weight models. Frontier open-weight models like Kimi K2.6 and GLM 4.7 score within a few points of proprietary models on coding benchmarks (the data). The gap doesn’t justify a 50x cost difference.

2. Route by complexity. Don’t send file reads and simple decisions to the same model as complex code generation. A router model costs fractions of a cent per classification. Full guide: Multi-model architecture.

3. Reserve the hours you work. On CheapestInference you reserve one or more daily 8-hour time blocks (Asia-Pacific, Europe, Americas — pick 1–3, all three is full 24/7). During your reserved hours inference is unlimited with no budget cap. One API key per agent, one concurrent request per key. Outside your window, requests return 429 until your block opens again.

4. Handle rate limits automatically. Time blocks mean your agent will hit 429s outside your reserved window — that’s expected. But OpenClaw kills the conversation when it gets a 429. The agent stops, and if you close the dashboard, that conversation is gone.

We built an OpenClaw plugin that fixes this: openclaw-ratelimit-retry. It hooks into agent_end, detects retriable 429s, parks the session on disk, and waits for the budget window to reset. Then it sends chat.send to the original session — resuming the conversation with its full transcript, as if you had typed a message.

openclaw plugins install @cheapestinference/openclaw-ratelimit-retry

plugins:
  ratelimit-retry:
    budgetWindowHours: 8    # matches your CheapestInference 8-hour time block
    maxRetryAttempts: 3     # give up after 3 consecutive 429s
    checkIntervalMinutes: 5 # check every 5 min for ready retries

The plugin is zero-dependency, persists across server restarts, deduplicates by session, and handles edge cases like sub-agents, queue overflow, and corrupted state files. If the retry itself hits a 429, it re-queues automatically. No tokens wasted on re-sending from scratch — the agent picks up exactly where it left off.
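To make the park-and-resume idea concrete, here is a minimal sketch of the bookkeeping involved. It is not the plugin's implementation. The file layout, function names, and the reading of the attempt limit (stop once a session has hit the limit) are assumptions made for illustration, and the actual resume step (sending chat.send to the session) is left out.

```python
import json
from pathlib import Path


def load_parked(state_file: Path) -> dict:
    """Read parked sessions, treating a missing or corrupted file as empty."""
    try:
        data = json.loads(state_file.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    return data if isinstance(data, dict) else {}


def park_session(state_file: Path, session_id: str, retry_at: float) -> None:
    """Record a rate-limited session on disk, keyed by session id so repeats collapse."""
    parked = load_parked(state_file)
    entry = parked.get(session_id, {"attempts": 0})
    entry["attempts"] += 1          # one more 429 seen for this session
    entry["retry_at"] = retry_at    # earliest time (epoch seconds) to try again
    parked[session_id] = entry
    state_file.write_text(json.dumps(parked))


def due_sessions(state_file: Path, now: float, max_attempts: int = 3) -> list[str]:
    """Session ids whose wait has elapsed and that are still under the attempt limit."""
    return [
        session_id
        for session_id, entry in load_parked(state_file).items()
        if entry["retry_at"] <= now and entry["attempts"] < max_attempts
    ]
```

A periodic check (every checkIntervalMinutes in the config above) would call due_sessions with the current time and resume whatever it returns. Keeping the state in a file is what lets parked sessions survive a restart.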

This turns a 429 from “your agent crashes” into “your agent naps and wakes up.” Set it up once and forget about it.

5. Consider unlimited time blocks. If your agent runs more than a few tasks per day, per-token pricing works against you. Every token of context overhead is money. With an unlimited time-block subscription, context overhead is free during your reserved hours — re-send the full window, let the agent work without a budget cap.

The irony

OpenClaw is free because the code runs on your machine. But the valuable part — the intelligence — runs on someone else’s GPUs. The agent framework is the cheap part. Inference is the expensive part.

Open-source models on flat-rate infrastructure flip this equation. The models are free. The inference is flat. The only variable cost left is your time.

Point your OpenClaw base_url at https://api.cheapestinference.com/v1 and find out what unconstrained agents actually cost: nothing more than you already budgeted.