What it takes to build your own LLM inference platform
If you’re building a SaaS that needs to give users access to LLMs, you have two options: build the infrastructure yourself, or use a platform that does it for you. Here’s what “build it yourself” actually looks like.
This isn’t theoretical. We built this. Here’s every component, what it does, and what alternatives exist.
0. Model access — the first problem
Before you write a single line of code, you need access to models.
Self-host on your own hardware: Buy GPUs, rent datacenter space, run the models yourself. Full control, best unit economics at scale — but massive upfront cost and you’re limited to the models you can afford to deploy. Running DeepSeek V3.2 requires multiple high-end GPUs. Running 70+ models? You’d need a data center.
Rent infrastructure: Use GPU clouds like Vast.ai, AWS, Hetzner, CoreWeave, or Lambda. No hardware to buy, but you still manage deployments, scaling, and failover. Costs add up fast — a single H100 runs $2-4/hr.
Use an inference provider: Sign agreements with providers like DeepInfra, Together.ai, or Fireworks, which already have the models deployed. Pay per token, no GPU management. But you depend on their availability, pricing, and terms. If they change prices or drop a model, you need a plan B.
Mix: Most serious platforms end up here. Own hardware for high-volume models where the unit economics justify it, rented GPUs for burst capacity, and provider agreements for the long tail of models nobody runs enough to self-host.
Self-hosting 70+ models on your own is economically unrealistic. The real question is where to draw the line between own infra, rented compute, and providers.
1. Serving engine
If you self-host or rent GPUs, you need software to serve the models:
- vLLM — most popular, good throughput, active community
- TGI (Text Generation Inference) — Hugging Face’s solution, solid for single-model deployments
- TensorRT-LLM — NVIDIA’s optimized engine, best raw performance but harder to set up
- SGLang — newer, fast, good for structured generation
You’ll also need to handle model weights, quantization, scaling across GPUs, and failover when a node goes down. This is a full-time ops job.
2. API proxy layer
Your users shouldn’t hit the inference backend directly. You need a proxy that:
- Translates between API formats (OpenAI, Anthropic)
- Routes requests to the right model/provider
- Injects authentication
- Handles retries and failover
- Strips provider headers so users don’t know your backend
Options:
- Build from scratch with Express/Fastify + http-proxy-middleware
- Use an open-source gateway: LiteLLM, Portkey, Kong AI Gateway, MLflow Gateway
- Use a managed gateway: Helicone, Braintrust, Promptlayer
Each has trade-offs. Open-source gateways give you control but you manage the deployment. Managed gateways are easier but add latency and cost.
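To make the proxy's first two responsibilities concrete, here is a TypeScript sketch of format translation and header stripping. The field names follow the public OpenAI and Anthropic request schemas, but the routing and the list of headers to strip are illustrative, not a complete implementation.

```typescript
// Two proxy responsibilities sketched in isolation: translating an
// Anthropic-style request body into the OpenAI chat-completions shape,
// and stripping provider-identifying headers from responses.

interface AnthropicRequest {
  model: string;
  max_tokens: number;
  system?: string;
  messages: { role: "user" | "assistant"; content: string }[];
}

interface OpenAIRequest {
  model: string;
  max_tokens?: number;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
}

// Anthropic keeps the system prompt in a separate field; OpenAI expects it
// as the first message in the list.
function anthropicToOpenAI(req: AnthropicRequest): OpenAIRequest {
  const messages: OpenAIRequest["messages"] = [];
  if (req.system) messages.push({ role: "system", content: req.system });
  messages.push(...req.messages);
  return { model: req.model, max_tokens: req.max_tokens, messages };
}

// Drop headers that reveal which backend served the request. The exact
// list depends on your providers; these are common examples.
const LEAKY_HEADERS = ["x-request-id", "cf-ray", "server", "via"];

function stripProviderHeaders(headers: Record<string, string>): Record<string, string> {
  return Object.fromEntries(
    Object.entries(headers).filter(([k]) => !LEAKY_HEADERS.includes(k.toLowerCase()))
  );
}
```

A real proxy also has to translate in the other direction and handle streaming chunks, which is where most of the edge cases live.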
3. Authentication
Two layers:
User auth (dashboard login)
- Firebase Auth, Auth0, Clerk, Supabase Auth, or roll your own
- Supports email, Google, GitHub, wallet signatures
API key auth (inference requests)
- Generate API keys per user
- Validate on every request before proxying
- Store key metadata (plan, rate limits, owner)
This is where it gets interesting for platforms. You need per-key plans — each key with its own rate limits and usage tracking. Most auth solutions don’t do this out of the box. You’ll need a custom key management layer.
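A minimal sketch of that custom key management layer, in TypeScript: each key record carries its own plan and limits. The plan names and limit values are made up for illustration; a real implementation would store a hash of the key (never the key itself) in a database, not a `Map`.

```typescript
// Per-key metadata: the plan and limits live on the key, not on a global tier.

interface KeyRecord {
  owner: string;
  plan: "free" | "pro" | "enterprise";
  rpmLimit: number; // requests per minute
  tpmLimit: number; // tokens per minute
  revoked: boolean;
}

// Stand-in for a database table of (hashed key -> metadata).
const keys = new Map<string, KeyRecord>();

function createKey(key: string, owner: string, plan: KeyRecord["plan"]): void {
  // Illustrative per-plan limits; each key gets its own copy.
  const limits = {
    free: { rpm: 20, tpm: 40_000 },
    pro: { rpm: 200, tpm: 400_000 },
    enterprise: { rpm: 2_000, tpm: 4_000_000 },
  }[plan];
  keys.set(key, { owner, plan, rpmLimit: limits.rpm, tpmLimit: limits.tpm, revoked: false });
}

// Called by the proxy on every request, before anything is forwarded upstream.
function validateKey(key: string): KeyRecord | null {
  const record = keys.get(key);
  return record && !record.revoked ? record : null;
}
```

The important property is that `validateKey` returns the plan metadata in the same lookup, so the rate limiter and usage tracker downstream don't need a second query.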
4. Rate limiting
Per-key rate limiting with at least:
- RPM (requests per minute)
- TPM (tokens per minute)
- Budget caps (dollar amount per time window)
This needs to be enforced at the proxy layer, before the request hits the inference backend. Otherwise a single user can exhaust your GPU allocation.
Options:
- Redis-based counters (most common)
- Token bucket algorithms
- Proxy-level enforcement (some gateways include this)
If you’re using per-key plans, each key needs its own set of limits. Not one global limit — individual limits per key.
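As a concrete sketch of the token-bucket option, here is an in-process per-key RPM limiter in TypeScript. A production version would keep the counters in Redis so limits survive restarts and apply across proxy instances; the capacity and refill numbers below are illustrative.

```typescript
// Token bucket per key: each key gets `capacity` burst requests, refilled
// continuously at `refillPerMs` tokens per millisecond.

interface Bucket {
  tokens: number;
  lastRefill: number; // ms timestamp of the last refill calculation
}

class PerKeyLimiter {
  private buckets = new Map<string, Bucket>();

  constructor(private capacity: number, private refillPerMs: number) {}

  // Returns true if the request is allowed, false if the key is over its limit.
  allow(key: string, now: number): boolean {
    const b = this.buckets.get(key) ?? { tokens: this.capacity, lastRefill: now };
    // Refill proportionally to elapsed time, capped at capacity.
    b.tokens = Math.min(this.capacity, b.tokens + (now - b.lastRefill) * this.refillPerMs);
    b.lastRefill = now;
    if (b.tokens < 1) {
      this.buckets.set(key, b);
      return false;
    }
    b.tokens -= 1;
    this.buckets.set(key, b);
    return true;
  }
}
```

TPM limiting follows the same shape, but deducts the request's token count instead of 1, which means you only know the true cost after the response: most implementations deduct an estimate up front and reconcile afterwards.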
5. Usage tracking and billing
You need to know:
- How many tokens each key consumed (input + output)
- What model was used
- Cost per request
- Aggregate usage per user, per day, per billing period
For subscription billing:
- Stripe for card payments
- Budget windows (e.g., $X per 5-hour period)
- Automatic key revocation when subscription expires
For pay-as-you-go:
- Credit balance per user
- Deduct per request based on token count × model price
- Top-up flow (Stripe, crypto, etc.)
For crypto payments:
- USDC on a supported chain
- On-chain transaction verification
- Wallet connector in the dashboard (wagmi, viem, etc.)
This is a significant amount of code. Usage tracking alone requires intercepting every response to count tokens, calculating cost based on the model’s pricing, and storing it per key.
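The core arithmetic of that interception step can be sketched in a few lines. Prices here are placeholders, not real provider rates; the only real convention is that model pricing is usually quoted per million tokens, with input and output priced separately.

```typescript
// Per-request cost: token counts from the response, multiplied by the
// model's per-token prices.

interface ModelPricing {
  inputPerMTok: number;  // USD per 1M input tokens
  outputPerMTok: number; // USD per 1M output tokens
}

function requestCost(p: ModelPricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens * p.inputPerMTok + outputTokens * p.outputPerMTok) / 1_000_000;
}

// Aggregate usage per key: the number a billing period sums over.
function addUsage(totals: Map<string, number>, key: string, cost: number): void {
  totals.set(key, (totals.get(key) ?? 0) + cost);
}
```

In practice you would store costs as integer micro-dollars rather than floats, and the hard part is not this arithmetic but getting reliable token counts out of streaming responses.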
6. Dashboard
Your users need a web UI to:
- Create and manage API keys
- View usage per key (tokens, requests, cost)
- Subscribe to plans or top up credits
- See available models and pricing
Tech stack typically:
- React/Next.js/Vue frontend
- REST API backend
- Real-time usage updates
For platforms (your users creating keys for their users), you also need a management API — programmatic key creation, plan assignment, usage queries.
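To show the shape of such a management API, here is a small TypeScript sketch of building a key-creation request. The route and payload fields are hypothetical, invented for illustration; the point is only that key creation takes a plan assignment programmatically, so your backend can provision keys for your users without a human in the loop.

```typescript
// Hypothetical management API client: build the request the dashboard
// backend would send to create a key with a plan attached.

interface CreateKeyParams {
  userId: string;
  plan: string;
  label?: string;
}

function buildCreateKeyRequest(baseUrl: string, params: CreateKeyParams) {
  return {
    method: "POST" as const,
    url: `${baseUrl}/v1/keys`, // hypothetical route
    body: JSON.stringify({
      user_id: params.userId,
      plan: params.plan,
      label: params.label ?? null,
    }),
  };
}
```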
7. Model catalog management
Models change. New ones come out weekly. You need:
- A catalog of which models you serve
- Pricing per model (input/output cost per token)
- Sync mechanism to update prices when providers change them
- Display names, categories, tags for the dashboard
- Prompt-cache pricing metadata (some models offer discounted rates for cached input tokens)
This is an ongoing operational burden, not a one-time setup.
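A sketch of the data side of that burden: the record you keep per model, plus a sync step that folds updated provider prices into your catalog. Field names and values are illustrative.

```typescript
// One catalog entry per served model.

interface CatalogEntry {
  id: string;                  // model ID users send in requests
  displayName: string;
  tags: string[];
  inputPerMTok: number;        // USD per 1M input tokens
  outputPerMTok: number;       // USD per 1M output tokens
  cachedInputPerMTok?: number; // discounted price when prompt caching applies
}

// Apply a provider price update, returning the IDs that changed so the
// dashboard and billing layer can be told to refresh.
function syncPrices(
  catalog: Map<string, CatalogEntry>,
  updates: { id: string; inputPerMTok: number; outputPerMTok: number }[]
): string[] {
  const changed: string[] = [];
  for (const u of updates) {
    const entry = catalog.get(u.id);
    if (!entry) continue; // unknown model: needs a human decision, not an auto-add
    if (entry.inputPerMTok !== u.inputPerMTok || entry.outputPerMTok !== u.outputPerMTok) {
      entry.inputPerMTok = u.inputPerMTok;
      entry.outputPerMTok = u.outputPerMTok;
      changed.push(u.id);
    }
  }
  return changed;
}
```

The returned change list matters: a silent price change is exactly the kind of thing that erodes margin for weeks before anyone notices.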
8. Documentation
Your users need:
- API reference (endpoints, request/response formats)
- SDK examples (Python, Node.js, at minimum)
- Authentication guide
- Billing/usage documentation
- Quick start guide
This is easily 20-30 pages of documentation that needs to stay current.
9. Monitoring and reliability
- Health checks on the inference backend
- Status page for users
- Alerting when latency spikes or errors increase
- Logging (but not logging prompt content — privacy)
- Graceful degradation when a model or provider is down
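The last point, graceful degradation, reduces to a routing decision the proxy makes on every request. A minimal TypeScript sketch, assuming health state is fed in from the health checks above:

```typescript
// Pick the first healthy provider for a model, in priority order, so one
// provider outage degrades service instead of failing every request.

type Health = Map<string, boolean>; // provider -> currently healthy?

function pickProvider(candidates: string[], health: Health): string | null {
  for (const p of candidates) {
    // Providers with no recorded state are assumed healthy until a check fails.
    if (health.get(p) !== false) return p;
  }
  return null; // everything down: surface a clear 503 rather than hang
}
```

Combined with retries, this is what turns "a provider is down" into a latency blip instead of an outage.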
10. Compliance and privacy
- Privacy policy
- Data handling documentation
- GDPR compliance if you serve EU users
- Decision: do you store prompts? (You shouldn’t)
- SOC 2 / ISO 27001 if targeting enterprise
The full stack
| Component | Ongoing maintenance |
|---|---|
| Inference backend | High — scaling, failover, model updates |
| API proxy | Medium — format changes, new providers |
| Auth + key management | Low |
| Per-key rate limiting | Low |
| Usage tracking + billing | Medium — edge cases, reconciliation |
| Dashboard | Medium — new features, UX |
| Model catalog | High — weekly model updates |
| Documentation | Medium — keep current |
| Monitoring | Low |
| Privacy/compliance | Low |
What breaks in production
Building is the easy part. The hard part is what breaks with real users:
- A provider changes their API format without warning. Your proxy returns 500s for 2 hours until you notice.
- A model gets deprecated. Your users’ hardcoded model IDs stop working overnight.
- Token counting has an off-by-one bug. You’ve been undercharging for 3 weeks. Your margin is gone.
- A user finds a way to exceed rate limits through concurrent requests. Your inference bill spikes 10x in one afternoon.
- Stripe webhook fails silently. A user’s subscription expired but their API key still works. Free inference for a month.
- You push a billing update and break the usage tracking. Three days of missing data. Users open tickets.
Each of these has happened to us. We fixed them. The question is whether you want to fix them yourself, with your users waiting, or use a platform that already has.
The alternative: use an inference platform that already has all of this, create API keys for your users, and ship your product this week.
We built all of the above so you don’t have to. See how per-key plans work.