Cutting my AI agent's token bill in half
Six things that actually moved the needle on a multi-agent pipeline running on OpenRouter.
My weekend project agent was burning through credits faster than it was shipping useful output. After a month of profiling, six changes cut the bill roughly in half without dropping quality.
1. Stop re-sending the system prompt
The biggest single win. My orchestrator was rebuilding a 4,000-token system prompt on every step. With prompt caching turned on, the cached prefix bills at a fraction of the regular input rate. I went from paying full price on every tool call to paying full price once per session.
2. Use cheap models for the boring parts
Routing, classification, JSON parsing, and short summarization don't need a frontier model. A small model handles them at a fraction of the cost. The pipeline now has three tiers: cheap (routing), default (drafting), and premium (final review).
3. Stream and short-circuit
Streaming alone doesn't save money, but combined with a stop condition it does. If the response opens with "I cannot do this" or hits a known bad pattern, I cancel the generation before paying for the rest of the tokens.
4. Shrink tool schemas
JSON-schema for tools sits in the context window on every call. A bloated schema with verbose descriptions inflates every single request. I tightened mine down to one-line descriptions and dropped 600 tokens off every step.
5. Truncate the trajectory window
I was passing the full agent history into every step. Past tool calls and observations from 15 steps ago weren't helping the next decision. A sliding window of the last 4 steps plus a compressed summary worked just as well.
6. Pick the right hosted endpoint
Same model, different providers, real price spread. Worth checking bestaipacks.com for current per-token rates across providers before locking in.
Numbers
| Change | Effective savings |
|---|---|
| Prompt caching | ~35% |
| Tier routing | ~25% |
| Tighter tool schemas | ~8% |
| Trajectory truncation | ~12% |
Compounding effect was bigger than the sum because shorter contexts also made the cheap models more reliable, which let me route more work to them.
Related
If you write prompts for your agents and want a sanity check before shipping them, fixmyprompt.net catches the things that quietly inflate token counts.
Comments
No comments yet — be the first.