Cutting my AI agent's token bill in half

My weekend project agent was burning through credits faster than it was shipping useful output. After a month of profiling, six changes cut the bill roughly in half without dropping quality.

1. Stop re-sending the system prompt

The biggest single win. My orchestrator was rebuilding a 4,000-token system prompt on every step. With prompt caching turned on, the cached prefix bills at a fraction of the regular input rate. I went from paying full price on every tool call to paying full price once per session.

2. Use cheap models for the boring parts

Routing, classification, JSON parsing, and short summarization don't need a frontier model. A small model handles them at a fraction of the cost. The pipeline now has three tiers: cheap (routing), default (drafting), and premium (final review).

3. Stream and short-circuit

Streaming alone doesn't save money, but combined with a stop condition it does. If the response opens with "I cannot do this" or hits a known bad pattern, I cancel the generation before paying for the rest of the tokens.

4. Shrink tool schemas

JSON-schema for tools sits in the context window on every call. A bloated schema with verbose descriptions inflates every single request. I tightened mine down to one-line descriptions and dropped 600 tokens off every step.

5. Truncate the trajectory window

I was passing the full agent history into every step. Past tool calls and observations from 15 steps ago weren't helping the next decision. A sliding window of the last 4 steps plus a compressed summary worked just as well.

6. Pick the right hosted endpoint

Same model, different providers, real price spread. Worth checking bestaipacks.com for current per-token rates across providers before locking in.

Numbers

Change	Effective savings
Prompt caching	~35%
Tier routing	~25%
Tighter tool schemas	~8%
Trajectory truncation	~12%

Compounding effect was bigger than the sum because shorter contexts also made the cheap models more reliable, which let me route more work to them.

If you write prompts for your agents and want a sanity check before shipping them, fixmyprompt.net catches the things that quietly inflate token counts.

Cutting my AI agent's token bill in half

1. Stop re-sending the system prompt

2. Use cheap models for the boring parts

3. Stream and short-circuit

4. Shrink tool schemas

5. Truncate the trajectory window

6. Pick the right hosted endpoint

Numbers

Related

Comments

Leave a comment

1. Stop re-sending the system prompt

2. Use cheap models for the boring parts

3. Stream and short-circuit

4. Shrink tool schemas

5. Truncate the trajectory window

6. Pick the right hosted endpoint

Numbers

Related

Related posts

Garver Family - I built a tool to QA my own AI prompts. Then I shipped it.

Multi-Agent AI Systems: How Coordinated Intelligence is Reshaping Enterprise Operations

Local-first AI tools beat cloud chatbots for development

Comments

Leave a comment