Token optimization with Claude — how we cut cost 8× without losing quality

Prompt caching, the right model, well-typed tool use, and short contexts. Four patterns that turn a three-digit monthly bill into a two-digit one.

When we shipped the first version of Agentikas's AI writer, every generated post cost 25 cents. Three versions per post (blog, LinkedIn, X), translation to English, brand review — when an author published twice a week in two languages, the monthly bill was approaching three digits per user. For a non-profit platform that's free at the base tier, that wasn't sustainable.

Today the same post costs 3 cents. Quality went up, not down. These are the four patterns that made the difference.

1. Prompt caching for skills and BRAND

The first mistake was trivial but expensive. Every generation sent the following as context:

  • The blog's BRAND.md (~1,500 tokens)
  • The matching skill — blog-writer.md, linkedin-writer.md, x-writer.md (~2,000 tokens each)
  • The design-system.md (~3,000 tokens)
  • The user prompt (~200 tokens)

Total: ~6,700 input tokens per generation, of which 6,500 were identical across calls. We were paying to resend the same context every time.

The fix is built into Anthropic's API: prompt caching. Mark a system-prompt block as cacheable, and subsequent calls within the next 5 minutes pay ~10% of the input price for that block (the call that writes the cache pays a one-time ~25% premium on it).

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const message = await anthropic.messages.create({
  model: "claude-haiku-4-5",
  max_tokens: 4096,
  system: [
    {
      type: "text",
      text: BRAND_MD + DESIGN_SYSTEM_MD + BLOG_WRITER_SKILL,
      cache_control: { type: "ephemeral" }, // ← the magic line
    },
  ],
  messages: [{ role: "user", content: userPrompt }],
});

First generation: pays the cache write (a ~25% premium on the cached block). Subsequent generations within 5 minutes: ~10% of the cached input price. When an author generates blog + LinkedIn + X plus their translations in 30 seconds, that's 6 calls, and the savings from caching alone hit 50%.

One important note: the cache TTL is 5 minutes by default. If the author pauses for half an hour and comes back, the cache evaporates and you pay full price again. For batch flows (publishing all queued posts in one go), grouping generations close together in time gets the most out of the cache. It's worth verifying that hits actually happen, as the sketch below shows.
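The API response makes this easy to check: its usage block reports cache activity per call. A minimal sketch, reusing the message response from the snippet above (the logging and labels are ours; the usage fields come from Anthropic's SDK):

console.log({
  cacheWrites: message.usage.cache_creation_input_tokens, // > 0 → this call wrote the cache
  cacheReads: message.usage.cache_read_input_tokens,      // > 0 → cache hit, ~10% input price
  freshInput: message.usage.input_tokens,                 // uncached input tokens
});

If cacheReads stays at zero across a batch, the calls are probably falling outside the TTL window, or the cached block isn't byte-identical between requests.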

2. The right model for each task

The second mistake: we used Claude Sonnet 4.6 for everything. Post generation, brand review, translation, slug generation. Sonnet is the quality/price sweet spot for creative tasks; it's overkill for turning "How to build an AI-first blog" into a URL-safe slug.

The change: a controlled downgrade to Haiku 4.5 for structured tasks, keeping Opus or Sonnet only for creative ones.

Task                                   Model       Why
Generate post body                     Sonnet 4.6  Writing quality, voice, narrative
Adapt to LinkedIn / X                  Haiku 4.5   Short reformulation, high volume
Generate slug, meta description, tags  Haiku 4.5   Structured output, tight constraints
Translate full post                    Sonnet 4.6  Translation quality, terminology
Brand review (LLM, optional)           Haiku 4.5   Validation, not creation

Haiku is ~5× cheaper per token than Sonnet. For tasks where creative quality adds no value, the downgrade is invisible to the user and enormous on the bill. In code, the routing is little more than a lookup table, as sketched below.
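A minimal sketch of the routing (the task names and the Sonnet model id are illustrative, following the table above; this is not the Agentikas source):

type Task = "post" | "adapt" | "metadata" | "translate" | "review";

// Map each task to the cheapest model that does it well.
const MODEL_FOR_TASK: Record<Task, string> = {
  post: "claude-sonnet-4-6",      // writing quality, voice, narrative
  adapt: "claude-haiku-4-5",      // short reformulation, high volume
  metadata: "claude-haiku-4-5",   // slug, meta description, tags
  translate: "claude-sonnet-4-6", // translation quality, terminology
  review: "claude-haiku-4-5",     // validation, not creation
};

const model = MODEL_FOR_TASK["metadata"]; // → "claude-haiku-4-5"

Centralizing the mapping also makes future downgrades (or upgrades) a one-line change instead of a hunt through every call site.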

3. Strict tool use = zero retries

When a generation came back malformed (markdown where you expected HTML, a stray unescaped quote), the code retried with an extra "please return it in the correct format" prompt. That doubled that generation's cost. At one retry in twenty it sounded like noise; with prolific users it added up to 5% of the whole bill.

The pattern that kills it is tool use with a strict JSON Schema (covered in another prompt-engineering post): define the exact shape of the output, the model commits to respecting it, and malformed retries drop to zero.

Beyond the retry savings, the model answers faster when it knows the exact output shape: latency dropped 20–30% in our measurements. And since output tokens are billed, every decorative token the schema suppresses is money saved.
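A minimal sketch of the pattern, assuming a hypothetical emit_post tool whose schema mirrors what our pipeline needs (the tool name and fields are illustrative):

const res = await anthropic.messages.create({
  model: "claude-haiku-4-5",
  max_tokens: 2048,
  tools: [
    {
      name: "emit_post",
      description: "Return the adapted post in its final shape.",
      input_schema: {
        type: "object",
        properties: {
          html: { type: "string", description: "Post body as HTML, never markdown" },
          slug: { type: "string", description: "URL-safe slug" },
          tags: { type: "array", items: { type: "string" } },
        },
        required: ["html", "slug", "tags"],
      },
    },
  ],
  // Forcing the tool means the reply can only be a schema-conformant call.
  tool_choice: { type: "tool", name: "emit_post" },
  messages: [{ role: "user", content: userPrompt }],
});

// The validated payload arrives as the tool call's input.
const block = res.content.find((b) => b.type === "tool_use");
if (block && block.type === "tool_use") {
  console.log(block.input); // { html, slug, tags }, shaped by the schema
}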

4. Smart batching and parallelization

When an author clicks "publish," the platform generates three content versions. We used to do this sequentially: generate blog → done → generate LinkedIn → done → generate X. Total time: ~90 seconds.

The change was parallelizing the three generations. Each is independent — same BRAND and design-system, different skill. Fire all three in parallel:

const [blog, linkedin, x] = await Promise.all([
  generateBlog(topic, brand),
  generateLinkedIn(topic, brand),
  generateX(topic, brand),
]);

Total time: ~30 seconds (the slowest of the three). The user doesn't wait. And there's a non-obvious detail: the three calls can share the BRAND and design-system cache when they land within the same 5-minute window, so one pays the cache write and the other two pay ~10% of the cached input. Parallelization + cache = compound win.
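One caveat, per Anthropic's caching docs: a cache entry only becomes available once the response that creates it has started, so three calls fired at exactly the same instant can all miss and each pay the write premium. A defensive variant (same function names as above) warms the cache with one call before fanning out:

// Warm the cache with the longest generation, then parallelize the rest.
const blog = await generateBlog(topic, brand); // pays the one cache write
const [linkedin, x] = await Promise.all([
  generateLinkedIn(topic, brand), // ~10% input price on the cached block
  generateX(topic, brand),
]);

You trade a little wall-clock time for guaranteed cache hits; for a six-call batch, the math usually favors the warm-up.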

The result, in a monthly bill

For an active author (3 posts/week, two languages, all platforms):

  • Before: ~$45/month in Claude per user
  • Now: ~$5/month in Claude per user

That means the platform can absorb much higher user volumes without changing the pricing model (free at the base tier, sustainable for Agentikas Labs as a non-profit).

The savings didn't come from a single trick; they came from four independent patterns that compose. Any project using Claude's API at scale should have all four turned on. The difference between having one and having all four is the 8× factor that separates a sustainable bill from one that isn't.


Prompt-caching techniques are documented in Anthropic's official docs. The Agentikas runtime that applies them is in @agentikas/ai, part of the open-source monorepo.
