
Day 4 — When Anthropic blocks you: why every A2A agent needs a local MODE


Every external dependency in your system needs a fallback path. The first time you learn this the hard way, it costs you an afternoon. The second time, it costs you a customer.

Day 1 left the A2A architecture with three services running. Day 2 added the Project Explorer (Tech Lead). Day 3 introduced the MCP + A2A pattern: a Codebase MCP server exposed at :9000 that the Project Mapper consumes to produce structured maps of the repo. Day 4 was going to be the culmination: assemble the Feature Developer (TDD agent), connect the 5 agents into a pipeline against real Claude via the API, and close with a celebratory blog post.

It didn't happen. Day 4 was 4 hours fighting Anthropic billing, an architectural lesson I wish I'd learned earlier, and an Ollama + Qwen 2.5 Coder 32B integration that I now consider permanent infrastructure, not a temporary fallback.

This post is the real path, with the bugs and learnings normally edited out of polished blog posts.


What happened

It was 12:00, the 4-agent pipeline ready in mock mode. Time to flip MODE=live in .env and fire the first real chat with Claude Sonnet 4.6, with the Agentikas blog as the target repo. Estimated cost: $0.10 per chat. I had $5 in "Credit grant" balance per the console, plenty for 50 iterations.

The first npm run chat returned:

{
  "error": "project-mapper failed: LLM/MCP call failed: 400 ...
            Your credit balance is too low to access the Anthropic API."
}

Four hours later, after:

  • 3 clicks on "Buy credits"
  • 2 new API key generations from different workspaces
  • 1 detailed screenshot of the Anthropic console
  • 1 invoice paid hours earlier
  • ~15 direct API probes with curl

…the account was still blocked with the same 400 error.

I ended up discovering things about Anthropic billing that aren't in their public documentation — or are buried deep. I'm leaving them here because you'll trip over at least one if you build something serious on their API.


What I learned about Anthropic billing

Trap 1 — "Credit grant" ≠ paid credits

Anthropic hands out credits for onboarding and promos (typically $5-6 per grant). They show up in your balance, but they are not spendable until you make ONE real paid purchase. It's anti-abuse protection.

The confusing detail: in the invoice history, grants show up with status "Paid". It doesn't mean you paid — it means the grant was "settled" (granted). It's one of the worst-chosen UI strings I've seen in billing software.

Invoice history:
  May 2, 2026 | Credit grant    | Paid                 | US$6.05
  May 1, 2026 | Credit grant    | Expiring May 2, 2027 | US$6.05
  Mar 2, 2026 | Monthly invoice | Paid                 | US$0.00  ← this one is real

The line "Monthly invoice — Paid" is the only one indicating a real charge to your card. The rest are grants that inflate the balance but unlock nothing.

Trap 2 — Orgs are isolated for billing

An API key belongs to one specific organization. So do credits. If you add credits to org A but your key is from org B, the key has no balance, even though the console shows that "your account has $X".

Diagnostic clue: the anthropic-organization-id header in any API response (including error responses) tells you which org the key belongs to:

curl -i https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" ...
# Response headers:
# anthropic-organization-id: 3d101be7-0a14-4457-b836-72b7906ffa61

Compare that UUID with the org where your credits live. If they don't match, generate a new key from the right org.
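
If you'd rather script that check, the same probe fits in a few lines of TypeScript with native fetch. The endpoint and header names are the ones from the curl above; the model id is a placeholder, substitute whichever one you normally call:

// org-probe.ts — which org does this key belong to? (sketch, Node 22 native fetch)
const r = await fetch("https://api.anthropic.com/v1/messages", {
  method: "POST",
  headers: {
    "x-api-key": process.env.ANTHROPIC_API_KEY ?? "",
    "anthropic-version": "2023-06-01",
    "content-type": "application/json",
  },
  body: JSON.stringify({
    model: "claude-sonnet-4-6", // placeholder model id
    max_tokens: 1,
    messages: [{ role: "user", content: "ping" }],
  }),
});

// The header shows up even on 400 responses, which is exactly when you need it.
console.log("status:", r.status);
console.log("org:", r.headers.get("anthropic-organization-id"));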

Trap 3 — Stripe can fail silently due to incomplete address

This was my final clue. After 3 attempts at "Buy credits" where the modal seemed to complete but the red banner persisted, I discovered that the Address field of the payment method was empty.

In the EU/UK, charges without a billing address are automatically rejected by Stripe's anti-fraud engine (Radar) before they ever reach the bank. You don't get a failure email because, as far as Stripe is concerned, the charge never started. Anthropic, on its side, issues a courtesy grant every time the charge silently fails: the balance goes up, the banner persists, the API stays blocked.

The fix: fill in name and full address with postal code, click Update, and go back to Buy credits. That's the difference between a real paid charge and yet another grant.

Trap 4 — Tiers exist but aren't always clear

Anthropic classifies accounts into tiers based on accumulated spend:

  • Tier 1: up to $100/mo
  • Tier 2: up to $500/mo
  • Tier 3+: more

Your tier caps your monthly spend and rate limits. It's usually not the blocker on new accounts, but it explains unexpected ceilings as you scale.


Why this block changed my architecture

After the fourth hour of mentally drafting an email to support@anthropic.com, I grabbed a coffee and looked at the situation with some distance:

My entire A2A system, with 5 agents designed to be independent, was blocked in its entirety by a single external dependency: the Anthropic API.

That's an obvious architectural violation. On Day 1 we established "schema-as-policy" as a principle: business invariants live in schemas, not in wikis. But I forgot the corollary:

Every external dependency of an agent needs an alternative MODE. Mock (canned) isn't enough — there's an intermediate layer missing: same I/O shape, different cognitive engine, no external dependency.

The missing layer needed real cognitive quality without an external provider. Only a local LLM gives you that.


The choice: Ollama + Qwen 2.5 Coder 32B

Two decisions to make: which runtime and which model.

Runtime: Ollama

Ollama is the "Docker of local LLMs": a simple binary that downloads open-source models and exposes them through an OpenAI-compatible HTTP API. That's the key detail: the same OpenAI SDK (or a plain fetch) can talk to your local Ollama.

brew install ollama
brew services start ollama          # runs as a service on :11434
ollama pull qwen2.5-coder:32b       # ~20GB download
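
Before wiring the agents, a quick sanity check that the endpoint answers and the model finished downloading is worth the 30 seconds. A minimal sketch, assuming Ollama's /v1/models route (part of its OpenAI compatibility layer):

// check-ollama.ts — is the server up, and is the model pulled?
const base = process.env.OLLAMA_BASE_URL ?? "http://localhost:11434/v1";

const r = await fetch(`${base}/models`);
if (!r.ok) throw new Error(`Ollama unreachable: ${r.status}`);

const { data } = (await r.json()) as { data: { id: string }[] };
console.log(data.map((m) => m.id));
// expect "qwen2.5-coder:32b" in the list once the pull completes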

Model: Qwen 2.5 Coder 32B

Four options evaluated:

Model              | RAM   | Code quality | M4 Max speed
Llama 3.1 8B       | ~5GB  | medium       | ~80 tok/s
Llama 3.1 70B      | ~40GB | good         | ~10 tok/s
Qwen 2.5 Coder 32B | ~20GB | very good    | ~40 tok/s
Qwen 2.5 Coder 70B | ~40GB | excellent    | ~10 tok/s

Qwen 2.5 Coder 32B wins on code-specific benchmarks (HumanEval, SWE-bench) against Llama 3.1 70B with half the parameters. It's the natural model for our architecture because 4 of 5 agents do code- or structured-reasoning tasks.

On a Mac M4 Max with 48GB RAM:

  • Qwen Coder 32B Q4 (default): ~20GB in use
  • Headroom: ~25GB for Docker (6 containers ≈ 1-2GB) + system + browser
  • Speed: ~30-50 tokens/s on integrated GPU
  • Full 5-step pipeline: ~2-4 minutes vs 30-60s with Anthropic API

It's 4-8× slower, but 100% free and private.


The integration: three MODEs per agent

The architectural change is simple: every LLM agent now has three modes selectable by env var:

INVESTIGATOR_MODE=mock   # canned response (testing/CI)
INVESTIGATOR_MODE=live   # Anthropic SDK (Claude Sonnet/Opus)
INVESTIGATOR_MODE=local  # Ollama via OpenAI-compat API

In each agent's code, the switch:

let textContent: string;
let usage = { input_tokens: 0, output_tokens: 0 };

if (MODE === "mock") {
  textContent = mockReport(topic);
} else if (MODE === "local") {
  const r = await callLocalLLM(skillContent, userPrompt, 2000);
  textContent = r.text;
  usage = r.usage;
} else {
  // live = Anthropic
  const response = await claude!.messages.create({ ... });
  textContent = extractText(response);
  usage = extractUsage(response);
}

And the callLocalLLM function is ~25 lines of pure fetch:

// OLLAMA_URL points at the OpenAI-compat root, e.g. http://host.docker.internal:11434/v1
async function callLocalLLM(
  system: string,
  user: string,
  maxTokens: number,
): Promise<{ text: string; usage: { input_tokens: number; output_tokens: number } }> {
  const r = await fetch(`${OLLAMA_URL}/chat/completions`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({
      model: OLLAMA_MODEL,
      messages: [
        { role: "system", content: system },
        { role: "user", content: user },
      ],
      max_tokens: maxTokens,
      temperature: 0.2,                   // low temperature: near-deterministic output
    }),
    signal: AbortSignal.timeout(180_000), // 3 min: the first call includes model load
  });
  if (!r.ok) throw new Error(`Ollama API ${r.status}: ${await r.text()}`);
  const data = await r.json();
  return {
    text: data.choices[0].message.content,
    // map OpenAI-style usage fields onto the Anthropic-style names the agents expect
    usage: {
      input_tokens: data.usage?.prompt_tokens ?? 0,
      output_tokens: data.usage?.completion_tokens ?? 0,
    },
  };
}

No SDK, no new dependencies. Native fetch in Node 22.
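
For reference, a call site looks like this (the prompt strings are illustrative):

const { text, usage } = await callLocalLLM(
  "You are the Investigator agent. Answer with a structured report.", // system
  `Investigate the topic: ${topic}`,                                  // user
  2000,                                                               // max tokens
);
console.log(usage); // { input_tokens: ..., output_tokens: ... }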

The host.docker.internal trick

The agents run in Docker containers. Ollama runs on the host (your Mac). From inside a container, localhost is the container, not the host. Docker Desktop on macOS exposes the host under the alias host.docker.internal automatically.

That's why the URL in .env:

OLLAMA_BASE_URL=http://host.docker.internal:11434/v1

Transparent. Zero additional network configuration.
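
One caveat if you're not on Docker Desktop: plain Docker on Linux doesn't create the alias by default. There, --add-host=host.docker.internal:host-gateway on docker run (or the equivalent extra_hosts entry in Compose) recreates it.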


Honest tradeoffs

What you gain with local

  • $0 cost: not even $0.10 per chat. With a powerful Mac, infinite free iterations.
  • Real privacy: the repo's code never leaves your machine. Critical for sensitive codebases, NDAs, or strict HIPAA/GDPR.
  • Resilience: your pipeline works without internet, without API keys, without provider status pages going red.
  • Reasonable determinism: with temperature: 0.2, same input ≈ same output. Safe retries.

What you lose vs Claude

  • Cognitive quality: roughly a 30% drop on complex tasks. For code generation with strict TDD, Sonnet 4.6 still wins. Qwen Coder 32B is very good but doesn't match it.
  • Speed: pipeline 4× slower. ~3 min vs ~45s.
  • No native MCP: the Anthropic SDK has MCP integrated; Ollama doesn't. For agents using MCP (Project Mapper in our architecture), you'd implement the agentic loop manually. We leave it mocked when MODE=local.
  • No Anthropic beta features: prompt caching, web search tool, computer use. Nonexistent in open source.

Decision table

Case                                      | Recommended MODE
CI integration tests                      | mock
Live demo, no guaranteed internet         | mock or local
Fast development iteration                | mock (instant) or local (real quality)
End-to-end behavior verification with LLM | local
Final client, max quality                 | live
Repo with strict NDA/GDPR                 | local (no debate)
Anthropic down / billing blocked          | local (what happened today)

The architectural lesson worth the post

Every external dependency of an agent needs three modes: mock, local, live.

  • Mock gives you speed and cost-free CI.
  • Local gives you resilience and real cognitive verification without external dependency.
  • Live gives you maximum quality.

If you only have mock+live, a provider outage paralyzes you. If you only have local+live, your CI tests pay tokens on every PR. You need all three.
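
One way to keep the pattern from scattering switch statements across every agent is to select the engine once at startup. A sketch, where makeLLM is a hypothetical helper and the three callees stand in for each agent's real implementations:

type Mode = "mock" | "local" | "live";
type LLMCall = (system: string, user: string) => Promise<string>;

// Stand-ins for each agent's real implementations:
declare function mockAnswer(user: string): string;
declare function callLocalLLM(system: string, user: string, maxTokens: number): Promise<{ text: string }>;
declare function callClaude(system: string, user: string): Promise<string>;

// Pick the engine once from the env var; the rest of the agent never checks MODE again.
function makeLLM(mode: Mode): LLMCall {
  switch (mode) {
    case "mock":
      return async (_system, user) => mockAnswer(user); // canned, CI-safe
    case "local":
      return async (system, user) => (await callLocalLLM(system, user, 2000)).text; // Ollama
    case "live":
      return async (system, user) => callClaude(system, user); // Anthropic SDK
  }
}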

This doesn't only apply to Anthropic. It applies to:

  • Embeddings services (OpenAI Ada → mock fixed vector → local with nomic-embed-text in Ollama → live)
  • Speech-to-text services (Whisper API → mock canned transcript → local Whisper.cpp → live)
  • Image services (DALL-E → mock placeholder → local Stable Diffusion → live)
  • Web search services (Tavily/Perplexity → mock canned results → local with SearxNG → live)

Any cognitive capability an agent delegates to an external provider deserves this structure.


What this changed in my mental roadmap

Day 4 was originally going to close with the Feature Developer agent working against real Claude. After the block, it closed with:

  • Feature Developer agent ✓ (in mock + local)
  • Three MODEs wired in Investigator, Project Explorer, Feature Developer
  • Project Mapper in mock when MODE=local (needs Anthropic SDK's native MCP, out of scope today)
  • Ollama + Qwen 2.5 Coder 32B setup documented and reproducible

What didn't close yet:

  • Try real live — depends on support@anthropic.com unblocking the account. Email sent. 12-24h wait.
  • Project Mapper in local mode — implementing the agentic loop with manual MCP would be a good Day N. For now, mocked.

How to replicate it

If you have a Mac M-series with 32GB+ RAM and want to run the whole system without spending a cent:

# 1. Clone the repo and add your key (even without balance)
git clone <repo-url>
cd agentic-architecture
cp .env.example .env
# edit .env and put ANTHROPIC_API_KEY=sk-ant-... (placeholder works)

# 2. Switch modes to local
sed -i '' 's/^INVESTIGATOR_MODE=.*/INVESTIGATOR_MODE=local/' .env
sed -i '' 's/^PROJECT_EXPLORER_MODE=.*/PROJECT_EXPLORER_MODE=local/' .env
sed -i '' 's/^FEATURE_DEVELOPER_MODE=.*/FEATURE_DEVELOPER_MODE=local/' .env
echo "OLLAMA_MODEL=qwen2.5-coder:32b" >> .env

# 3. Setup Ollama
brew install ollama
brew services start ollama
ollama pull qwen2.5-coder:32b   # ~20GB, takes 5-15 min

# 4. Build and run
npm run build
npm run up
npm run chat
# you'll see the 5-step pipeline with real LLM, free, no API keys

The first chat will take ~2-4 minutes (Ollama loads the model into RAM on first use; after that, calls start immediately). In the response you'll see model_used: "qwen2.5-coder:32b" per agent, with real usage tokens.


Closing

What's curious about Day 4: the billing block ended up producing more architectural value than the planned happy-path integration. Without the forced pivot to Ollama, I wouldn't have designed the three-MODE pattern. I would have stuck with mock+live and the system would be fragile.

When Anthropic responds to the support ticket and unblocks the account, the change will be 3 lines in .env:

INVESTIGATOR_MODE=live
PROJECT_EXPLORER_MODE=live
FEATURE_DEVELOPER_MODE=live

Restart, and everything runs against real Claude. But I'm keeping local as the default for development, switching to live only when I want to compare final quality.

Day 5 will be exactly that comparison: same prompt, same brief, same feature, first with Qwen Coder 32B local and then with Sonnet 4.6 live. Side-by-side. That's the blog post that comes out of today's block.


Day 4 closes with: 5 A2A agents running, 3 MODEs per LLM agent, architecture resilient to provider outages, and the lesson that the real roadmap is written by obstacles, not plans.

Next up: Day 5 — a side-by-side comparison of local Qwen Coder 32B vs Sonnet 4.6 via the Anthropic API, on the same Agentikas blog feature.
