
Day 1 — From zero to first A2A ping: a multi-agent walking skeleton

We built the functional skeleton of an agent network communicating via A2A from scratch. Three services, a conversational Orchestrator, a Notifier, and an Investigator with a real LLM. Real bugs included.


Walking skeleton: the first version of the system that traverses the whole end-to-end path while doing the bare minimum at each step. — Alistair Cockburn

Today we kicked off a long project: building a network of agents that audit, improve and publish code autonomously, communicating with each other through the Agent-to-Agent (A2A) protocol. The end goal is five specialized agents plus a conversational orchestrator, all independent, agnostic about the stack of the project they audit, and deployable separately.

But today we're not building all five. Today we build the skeleton that traverses the whole end-to-end path with the bare minimum: an Orchestrator, an agent with a real LLM (Investigator), and an agent without LLM (Notifier). Any complexity solved here is complexity that won't surprise us when we add the other three.

This post documents Day 1: the decisions, the real bugs we hit, and why every piece is where it is.


The target architecture

Before writing any code, we drew the destination. The five specialized agents that will eventually form the system:

  • Investigator: investigates a topic (version, library, technology) following a skill in markdown. Returns a structured report.
  • Auditor: compares current code with the Investigator's report. Returns findings, focused on the delta between current and target version.
  • Developer: applies the suggested changes following TDD. Opens a PR on GitHub.
  • Tester: runs regression tests on the merged PR. Needs human approval before starting.
  • Notifier: sends release notes to the subscriber list.

Above them, an Orchestrator — the only one that talks to humans and the only one that knows the full flow. The specialized agents are "dumb about flow, expert in their domain". The Orchestrator is the opposite: "smart about flow, dumb about domain". It's the Coordinator vs. Worker pattern from classic distributed systems.

                    ┌───────────────────────────┐
                    │   Orchestrator Agent      │
                    │   (Chat UI + A2A client)  │
                    └─────────────┬─────────────┘
                                  │ A2A
        ┌──────────┬──────────────┼──────────────┬──────────┐
        ▼          ▼              ▼              ▼          ▼
  Investigator  Auditor       Developer       Tester     Notifier
  :8001        :8002          :8003          :8004       :8005

Every box is an independent process, with its own Agent Card, its own Dockerfile, and its own port. The Orchestrator discovers who's alive by doing a GET on each agent's /.well-known/agent.json.
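What that discovery looks like in code is roughly this (a sketch; the helper and type names are illustrative, not the repo's):

type AgentCard = {
  name: string;
  url: string;
  skills: { id: string; name: string }[];
};

// Sketch: ask every known base URL for its Agent Card; agents that
// don't answer simply don't make it into the registry.
async function discoverAgents(baseUrls: string[]): Promise<AgentCard[]> {
  const results = await Promise.allSettled(
    baseUrls.map(async (base) => {
      const res = await fetch(`${base}/.well-known/agent.json`, {
        signal: AbortSignal.timeout(2_000),
      });
      if (!res.ok) throw new Error(`${base} answered ${res.status}`);
      return (await res.json()) as AgentCard;
    }),
  );
  return results
    .filter((r): r is PromiseFulfilledResult<AgentCard> => r.status === "fulfilled")
    .map((r) => r.value);
}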


Decision 1: TypeScript, not Python

Many people's reflex when designing LLM agents is to go to Python. It's where LangChain, LangGraph, AutoGen, CrewAI live. We actually started the scaffolding in Python with FastAPI + Pydantic, but switched within hours. The reason:

The stack you already have intuition with weighs more than the stack "traditionally used for agents".

An A2A system is, on 95% of its surface, HTTP plumbing: serving endpoints, validating payloads, calling other services, parsing responses. For that, TypeScript in 2026 is fully caught up — the Anthropic SDK is first-class, fetch is native in Node 22+, and frameworks like Hono provide a DX that rivals FastAPI.

What's still true is that experimental research in agents lands in Python first (papers, academic repos). But we're building a production system, not doing research. Keep the language in familiar territory so all the friction comes from the real problem (A2A), not the language.

Equivalences we used

Concept: Python → TypeScript (what we use)
  • Async HTTP server: FastAPI → Hono
  • Validation + types: Pydantic → Zod
  • Runtime: uvicorn → tsx
  • Lock-in: low → low

Hono specifically — and not Express or Fastify — because it was born typed, routes return inferred types, and it runs identically on Node, Deno, Bun, Cloudflare Workers. For a network of agents where each one is a small HTTP service, Hono fits perfectly.
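To make the equivalence concrete, a minimal Hono + Zod service looks roughly like this (a sketch, not the repo's exact server):

import { serve } from "@hono/node-server";
import { Hono } from "hono";
import { z } from "zod";

// Zod plays the Pydantic role: validate the payload, infer the type from the schema.
const TaskRequest = z.object({
  task_id: z.string(),
  skill_id: z.string(),
  payload: z.record(z.unknown()),
});

const app = new Hono();

app.post("/tasks/send", async (c) => {
  // .parse throws on invalid input; a real server maps that to a 400 via app.onError.
  const task = TaskRequest.parse(await c.req.json());
  return c.json({ task_id: task.task_id, state: "completed", result: {} });
});

// tsx plays the uvicorn role: `tsx src/server.ts`
serve({ fetch: app.fetch, port: 8005 });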


Decision 2: walking skeleton before detailed design

We had pseudo-Pydantic schemas written for the five reports (ResearchReport, AuditReport, DevelopmentResult, TestReport, NotificationReceipt). They were pretty. Versioned, with discriminated unions, with ArtifactRef everywhere to avoid embedding heavy reports.

And we tossed them out.

When it's the first time you touch A2A, designing perfect schemas before having anything running is the classic mistake. The questions that really matter aren't answered by design — they're answered by the first curl.

How do you handle tasks that take 60 seconds? How do you do streaming? Where do you store a 200 KB report when one agent passes it to another? What does the Orchestrator do if an agent is down? These questions get answered the moment you watch the first message travel between two services. Before that, they're theory.

Decision: scrap the schemas, build the end-to-end path with two services, allow dict[str, Any] types, and type what you've learned later.


The final stack

agentic-architecture/
├── orchestrator/
│   ├── src/server.ts           Hono + Zod + A2A client
│   ├── agent_card.json
│   ├── package.json
│   ├── tsconfig.json
│   └── Dockerfile
├── agents/
│   ├── investigator/           agent with real LLM (+ skill md)
│   │   ├── src/server.ts
│   │   ├── agent_card.json
│   │   ├── package.json
│   │   ├── tsconfig.json
│   │   └── Dockerfile
│   └── notifier/               LLM-less agent, ideal to start
│       └── (same layout)
├── shared/
│   ├── a2a-client.ts           reusable A2A client
│   └── package.json
├── skills/
│   └── upgrade-research.md     mission passed to the Investigator
├── docker-compose.yml
├── package.json                project scripts (replaces Makefile)
├── .env.example
└── .gitignore

Every agent is an independent service: it has its own package.json, its own Dockerfile, and its own port. Shared code lives in shared/, copied into each container at build time. It's deliberately simple — no npm workspaces, no monorepo tooling. For a walking skeleton, simplicity wins.


Anatomy of the first A2A ping

Let's follow a message from the browser to the Notifier's log.

1. The Agent Card

Every agent exposes an Agent Card at /.well-known/agent.json. It's the agent's "business card": what it can do, what skills it declares, what input/output modes it supports.

{
  "name": "notifier",
  "description": "Sends notifications about releases and updates.",
  "url": "http://notifier:8005",
  "version": "0.1.0",
  "capabilities": {
    "streaming": false,
    "pushNotifications": false,
    "stateTransitionHistory": false
  },
  "defaultInputModes": ["application/json"],
  "defaultOutputModes": ["application/json"],
  "authentication": { "schemes": ["none"] },
  "skills": [
    {
      "id": "send-notification",
      "name": "Send notification to subscribers",
      "description": "Delivers a subject + body to recipients.",
      "tags": ["notification", "email"]
    }
  ]
}

The skills are the important part: they identify what an agent can do. Other agents (in this case the Orchestrator) know what to ask of it by reading this list.

2. The A2A client

The Orchestrator talks to other agents through a common client (shared/a2a-client.ts). In its walking-skeleton form it's ~45 lines:

import { randomUUID } from "node:crypto";

export class A2AClient {
  constructor(private baseUrl: string, private timeoutMs = 60_000) {
    this.baseUrl = baseUrl.replace(/\/$/, "");
  }

  async sendTask({ skillId, payload }: {
    skillId: string;
    payload: Record<string, unknown>;
  }) {
    const taskId = `task-${randomUUID().replace(/-/g, "").slice(0, 12)}`;
    const r = await fetch(`${this.baseUrl}/tasks/send`, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ task_id: taskId, skill_id: skillId, payload }),
      signal: AbortSignal.timeout(this.timeoutMs),
    });
    if (!r.ok) throw new Error(`task failed ${r.status}: ${await r.text()}`);
    return r.json();
  }
}

Deliberate shortcut: the official A2A spec uses JSON-RPC 2.0 over HTTP. We're using plain REST here. The payload shape is virtually identical (task_id, skill_id, payload, state), but REST is 10× easier to debug with curl. Migrating to JSON-RPC is ~30 lines later, when we need it.
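For reference, the same call wrapped in a JSON-RPC 2.0 envelope would look roughly like this (treat the method and params names as approximate and check the A2A spec version you target before migrating):

import { randomUUID } from "node:crypto";

// Rough sketch of the same call wrapped in a JSON-RPC 2.0 envelope.
// Method and params names are approximate; verify them against the A2A spec.
async function sendTaskJsonRpc(
  baseUrl: string,
  skillId: string,
  payload: Record<string, unknown>,
) {
  const body = {
    jsonrpc: "2.0",
    id: randomUUID(),              // JSON-RPC request id, distinct from the task id
    method: "tasks/send",
    params: {
      task_id: `task-${randomUUID().replace(/-/g, "").slice(0, 12)}`,
      skill_id: skillId,
      payload,
    },
  };
  const r = await fetch(baseUrl, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(body),
  });
  const { result, error } = await r.json();
  if (error) throw new Error(`JSON-RPC error ${error.code}: ${error.message}`);
  return result;
}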

3. The Orchestrator's /chat handler

When the browser POSTs to /chat, the Orchestrator fires the pipeline:

app.post("/chat", async (c) => {
  const { text } = await c.req.json();

  // Step 1 — Investigator investigates the topic
  const research = await investigator.sendTask({
    skillId: "research",
    payload: {
      skill_uri: "skills/upgrade-research.md",
      topic: text,
    },
  });

  if (research.state !== "completed") {
    return c.json({ error: research.error }, 502);
  }

  // Step 2 — Notifier "delivers" the summary
  const notification = await notifier.sendTask({
    skillId: "send-notification",
    payload: {
      subject: `Research: ${text}`,
      body: research.result.structured?.summary ?? "(no summary)",
      recipients: ["walking-skeleton@example.com"],
    },
  });

  return c.json({
    user_message: text,
    pipeline: [
      { step: "investigate", task_id: research.task_id, ... },
      { step: "notify", task_id: notification.task_id, state: notification.state },
    ],
  });
});

This handler is the prototype of the playbook engine that will come later: a sequence of steps, each delegating to an agent, chaining outputs as inputs of the next. When all five agents arrive, this gets refactored to YAML playbooks — but the logic doesn't fundamentally change.
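To make that claim concrete, here's the shape the refactor points toward: the same pipeline expressed as a data-driven list of steps. This is an illustrative sketch, not the future playbook format:

import { A2AClient } from "../../shared/a2a-client.js";

// Illustrative only: the handler above reshaped as a data-driven sequence of steps.
// A YAML playbook would deserialize into something like this.
type Step = {
  name: string;
  agent: A2AClient;
  skillId: string;
  // Builds this step's payload from the user input and all previous results.
  buildPayload: (input: string, prev: Record<string, any>) => Record<string, unknown>;
};

async function runPipeline(steps: Step[], input: string) {
  const results: Record<string, any> = {};
  for (const step of steps) {
    const res = await step.agent.sendTask({
      skillId: step.skillId,
      payload: step.buildPayload(input, results),
    });
    if (res.state !== "completed") {
      return { failed_at: step.name, error: res.error, results };
    }
    results[step.name] = res;
  }
  return { results };
}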

4. The task_id propagated end-to-end

Every task the Orchestrator sends gets a unique task_id. That ID travels with the payload, comes back in the response, and shows up in the logs of every agent involved. The consequence: you can grep a task_id across all logs and see the full journey of a request.

docker compose logs | grep task-c1fb1870c755

It's the basis of distributed tracing without OpenTelemetry. When all five agents arrive, this pattern will be the only sane way to debug when a task takes 8 seconds and you need to know who was slow.
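Keeping that discipline costs almost nothing. A minimal sketch (the helper name is ours, not the repo's):

// Hypothetical helper: prefix every log line with the agent name and the task_id,
// so `docker compose logs | grep <task_id>` reconstructs the full journey of a request.
const AGENT_NAME = process.env.AGENT_NAME ?? "orchestrator";

function logTask(taskId: string, message: string) {
  console.log(`[${AGENT_NAME}] task_id=${taskId} ${message}`);
}

// Inside a /tasks/send handler:
//   logTask(task_id, `researching topic="${topic}"`);
//   logTask(task_id, `done tokens_in=${usage.input_tokens} tokens_out=${usage.output_tokens}`);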


Bugs we hit today (and why they matter)

No system runs the first time. These are the three real bugs we found today that will save you hours if you know them before hitting them.

Bug 1 — localhost → 127.0.0.1 in Docker healthchecks

We configured docker-compose healthchecks with the natural URL:

healthcheck:
  test: ["CMD", "wget", "-q", "-O", "-", "http://localhost:8005/health"]

Result: container (unhealthy) forever. Service logs said listening on :8005, netstat confirmed 0.0.0.0:8005 LISTEN, but wget http://localhost:8005 from inside the container itself returned Connection refused.

Cause: in Node 18+, DNS resolution of localhost prioritizes IPv6 (::1) over IPv4 (127.0.0.1). If the server only binds to IPv4 (which Hono does by default), requests to localhost silently fail. The fix is trivial once you know:

test: ["CMD", "wget", "-q", "-O", "-", "http://127.0.0.1:8005/health"]

Extrapolable lesson: in container healthchecks, always literal 127.0.0.1, never localhost. It saves you hours when it happens the first time.
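A complementary server-side mitigation, assuming the services run on @hono/node-server, is to make the binding explicit instead of relying on the default:

import { serve } from "@hono/node-server";
import { Hono } from "hono";

const app = new Hono();
app.get("/health", (c) => c.json({ ok: true }));

// Making the binding explicit removes the ambiguity entirely:
// 0.0.0.0 keeps the IPv4-only behaviour; "::" also accepts IPv6 (and thus localhost → ::1).
serve({ fetch: app.fetch, port: 8005, hostname: "0.0.0.0" });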

Bug 2 — TypeScript ESM cross-folder without package.json in each folder

We started with shared/ as a flat folder, no package.json. The Orchestrator imported from it with:

import { A2AClient } from "../../shared/a2a-client.js";

Result: SyntaxError: The requested module '../../shared/a2a-client.js' does not provide an export named 'A2AClient'.

Cause: with "type": "module" in the Orchestrator's package.json but no package.json in shared/, Node and tsx get confused about how to interpret the imported module (ESM or CJS?). The error is misleading because it looks like the export doesn't exist — but really the module is being loaded in the wrong mode.

Fix: add a minimal package.json to shared/ explicitly declaring the mode:

{
  "name": "@agentikas/shared",
  "version": "0.1.0",
  "private": true,
  "type": "module"
}

Extrapolable lesson: in TS monorepos with ESM, any folder that contains importable code needs its own package.json, even if just to declare "type": "module". Not optional, it's the line that separates "works" from "obscure resolution errors".

Bug 3 — LLM errors as HTTP 500 with stack trace instead of state: failed

When Anthropic rejected our first real request (billing problem, see below), the Investigator returned HTTP 500 with a raw JSON stack trace to the Orchestrator. The Orchestrator in turn surfaced it as another 500 to the browser. A cascade of unstructured errors.

The A2A protocol explicitly defines a state: "failed" with a readable error field. The right way:

try {
  const response = await claude.messages.create({ ... });
  // ...
} catch (err) {
  const message = err instanceof Error ? err.message : String(err);
  return c.json({
    task_id,
    state: "failed" as const,
    error: `LLM call failed: ${message}`,
  });
}

Extrapolable lesson (and one of the most important of the day): an agent's response schemas aren't just "shape" — they're policy. The Orchestrator must be able to read state reliably for any request, regardless of whether the cause was success, LLM failure, rate limit, or billing exhausted. Schema-as-policy turns a business rule ("always return in A2A format") into an invariant impossible to bypass.
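What schema-as-policy can look like with Zod is a single discriminated union that every response must pass through before it leaves the process. A sketch with illustrative names:

import { z } from "zod";

// Every /tasks/send response is one of these two shapes; nothing else leaves the agent.
const TaskResponse = z.discriminatedUnion("state", [
  z.object({
    task_id: z.string(),
    state: z.literal("completed"),
    result: z.record(z.unknown()),
  }),
  z.object({
    task_id: z.string(),
    state: z.literal("failed"),
    error: z.string(),             // human-readable, never a raw stack trace
  }),
]);

type TaskResponse = z.infer<typeof TaskResponse>;

// In a handler, validate even the error path before replying:
//   return c.json(TaskResponse.parse({ task_id, state: "failed", error: message }));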


The Investigator: first agent with a real LLM

Once the Orchestrator → Notifier ping (no LLM) was working, we added the Investigator as the third service. It's the first agent with real reasoning — it uses the Anthropic SDK to call Claude.

The "skill as markdown" pattern

What defines the Investigator isn't code: it's a markdown file passed to it as a parameter. Simplified example of skills/upgrade-research.md:

You are a software upgrade research agent. Given a TARGET and optional
PROJECT CONTEXT, produce a concise, actionable report focused on:

1. Current state — latest stable version vs project's version
2. Breaking changes — ordered by impact
3. Required migrations
4. Recommended (optional) changes
5. Known pitfalls
6. Best practices post-upgrade

End your response with a fenced JSON block:
{
  "summary": "...",
  "findings": [{title, severity, category, description}, ...]
}

The Investigator loads this file when receiving each task and uses it as system prompt. The architectural consequence:

Separating the mission (what to investigate, in markdown) from the agent (how to investigate, in code) lets you reuse the same Investigator to audit Next.js today and FastAPI tomorrow, just by changing the .md file passed in the payload.

It's the same pattern Claude Code uses with its skills: the agent is generic, the skill defines the concrete task.

The handler

app.post("/tasks/send", async (c) => {
  const { task_id, skill_id, payload } = parseTask(await c.req.json());
  const { skill_uri, topic, project_context } = ResearchPayload.parse(payload);

  const skillContent = readFileSync(resolve(SKILLS_DIR, skill_uri), "utf-8");

  const userPrompt = [
    `TARGET: ${topic}`,
    project_context?.name && `PROJECT: ${project_context.name}`,
    project_context?.current_version && `CURRENT: ${project_context.current_version}`,
  ].filter(Boolean).join("\n");

  const response = await claude.messages.create({
    model: MODEL,
    max_tokens: 2000,
    system: skillContent,           // ← the "mission"
    messages: [{ role: "user", content: userPrompt }],
  });

  // Extract text + try to parse the final JSON block
  const reportMd = response.content
    .filter(b => b.type === "text")
    .map(b => b.text)
    .join("\n");

  const structured = extractJsonBlock(reportMd);

  return c.json({
    task_id,
    state: "completed",
    result: { report_md: reportMd, structured, model_used: MODEL, usage: response.usage },
  });
});
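extractJsonBlock, referenced in the handler above, isn't shown. A minimal sketch of what such a helper can do: find the last fenced JSON block in the model's answer and parse it, returning null instead of throwing when the model didn't comply:

// Illustrative helper: pull the trailing ```json ... ``` block out of the model's
// markdown answer. Returning null (instead of throwing) means a sloppy answer
// degrades to "no structured data" rather than a failed task.
function extractJsonBlock(markdown: string): Record<string, unknown> | null {
  const matches = [...markdown.matchAll(/```(?:json)?\s*([\s\S]*?)```/g)];
  const last = matches.at(-1)?.[1];
  if (!last) return null;
  try {
    return JSON.parse(last) as Record<string, unknown>;
  } catch {
    return null;
  }
}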

Worth noting: skills get mounted as a read-only Docker volume. Editing a skill doesn't require a rebuild of the container — the next task picks it up from the host filesystem:

investigator:
  volumes:
    - ./skills:/app/skills:ro

The non-technical obstacle: Anthropic billing

When we fired the first real chat with LLM, the API rejected the request:

{"type":"error","error":{
  "type":"invalid_request_error",
  "message":"Your credit balance is too low to access the Anthropic API."
}}

Despite the Anthropic console showing US$5.00 Remaining balance.

What we learned about Anthropic billing, in order of probability of hitting you:

  1. Orgs are isolated for billing: an API key belongs to one specific org. If you add credits to org A but your key is from org B, the key still has no balance. It's the equivalent of having two bank accounts — putting money in one doesn't fill the other. The diagnostic clue: the anthropic-organization-id header in any API response tells you which org the key belongs to.

  2. "Credit grant" ≠ paid credits: Anthropic gives promotional credits in "grant" state, but some accounts require a real paid purchase to "activate" them. The red banner "To get started... purchase some credits" shows up even with visible balance when in this intermediate state.

  3. Workspaces have individual limits: each workspace can have its own spend limit. By default some new workspaces come with limit $0 — anti-surprise protection that becomes a foot-gun the first time.

  4. Tier-based access: Anthropic classifies accounts in tiers (Tier 1, 2, 3, ...) based on accumulated spend. Available models and rate limits depend on the tier.

Architectural implication: any system depending on a paid external API needs a mock mode from day 1 — not as a temporary patch, but as permanent infrastructure.


Mock mode as a first-class citizen

To avoid blocking development while billing got sorted out, we gave the Investigator a mock mode controlled by an env var:

INVESTIGATOR_MODE=mock   # returns canned report, no API call
INVESTIGATOR_MODE=live   # calls Claude (default)

The implementation is an if in the handler:

if (MODE === "mock") {
  textContent = mockReport(research.topic);
} else {
  const response = await claude!.messages.create({ ... });
  textContent = extractText(response);
}

The mock report returns the same structure as a real report (markdown plus JSON block with findings). Consumer agents can't tell the difference.
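A sketch of the kind of canned output mockReport can return (the wording is invented; the shape is the point):

// Illustrative mock: same contract as a live answer (markdown report ending in a
// fenced JSON block), so downstream agents can't tell the difference.
function mockReport(topic: string): string {
  const structured = {
    summary: `Mock research for ${topic}. Set INVESTIGATOR_MODE=live for real data.`,
    findings: [
      { title: "Mock finding", severity: "info", category: "mock", description: "Generated without an LLM call." },
    ],
  };
  return [
    `# Upgrade research: ${topic} (MOCK)`,
    "",
    "## Current state",
    "Canned answer; no API call was made.",
    "",
    "```json",
    JSON.stringify(structured, null, 2),
    "```",
  ].join("\n");
}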

This isn't throwaway code. It's permanent infrastructure:

  • In CI/CD: integration tests run in mock, never touching Anthropic.
  • In demos: you present the architecture without depending on the API.
  • In debugging: if Claude is down, you flip the flag and continue with other agents.
  • In production: the flag is live. Period.

Any agent that talks to external systems (Git, SMTP, APIs) will have its own MODE so we can mock it in isolation. It's the foundation of isolated testing in A2A systems.


The minimal UI: HTML embedded in the Orchestrator

So we don't need to use curl every time we test something, we added a GET / endpoint with inline HTML in the Orchestrator's server.ts. ~50 lines, no React, no build step, no external assets:

const CHAT_UI = /* html */ `<!doctype html>
<html>
<head>...</head>
<body>
  <form id="f">
    <textarea id="t">Next.js 14 → 15 upgrade</textarea>
    <button>Investigate and notify</button>
  </form>
  <pre id="res">—</pre>
  <script>
    document.getElementById('f').addEventListener('submit', async (e) => {
      e.preventDefault();
      const r = await fetch('/chat', {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify({ text: document.getElementById('t').value }),
      });
      document.getElementById('res').textContent =
        JSON.stringify(await r.json(), null, 2);
    });
  </script>
</body>
</html>`;

app.get("/", (c) => c.html(CHAT_UI));

Walking skeleton in pure form: the UI isn't pretty, isn't scalable, but it lets you iterate without curl. When it grows beyond ~80 lines, we'll move it to public/index.html with serveStatic. Not today — that would be noise.


Basic observability: the logger middleware

Hono doesn't log requests by default (unlike a typical Express setup where morgan is already wired in). For an A2A system, request logging is non-negotiable: when a task passes through 5 chained agents and takes 8 seconds, the only way to know who's slow is seeing the latency of each hop:

orchestrator | <-- POST /chat
investigator | <-- POST /tasks/send
investigator | researching topic="Next.js 14 → 15" task_id=task-abc
investigator | done task_id=task-abc tokens_in=487 tokens_out=1834
investigator | --> POST /tasks/send 200 14823ms       ← LLM takes this
notifier     | <-- POST /tasks/send
notifier     | --> POST /tasks/send 200 5ms
orchestrator | --> POST /chat 200 14841ms             ← total

Sum of latencies = 14823 (investigator) + 5 (notifier) + ~13 (orchestration). Without this info, distributed debugging is blind.

Activation is three lines:

import { logger } from "hono/logger";

app.use("*", logger((msg) => {
  if (msg.includes("/health")) return;   // healthchecks every 5s = noise
  console.log(`[orchestrator] ${msg}`);
}));

We learned the need for the /health filter right away: healthchecks run every 5 seconds, and in the logs they're pure noise. Filtering at the source (in the logger callback) beats filtering at consumption (with grep) because it saves storage when these logs eventually go to Datadog/Loki/CloudWatch.

Filter deterministic noise. Preserve unexpected noise. That mental rule applies across all observability.


What we didn't do today (and why)

It's important to be explicit about Day 1's scope, because the list of "what was missing" is much longer than "what was built":

Not done → why:
  • Pydantic/Zod schemas typed to the millimeter: we deliberately left dict[str, Any]. Type after iterating.
  • JSON-RPC 2.0 (the real A2A spec): plain REST is easier to debug. Migrate when it hurts.
  • Streaming SSE for long responses: unnecessary with two steps. Essential with five chained agents.
  • Web search in the Investigator: without it, the LLM responds with training knowledge. Enough to learn the pattern.
  • ArtifactRef for heavy reports: two steps don't need shared storage. It'll show up with the Auditor.
  • Auditor, Developer, Tester: variations of the Investigator. Leaving them for Day 2+.
  • YAML playbook engine: the Orchestrator hardcodes the flow. Refactor when there are 3+ agents.
  • Auth between agents: Bearer tokens in Agent Cards. Day N.
  • Automated tests: the walking skeleton is tested with curl and eyes.
  • Production deploy: any cloud with Docker. Day N.

Each of these will arrive when its absence hurts, not before.


Lessons of the day

  1. The stack you already know wins, almost always. The learning curve of the real problem (A2A, agents, LLMs) is steep. Don't add the language's curve to it.

  2. Walking skeleton > pretty design. The important questions are answered by the first real ping, not the first diagram.

  3. Schemas are policy, not just shape. A required field encoding "no testing without human approval" is more robust than a comment on a wiki.

  4. Mock as a first-class citizen. Any external dependency deserves its own flag from day 1. It's testing, demo, and isolated debugging all at once.

  5. localhost is not 127.0.0.1. Memorize. This will bite you again.

  6. task_id propagated end-to-end is the basis of distributed tracing without OpenTelemetry. If you build it from day 1, you save yourself refactors.

  7. The Orchestrator is a router, not a more capable agent. The discipline: if it needs a new capability, create an agent, even trivial. The alternative is a god-object in six months.


Tomorrow: the Auditor

The natural next step is the Auditor: third agent, also with LLM, which receives the Investigator's report as input and produces an AuditReport with concrete findings about real code.

This is where we'll feel the first real schema friction: passing the report_id between agents instead of the full report, storing it somewhere (filesystem inside the container, redis, S3...). It's the moment ArtifactRef is born — but this time out of real necessity, not speculative design.


How to replicate today

git clone <repo>
cd agentic-architecture
cp .env.example .env       # edit with your ANTHROPIC_API_KEY (or INVESTIGATOR_MODE=mock)
npm run build              # ~60-90s first time
npm run up                 # starts the 3 services
npm run ping               # health + agent cards + test chat
npm run logs               # tail live

Then open http://localhost:8000/ and try with your own topic.

For the curious about the internal A2A client and Agent Cards, everything is in the repo. Any feedback, issue, or PR for Day 2 is welcome.


Day 1 closes with three independent services talking via A2A, an agent with a real LLM (mockable), a minimal functional UI, and basic observability with a propagated task_id. All the friction we discovered today is friction that won't surprise us in the next four agents.

Continues: Day 2 — the Auditor and the ArtifactRef pattern.
