
Agentikas MVP — 7 agents to codify your senior engineer experience

An agentic software development architecture where human experience gets codified into skills, schemas, and names. It works — but without a senior behind it, it's the perfect recipe for technical debt at AI speed. The MVP, the 5 agents already running, the 2 yet to come, and why the tool doesn't replace judgment.


The question that led me to build this: how much of what I do as a senior engineer isn't writing code — it's deciding how it gets written?

I've spent years building software. I went through every phase: junior learning not to break things, mid reading other people's code, senior making architectural decisions on instinct, lead designing systems that survive beyond my own head. And one day, a few months ago, an honest observation:

80% of my senior work wasn't writing code. It was deciding what gets written, where, following which pattern, with which tests, under which constraints.

Modern AI tools (Claude, Cursor, Copilot, the Anthropic and OpenAI SDKs with tool use, MCP, agentic loops) change the rules: the writing is getting cheaper by the month. But the deciding — that human part — is still here.

Agentikas is my attempt to codify the deciding. Not the code itself, but the decisions that make code not suck. And this MVP is the first working version of that idea: seven specialized agents (five running today, two designed), each with a well-defined skill, orchestrated via A2A, reading your repo via MCP, and chained into a pipeline that produces reviewable code.

This post describes the MVP in detail, the technical decisions behind it, and — honest from day one — the risks. Because if this tool ends up in inexperienced hands, the result isn't an accelerator. It's technical debt at AI speed.


The thesis in one image

                  ┌──────────────────────────────────────┐
                  │           Orchestrator               │
                  │  conversational, owns the flow       │
                  └─────────────────┬────────────────────┘
                                    │ A2A (HTTP/JSON-RPC)
       ┌──────────┬─────────────────┼──────────────┬──────────────┐
       ▼          ▼                 ▼              ▼              ▼
  ┌────────┐ ┌────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────┐
  │Project │ │Investi-│ │   Project    │ │   Feature    │ │ Notifier │
  │Mapper  │ │gator   │ │   Explorer   │ │   Developer  │ │          │
  │ :8003  │ │ :8001  │ │  :8002       │ │   :8004      │ │  :8005   │
  └────┬───┘ └────────┘ └──────────────┘ └──────┬───────┘ └──────────┘
       │ MCP                                     │ persist
       ▼                                         ▼
  ┌──────────────┐                       ┌────────────────┐
  │ Codebase MCP │                       │ Cloudflare R2  │
  │   :9000      │                       │ agentic-       │
  └──────────────┘                       │ artifacts      │
                                         └────────────────┘
       ▲                                          ▲
       │ ro mount                                 │ S3 SDK
       │                                          │
  target repo                              persisted patches

Plus two more agents on the TODO list:

  • QA Tester (:8006) — tests, iterates with the Feature Developer until all tests pass, opens the PR.
  • Project Tech Librarian (:8007) — documents ADRs, recurring errors, emerging best practices, and improves the skills over time.

Total: 7 agents + one MCP server + a conversational orchestrator. Each named for an active role (not a descriptive one), each with its own swappable skill markdown, each schema-validated with Zod.


The 5 agents in today's MVP

1. Project Mapper

Reads the target repo through a Codebase MCP server (Model Context Protocol). Returns a structured ProjectMap: stack_summary (what the Investigator needs) + project_map (what the Tech Lead needs).

The elegant detail: it uses Anthropic SDK's native MCP integration. Zero lines of MCP client code written — the SDK discovers the server's tools, exposes them to Claude, manages the entire agentic loop.

import Anthropic from "@anthropic-ai/sdk";

const claude = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await claude.beta.messages.create(
  {
    model: "claude-sonnet-4-6",
    max_tokens: 4000,
    system: skillContent,
    mcp_servers: [
      { type: "url", url: "http://codebase-mcp:9000/mcp", name: "codebase" },
    ],
    messages: [{ role: "user", content: "Map this repo..." }],
  },
  { headers: { "anthropic-beta": "mcp-client-2025-04-04" } },
);
// Claude calls get_repo_info, list_directory, read_file dynamically.
// You only get the final synthesized result.

2. Investigator

External research: given a feature + project stack, returns canonical patterns, standard libraries, known pitfalls, and edge cases that must be tested.

Swappable skill per flow: feature-research.md for new features, upgrade-research.md for version upgrades, security-research.md when security flows arrive. Same agent, different mission.
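
To make "same agent, different mission" concrete, here's a minimal sketch of how an orchestrator could route a flow to a skill. Only skill_uri is a real payload field (it shows up again in the skills section below); the routing map and helper name are hypothetical:

// Hypothetical routing table; only skill_uri is a field the agent actually reads.
const SKILL_BY_FLOW: Record<string, string> = {
  feature: "skills/feature-research.md",
  upgrade: "skills/upgrade-research.md",
  security: "skills/security-research.md", // flow not built yet
};

function buildInvestigatorPayload(flow: string, feature: string, stack_summary: string) {
  return {
    skill_uri: SKILL_BY_FLOW[flow] ?? SKILL_BY_FLOW.feature,
    feature,
    stack_summary,
  };
}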

3. Project Explorer (Tech Lead)

The decider. Receives research + project_map and produces a DevelopmentBrief: concrete file paths, follow_pattern pointing to a real reference file in the repo, numbered test_scenarios, constraints (max_files, forbidden patterns), acceptance_criteria, out_of_scope.

It's the agent loaded with the most "codified senior experience". Its prompt isn't "audit the code" — it's "you're a tech lead; produce a brief a junior can execute without asking clarifying questions." The reframing of the name ("Auditor" → "Tech Lead") changes the model's outputs more than any elaborate prompt.

{
  "feature_summary": "Magic-link auth: email → token → session",
  "files_to_create": [
    { "path": "src/auth/magic-link.ts", "purpose": "core token logic" }
  ],
  "follow_pattern": {
    "reference_file": "src/auth/oauth.ts",
    "why": "imitate error handling, naming, session integration"
  },
  "test_scenarios": [
    { "id": "t1", "name": "valid email generates token", "type": "unit" },
    { "id": "t2", "name": "expired token rejected", "type": "unit" }
  ],
  "constraints": {
    "max_files_changed": 8,
    "test_framework": "vitest",
    "forbidden_patterns": ["any", "ts-ignore"]
  },
  "out_of_scope": ["Refactoring legacy auth code"]
}
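
Every payload in the pipeline is Zod-validated, so a brief like that implies a schema roughly like the one below. This is a sketch reconstructed from the example fields, not the project's actual definition:

import { z } from "zod";

const DevelopmentBrief = z.object({
  feature_summary: z.string(),
  files_to_create: z.array(z.object({ path: z.string(), purpose: z.string() })),
  follow_pattern: z.object({ reference_file: z.string(), why: z.string() }),
  test_scenarios: z
    .array(z.object({ id: z.string(), name: z.string(), type: z.enum(["unit", "integration", "e2e"]) }))
    .min(1), // a brief with zero test scenarios is not executable
  constraints: z.object({
    max_files_changed: z.number().int().positive(),
    test_framework: z.string(),
    forbidden_patterns: z.array(z.string()),
  }),
  out_of_scope: z.array(z.string()),
});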

4. Feature Developer

Executes the brief with TDD: tests first, code after. Output is a structured patch (list of files with content) — it doesn't write to disk.

Why doesn't it write? Sandboxing. Real writing is the job of a future Applier component with restricted permissions. For now, the output flies as JSON and gets persisted to Cloudflare R2 (bucket agentic-artifacts) so a human or the future QA Tester can apply it.
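
R2 is S3-compatible, so persistence goes through the standard AWS SDK. A sketch of what the R2Store half of shared/artifact-store.ts might look like; the bucket name and key layout come from this post, the env var names are my assumptions:

import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

// R2 speaks the S3 protocol; only the endpoint is Cloudflare-specific.
const r2 = new S3Client({
  region: "auto",
  endpoint: `https://${process.env.R2_ACCOUNT_ID}.r2.cloudflarestorage.com`,
  credentials: {
    accessKeyId: process.env.R2_ACCESS_KEY_ID!,
    secretAccessKey: process.env.R2_SECRET_ACCESS_KEY!,
  },
});

async function persistArtifact(taskId: string, artifact: unknown): Promise<void> {
  await r2.send(
    new PutObjectCommand({
      Bucket: "agentic-artifacts",
      Key: `tasks/${taskId}.json`, // key layout described in the Git section below
      Body: JSON.stringify(artifact, null, 2),
      ContentType: "application/json",
    }),
  );
}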

5. Notifier

The simplest one. No LLM. Receives subject + body + recipients and delivers via email/Slack/webhook (in walking skeleton: log to stdout). It exists because separating delivery from reasoning keeps cognitive agents clean. The day we want to switch from SMTP to SendGrid or add Slack, we touch one agent, not five.
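
In the walking skeleton the Notifier really is this small. A sketch of its handler, assuming Express (the framework choice and route name are my assumptions, not from the repo):

import express from "express";

const app = express();
app.use(express.json());

// No LLM here: the Notifier only delivers. Swapping stdout for SMTP,
// SendGrid, or Slack later means touching this one handler.
app.post("/notify", (req, res) => {
  const { subject, body, recipients } = req.body;
  console.log(`[notify] to=${recipients.join(", ")} | ${subject}\n${body}`);
  res.json({ state: "completed" });
});

app.listen(8005); // port from the architecture diagram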

+ Codebase MCP server

Not an agent — it's infrastructure. Exposes the target repo as a Model Context Protocol server with 4 tools: list_directory, read_file, search (ripgrep), get_repo_info. Sandboxed (every path verified against REPO_ROOT), execFile (no shell — anti-injection), StreamableHTTP transport.

The critical part: it doesn't know A2A exists. It's a standard MCP server that any MCP client can consume (Claude Desktop, another IDE, another agent). The Project Mapper uses it internally without the rest of the system noticing. A2A coordinates; MCP provides capabilities. Orthogonal composition.
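
The two sandboxing habits the server relies on fit in a few lines: resolve every requested path against REPO_ROOT and reject anything that escapes it, and spawn ripgrep through execFile so no shell ever parses the arguments. A sketch:

import { resolve, sep } from "node:path";
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const REPO_ROOT = "/repo"; // the ro mount from docker-compose
const execFileP = promisify(execFile);

// Every tool call funnels through this: a path that escapes REPO_ROOT throws.
function safePath(relPath: string): string {
  const abs = resolve(REPO_ROOT, relPath);
  if (abs !== REPO_ROOT && !abs.startsWith(REPO_ROOT + sep)) {
    throw new Error(`path escapes repo root: ${relPath}`);
  }
  return abs;
}

// execFile takes args as an array: no shell, no injection surface.
// The "--" stops ripgrep from reading a hostile pattern as a flag.
async function search(pattern: string): Promise<string> {
  try {
    const { stdout } = await execFileP("rg", ["--line-number", "--", pattern, REPO_ROOT]);
    return stdout;
  } catch (err: unknown) {
    if ((err as { code?: number }).code === 1) return ""; // rg exits 1 on "no matches"
    throw err;
  }
}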


The 2 pending agents

QA Tester (next up)

The one closing the quality loop. Receives the Feature Developer's patch and:

  1. Reconstructs the filesystem in an isolated sandbox
  2. Runs: type-check, lint, existing suite + new tests from the patch
  3. If everything passes: runs git checkout -b feat/<slug>, commit, push, opens a PR via the GitHub API
  4. If something fails: returns state: input-required to the Feature Developer with a structured payload:
{
  "iteration": 2,
  "failures": [
    {
      "test_id": "t3",
      "error": "Expected 'expired' but got 'invalid'",
      "file": "src/auth/__tests__/magic-link.test.ts:45"
    }
  ],
  "suggestions": [
    "Token expiration check at line 23 should compare with Date.now() in ms"
  ]
}

The Feature Developer takes this, adjusts its patch, returns a new attempt. Loop capped at 3-5 iterations. After that → state: input-required to the human. Without a cap, agents can iterate indefinitely with micro-fixes that never converge.
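
The cap itself is trivial to express and crucial to have. A sketch of the loop, where the helper names and types are illustrative, not from the repo:

type Patch = { files: { path: string; content: string }[] };
type Failure = { test_id: string; error: string; file: string };

// Hypothetical helpers: run the checks, open the PR, round-trip to the developer.
declare function runChecks(patch: Patch): Promise<{ failures: Failure[]; suggestions: string[] }>;
declare function openPullRequest(patch: Patch): Promise<void>;
declare function sendBackToDeveloper(payload: object): Promise<Patch>;

const MAX_ITERATIONS = 5;

async function qaLoop(patch: Patch): Promise<{ state: string; reason?: string }> {
  for (let iteration = 1; iteration <= MAX_ITERATIONS; iteration++) {
    const report = await runChecks(patch); // type-check, lint, existing suite + new tests
    if (report.failures.length === 0) {
      await openPullRequest(patch); // branch, commit, push, PR via the GitHub API
      return { state: "completed" };
    }
    // Hand the structured failure payload back and get a revised patch.
    patch = await sendBackToDeveloper({ iteration, ...report });
  }
  // Micro-fixes stopped converging: escalate to a human.
  return { state: "input-required", reason: "max QA iterations reached" };
}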

This requires:

  • Real sandbox (Docker-in-Docker or a dedicated runner)
  • Git token with repo permissions
  • GitHub API SDK (@octokit/rest)

Project Tech Librarian

The most interesting in the long run. Its job:

  • Reads the historical artifacts in R2 (every brief, patch, and test report from past tasks)
  • Detects recurring patterns: errors the QA Tester rejects repeatedly, similar architectural decisions across briefs, best practices the Tech Lead applies consistently
  • Generates ADRs (Architecture Decision Records) when a decision repeats N times
  • Documents antipatterns when the same bug shows up in M tasks
  • Proposes improvements to the skill markdowns when a skill produces mediocre output repeatedly

It's the agent that turns the system from static to adaptive. Without the Librarian, skills are human code you maintain by hand. With the Librarian, the system improves with every cycle — every merged PR is training for the next one.

Schema-as-policy: the LibrarianReport requires evidence_artifacts: string[] with at least 2 entries. It doesn't generate docs without real evidence from past tasks supporting the learning. That prevents inventing "best practices" that sound good but aren't validated by actual system usage.
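
In Zod, that policy is a single line. A sketch of the relevant slice of LibrarianReport (the other fields are guesses):

import { z } from "zod";

const LibrarianReport = z.object({
  learning: z.string(),
  proposed_doc: z.string(),
  // Schema-as-policy: no documented learning without at least 2 real artifacts behind it.
  evidence_artifacts: z.array(z.string()).min(2),
});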


The engineering behind it: monorepo, workspaces, Docker

Repo structure

agentic-architecture/
├── package.json              ← root with workspaces declared
├── package-lock.json         ← single lockfile
├── Dockerfile.workspace      ← ONE dockerfile that serves every service
├── docker-compose.yml
├── .dockerignore
│
├── shared/                   ← package "@agentikas/shared"
│   ├── package.json
│   ├── a2a-client.ts         ← reusable A2A client
│   ├── artifact-store.ts     ← LocalStore + R2Store
│   └── local-llm.ts          ← Ollama OpenAI-compat wrapper
│
├── orchestrator/             ← package
│   ├── package.json          ← depends on "@agentikas/shared"
│   ├── tsconfig.json
│   └── src/server.ts
│
├── agents/
│   ├── investigator/         ← package
│   ├── notifier/
│   ├── project-explorer/
│   ├── project-mapper/
│   └── feature-developer/
│
├── services/
│   └── codebase-mcp/         ← MCP server, not an agent
│
├── skills/                   ← markdown, mounted read-only into containers
│   ├── feature-research.md
│   ├── feature-briefing.md
│   ├── feature-development.md
│   ├── project-mapping.md
│   └── upgrade-research.md
│
├── workspace/                ← LocalStore (gitignored)
└── blog/                     ← markdown for posts (this one included)

npm workspaces: hoisting + symlinks

The root package.json declares:

{
  "name": "agentic-architecture",
  "workspaces": ["shared", "orchestrator", "agents/*", "services/*"]
}

npm install at root:

  1. Installs ALL deps from ALL packages in the root /node_modules (hoisting)
  2. Symlinks every workspace package into each consumer's node_modules
  3. Result: a single lockfile, deduplicated deps, agents importing from "@agentikas/shared/local-llm" instead of from "../../../shared/local-llm.js"

Without workspaces (what I had before): each agent with its own package.json + node_modules, shared/ copied as source code with no deps of its own. When I added R2Store and it needed @aws-sdk/client-s3 from shared/, everything broke: Node ESM resolution looked for the dep in /app/shared/node_modules which didn't exist. Workspaces fixes this automatically — the dep gets hoisted to the root, visible to everyone.
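
For the subpath import "@agentikas/shared/local-llm" to resolve, shared/package.json has to expose it. A plausible sketch, assuming the services execute TypeScript directly (a tsx-style loader); the exact targets would differ if shared/ were compiled first:

{
  "name": "@agentikas/shared",
  "type": "module",
  "exports": {
    "./a2a-client": "./a2a-client.ts",
    "./artifact-store": "./artifact-store.ts",
    "./local-llm": "./local-llm.ts"
  },
  "dependencies": {
    "@aws-sdk/client-s3": "^3.600.0"
  }
}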

One parameterized Dockerfile

The trick that cleans up the repo most:

# Dockerfile.workspace — single one, serves 7 services
FROM node:22-alpine
WORKDIR /app
RUN apk add --no-cache ripgrep git

# Cacheable layer: every workspace's package.json
COPY package.json package-lock.json* ./
COPY shared/package.json ./shared/
COPY orchestrator/package.json ./orchestrator/
COPY agents/investigator/package.json ./agents/investigator/
# … other workspaces

RUN npm install --include=dev          # hoists to /app/node_modules

COPY . .                                # source code

ARG SERVICE_PATH                        # parameterizable
WORKDIR /app/${SERVICE_PATH}
CMD ["npm", "start"]

And in docker-compose.yml each service passes its own SERVICE_PATH:

investigator:
  build:
    context: .
    dockerfile: Dockerfile.workspace
    args:
      SERVICE_PATH: agents/investigator

feature-developer:
  build:
    context: .
    dockerfile: Dockerfile.workspace
    args:
      SERVICE_PATH: agents/feature-developer

One single Dockerfile to maintain instead of seven. Adding a new agent (QA Tester tomorrow, Tech Librarian next week) is one entry in docker-compose, zero new files in the agent's directory beyond its own code.

Git: where does the code the agents produce end up?

Today:

  • Skills (skills/*.md) → in this repo (agentic-architecture), versioned normally
  • Cognitive outputs (briefs, patches, reports) → Cloudflare R2, bucket agentic-artifacts, key tasks/<task_id>.json
  • Code generated by Feature Developer → when QA Tester is ready, goes to feat/<slug> in the client's target repo via the GitHub API

The target repo is separate from the architecture repo. The agentic system doesn't write to its own code — it writes to the code of the project it's helping develop. The Codebase MCP server mounts it read-only:

codebase-mcp:
  volumes:
    - ${TARGET_REPO:-./}:/repo:ro

TARGET_REPO is an env var: defaults to the agentic-architecture repo itself (great for dogfooding and demos), but pointing it to any path on disk swaps the target. When we deploy to Cloudflare, it'll be a volume mounted from a fresh clone of the client's repo.


Three MODEs per LLM agent

Every cognitive agent has three operation modes, controlled by env var:

INVESTIGATOR_MODE=mock     # canned response, $0, instant
INVESTIGATOR_MODE=local    # Ollama + Qwen Coder 32B, $0, ~30s
INVESTIGATOR_MODE=live     # Anthropic Claude Sonnet 4.6, ~$0.01, ~3s

mock for CI and integration tests. local for serious development without burning API credits or leaking code (privacy: everything stays on your Mac). live for max quality in production.

The architectural trick: the if/else lives inside the agent, not in a different class per mode. Same signature, different cognitive engine.

if (MODE === "mock") {
  textContent = mockReport(topic);
} else if (MODE === "local") {
  const r = await callLocalLLM({ system, user, maxTokens: 2000 });
  textContent = r.text;
} else {
  const response = await claude.messages.create({ ... });
  textContent = extractText(response);
}

That callLocalLLM function lives in shared/local-llm.ts — one single implementation reused by the 4 LLM agents. The day you swap in another local model (vLLM, llama.cpp server, another distro), you touch one file, not four.
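
Since Ollama exposes an OpenAI-compatible endpoint, callLocalLLM can be a thin fetch wrapper. A sketch of what shared/local-llm.ts might contain; the URL and model tag are the usual Ollama defaults, not confirmed from the repo:

const OLLAMA_URL = process.env.OLLAMA_URL ?? "http://localhost:11434/v1";
const LOCAL_MODEL = process.env.LOCAL_MODEL ?? "qwen2.5-coder:32b";

export async function callLocalLLM(opts: {
  system: string;
  user: string;
  maxTokens: number;
}): Promise<{ text: string }> {
  // Ollama's /v1 routes speak the OpenAI chat-completions dialect.
  const res = await fetch(`${OLLAMA_URL}/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: LOCAL_MODEL,
      max_tokens: opts.maxTokens,
      messages: [
        { role: "system", content: opts.system },
        { role: "user", content: opts.user },
      ],
    }),
  });
  if (!res.ok) throw new Error(`local LLM returned ${res.status}`);
  const data = await res.json();
  return { text: data.choices[0].message.content };
}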


How human experience gets codified

This section is the one I care about most, because it's the project's thesis.

1. Names → prompts

LLM models are extraordinarily sensitive to declared identity. An agent named "Auditor" produces problem lists. One named "Tech Lead" produces executable specs. Same input, same literal prompt — different emergent behavior because of the name's pull.

When you design the network, naming your agents well is implicit prompt engineering. "Project Mapper" suggests mapping; "Codebase Auditor" would suggest finding problems; "Stack Detector" would suggest minimalism. By picking the right name, you bias the model's output in the direction you want.

2. Skills as swappable markdown

Each agent loads its mission from a .md file passed in the payload:

import { readFileSync } from "node:fs";
import { resolve } from "node:path";

const skillContent = readFileSync(
  resolve(SKILLS_DIR, payload.skill_uri.replace(/^skills\//, "")),
  "utf-8",
);

// And passes it to Claude as system prompt:
const response = await claude.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 4000,
  system: skillContent,   // ← the "mission"
  messages: [{ role: "user", content: userPrompt }],
});

The markdowns get mounted read-only into containers as a Docker volume:

investigator:
  volumes:
    - ./skills:/app/skills:ro

Editing a skill doesn't require a rebuild. You change the markdown, the next task picks it up. That turns improving the system into something a product manager with good judgment can do without touching TypeScript.

3. Constraints in schemas (not in wikis)

The Tech Lead's brief defines forbidden_patterns: ["any", "ts-ignore"]. The Feature Developer cannot return a patch that contains those patterns without the schema rejecting it. That's schema-as-policy: the business rule travels with the data, not with documentation.

const DevelopmentResult = z.object({
  patches: z.array(z.object({ path: z.string(), content: z.string() })), // patch shape illustrative
  tests_written: z.number().min(1),          // ≥1, not optional
  constraints_violated: z.array(z.string()), // if non-empty → state: failed
});
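
The schema earns its keep at the boundary. A sketch of how a failed parse, or a non-empty constraints_violated, becomes a structured state instead of a crash (the wiring is illustrative):

function validateResult(raw: unknown) {
  const parsed = DevelopmentResult.safeParse(raw);
  if (!parsed.success) {
    // Malformed LLM output never propagates downstream.
    return { state: "failed", reason: parsed.error.message };
  }
  if (parsed.data.constraints_violated.length > 0) {
    return { state: "failed", reason: parsed.data.constraints_violated.join("; ") };
  }
  return { state: "completed", result: parsed.data };
}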

4. follow_pattern: analogy over imagination

The most important field in the brief isn't the file list — it's follow_pattern.reference_file. Pointing to a real file in the repo where a similar pattern already exists massively reduces hallucinations. The LLM works by analogy with real code, not by abstract imagination.

This mirrors exactly how a senior teaches a junior: "look at how we did it in the auth module, do it the same way here." We don't explain patterns from scratch. We point at the example and ask them to imitate it.


Real advantages of the agentic approach

  1. Codifies experience that would otherwise be lost: when a senior leaves, their judgment leaves with them. If it lives in skills/markdowns, it doesn't.

  2. Parallel scalability: 5 different features in flight at the same time = 5 independent pipelines. A human team doesn't scale that way without losing coherence.

  3. ~$0.10-0.15 per feature (in live mode with Sonnet). Versus ~€150 for a junior dev's morning. We're not replacing seniors — we're parallelizing execution.

  4. Provider resilience: three MODEs (mock/local/live) means an Anthropic outage doesn't block development. I hit this last week when billing got stuck for 6 hours.

  5. Auditability: every brief, patch, and test report gets persisted to R2 with its task_id. Three months later you can answer "why was this code designed this way?" — it's literally all saved.

  6. Speed of iterating the system itself: editing a skill markdown and watching how the agent's behavior changes on the next task is a feedback loop that doesn't exist in human teams.


Honest tradeoffs

  1. Cognitive quality below the frontier: Qwen Coder 32B local works great for iteration, but doesn't reach Sonnet 4.6 on complex code. Sonnet 4.6 in turn doesn't reach a seasoned human senior on non-trivial architectural decisions.

  2. Latency: 5-step pipeline in local mode takes ~6 minutes. In live ~45 seconds. Compare with a human senior who decides obvious things in 30 seconds. It's not real-time feedback yet.

  3. Complex sandboxing: the Feature Developer doesn't write to disk today. The future QA Tester will require Docker-in-Docker or a permission-checked runner agent. That's serious security infrastructure work.

  4. Dependency chain: if the Project Mapper fails, everything downstream fails. You have to design fallbacks (mock per agent) and gracefulness (structured state: failed), which adds complexity.

  5. Cognitive cost of maintaining prompts: when LLM models change (new version, different family), skills may need rework.


The risk almost no one mentions: without a senior, it doesn't work

Here comes the uncomfortable part.

AI tools democratize execution but not judgment. Having 7 agents running doesn't turn a junior into a senior. It turns a senior into someone with superpowers — and a junior into someone producing technical debt faster than they can review it.

What happens when an inexperienced team adopts this?

1. They don't catch bad briefs. The Tech Lead might produce a brief that sounds plausible but points at src/features/oauth/ when that path doesn't exist. A senior would catch it reading the brief in 10 seconds. A junior accepts, executes, discovers the problem 2 hours later with half-written code.

2. They don't catch bad code that passes tests. Tests measure what you ask them to measure. If the brief didn't include the right edge case, the tests pass and the code has a latent bug. A senior smells "this passes but something feels off". A junior celebrates the green and merges.

3. They trust output that looks good. LLMs produce plausible prose and syntactically correct code. The line between "correct code" and "plausible code" is what separates software that survives 2 years from software that crashes in production at 6 months. Only someone who's lived through the previous crashes sees that line.

4. They don't know when to stop the loop. The QA Tester will iterate with the Feature Developer until tests pass. But if the problem is design (the brief was wrong), iterating doesn't fix it — it just adds patches. A senior detects "this won't fix with more loops, we need to redesign". A junior keeps iterating until it appears to work.

5. They produce tech debt at AI speed. This is the most serious one. A junior with an agentic team produces more code than a senior alone. But the fraction of that code that's well-designed is smaller. The ratio of technical debt generated to value delivered inverts. And unlike a human junior who learns from mistakes, the junior with agents repeats them at scale because they don't recognize them.

The senior's role on an agentic team

Doesn't disappear. Changes:

  • Designs the skills: the markdown content is where the experience lives. Editing feature-briefing.md to add a new constraint is like training the Tech Lead.
  • Reviews outputs against intuition: when a brief smells wrong, they catch it. When a patch passes tests but leaves debt, they catch it.
  • Hunts failure modes: in the first weeks with the system, every weird case that shows up goes to the documentation (future Project Tech Librarian) or to a new constraint in the skills.
  • Iterates the meta-system architecture: do we need a new agent? Should we split two roles that are mixed? Is the brief length optimal? Those decisions still belong to the senior.

The senior no longer writes most of the code. They design the system that writes the code. And they review what it writes.

If your organization is thinking about adopting agentic architectures, the right question isn't "how many juniors can I replace?". It's "how many seniors do I need to supervise this system?". The answer is always greater than zero. And probably more than you thought.


What's coming in this series

This post is the manifesto. The technical series that follows:

  1. Day 1 — Walking skeleton: first A2A ping working
  2. Day 2 — From Auditor to Tech Lead: why names are prompts
  3. Day 3 — A2A + MCP together: exposing a repo as Model Context Protocol server
  4. Day 4 — The Anthropic billing block and the three-MODE pattern with Ollama
  5. Day 5 (next) — Side-by-side comparison Qwen Coder 32B local vs Sonnet 4.6 on the same brief
  6. Day 6 — Building the QA Tester with iteration loop
  7. Day 7 — The Project Tech Librarian that makes the system learn
  8. Day N — Cloudflare Workers deployment, real sandbox, multi-tenant

If you care about the how without losing sight of the why, subscribe. Every post keeps the honesty of real obstacles — including the bugs I hit, the reframes I made mid-implementation, and the moments I stopped to ask myself if I was building something useful or just sophisticated.

This system works because it's designed by someone who has seen too many times what happens when it doesn't. If you replicate it, make sure you carry the same scar tissue. If you don't, make sure you have a senior who does.


Agentikas MVP closes today with: 5 agents running, 1 MCP server, 2 pending agents (QA Tester, Tech Librarian), 3 modes per agent (mock/local/live), persistence to Cloudflare R2, npm workspaces monorepo, and a single parameterized Dockerfile for 7 services. The code is in the repo. The scars are in the blog posts to come.
