The Harness Is the Product: Testing iii Against My Multi-Agent App
A pragmatic comparison of my hand-rolled Vercel AI SDK multi-agent harness and iii's worker-based harness substrate: what each is good at, where each hurts, and what I learned testing the migration.
Every serious agent project eventually becomes a harness project.
At the beginning, you think you’re building an agent. You wire up a model, give it a tool, stream some tokens, maybe add a second agent because the demo looks better when one model argues with another model. Then the surface area starts expanding.
Who owns the turn loop? Where do tool calls get approved? What happens if the browser disconnects halfway through a run? Where does session state live? Can I cap spend? Can I see which agent burned the tokens? Can I swap a provider without rewriting half the code? Can I replay the run after something breaks?
That surrounding runtime is the agentic harness — the layer I’ve argued is the moat, not the model.
Frameworks and libraries all answer the harness question differently. Some give you a thin loop and say “bring your own production system.” Some give you a graph abstraction. Some give you a full platform. I wanted to feel the trade-off directly, so I built one version myself and then started testing a very different substrate: iii.1
This post isn’t a benchmark or a vendor dunk. It’s a build note from the messy middle: my initial AI SDK harness in multi-agents-team,2 the live playground at mat.umai-tech.com,3 what iii changes, where the migration feels good, and where it adds operational weight. (Full disclosure: I liked the engine contract enough to start contributing to iii’s Go SDK along the way, so read this as an interested party’s build notes.)
The short version
My in-app harness is the right shape for learning, demos, and fast iteration. iii is interesting when the harness needs production jobs: durable execution, policy, budgets, server-side state, and observability. The trade-off is that you now operate a real engine, not just a Next.js app.
The harness problem
When people talk about agents, they often talk about models, prompts, tools, and memory. Those matter. But after the prototype stage, the painful questions usually aren’t “which model?” They’re:
- Control flow: who decides the next step, and how do you stop loops?
- Tool execution: which calls are allowed, denied, or human-approved?
- Streaming: how does the UI see intermediate reasoning, tool calls, and final state?
- State: what survives a refresh, a deploy, or a server restart?
- Cost: where do token usage and budget limits actually get enforced?
- Observability: can you inspect a run by session, message, agent, or tool call?
- Provider boundaries: can you move from OpenAI to Anthropic, Mistral, or Fireworks without rewriting the app?
That list is the harness.
And once you see it, you start seeing harnesses everywhere. LangGraph is a harness with explicit graph control. AutoGen is a harness built around multi-agent conversations. Pi calls itself exactly what it is — a minimal agent harness you adapt to your workflows, not the other way around.4 OpenClaw is a harness for a personal assistant: one agent runtime wired into WhatsApp, Telegram, your inbox, and your calendar.5 The Vercel AI SDK gives you excellent model/tool/streaming primitives, but the harness around those primitives is still your job.6 Anthropic’s multi-agent research system is another harness shape: specialist agents, a lead agent, parallelism, and tool-heavy research coordination.7
I wanted a repo that made those trade-offs visible instead of theoretical.
What multi-agents-team is
MAT or multi-agents-team is a Next.js playground for multi-agent coordination patterns.2 The live version is at mat.umai-tech.com.3 The same user request can run through nine different architectures:
multi-agents-team
Nine architecture pages, one event contract
Each card links to the live architecture view on the playground. The point isn’t that one pattern wins — it’s that the same task can move through different coordination shapes and still stream back through the same UI.
Orchestrated
Central coordinator
plan → delegate → synthesize
A coordinator routes work to research, writer, and editor specialists.
Choreographed
Peer message bus
round-robin negotiation
Backend, frontend, and design peers coordinate through shared messages.
Hierarchical
Dynamic agent tree
lead → sub-agents → rollup
A lead spawns depth-capped sub-agents and synthesizes their results.
Evaluator–Optimizer
Critique loop
draft → score → revise
A generator improves a draft until a critic accepts the quality bar.
Debate
Adversarial panel
argue → rebut → judge
Opposing agents argue their case before a judge synthesizes the answer.
Blackboard
Shared workspace
select agent → write board
A controller chooses which specialist updates the shared board next.
Market
Auction board
post task → bid → award
Agents bid on work; the dispatcher awards tasks to the strongest fit.
Self-Consistency
Parallel sampling
sample N → select/merge
Several attempts run in parallel and a judge selects or merges the best.
Swarm
Shared scratchpad
many passes → convergence
Identical agents build on a shared scratchpad over capped rounds.
Why nine architectures?
Nine isn’t a magic number — it’s the smallest set that covers the coordination axes I cared about: who decides (a central coordinator, negotiating peers, or a market), how state is shared (handoffs, a blackboard, a common scratchpad), and how quality emerges (critique loops, debate, parallel sampling). Each pattern stresses the harness differently — handoffs, shared boards, bids, parallel samples — but every run has to emit the same AgentEvent stream. If the harness can carry all nine without special cases, the event contract is probably right.
The project is intentionally hands-on. Every mode has its own streaming API route. Every run emits a shared AgentEvent stream: workflow starts, iteration starts, agent steps, tool calls, handoffs, blackboard updates, bids, traces, samples, and final completion. The UI renders that stream live so you can watch the system reason, not just read the final answer.
The first harness is deliberately simple:
- Next.js API routes own the request lifecycle.
- The Vercel AI SDK handles model calls, tool definitions, and streaming-friendly primitives.
- A per-run conversation object owns the message bus.
- Provider selection is request-scoped with
AsyncLocalStorage, so a user’s API key doesn’t leak across concurrent runs. - Chat history lives in the browser.
- Cost is estimated and displayed, but not enforced.
- Human input exists, but the early version is in-memory and single-process.
original harness architecture
The whole first harness lived inside one app
Click a block to inspect the job it owns — the detail panel below the diagram updates in place.
mat.umai-tech.com
Browser chat UI
The public app owns the mode selector, provider settings, local chat history, live timeline, tree visualizations, and final rich summaries.
Responsibility
User-facing control plane
This was the right first move. It kept the coordination logic close to the app. If I wanted to understand why a debate run behaved differently from a blackboard run, I could open the runner file and read it. No distributed system. No platform ceremony. No second runtime.
Could I have skipped the hand-rolling?
Probably. openharness is an open-source SDK that builds exactly this kind of harness on top of the Vercel AI SDK — stateless agents, composable middleware, tool-permission callbacks, context compaction, subagent hierarchies.8 If I’d wanted a harness off the shelf, that’s where I’d have started. But the point of this repo was the opposite: build each harness job by hand first, so I’d understand what I was buying when I later reached for a substrate like iii.
But as soon as I started asking production questions, the harness got louder.
The limits of the hand-rolled harness
The in-app harness is honest about what it is: a readable local runtime for exploring coordination patterns. That’s a feature. It’s also the boundary.
For a public demo, browser-local history is fine. For a customer workflow, server-side sessions matter. For a toy web search tool, “the user clicked run” is enough permission. For a production agent touching internal systems, every tool call needs policy. For a blog-worthy demo, estimated cost is enough. For a real product, budget caps need to stop the run, not just decorate the UI.
The uncomfortable part is that none of these problems are exotic. They’re the normal jobs around an agent run:
One app, nine jobs
Turn loop
Events to UI
Tools
Policy
Budget
Sessions
Human approval
Observability
Deployment
That last row is the reason I still like the original harness. One app is hard to beat. But the middle rows are exactly why I became curious about iii.
What iii is
iii — source-available on GitHub9 — is a Rust engine built by the team behind the Motia backend framework.110 The repo sits at roughly 17k stars, the engine is licensed under ELv2 with Apache-2.0 SDKs,11 and the docs describe a system built from three primitives — workers, functions, and triggers — with state, streams, and queues shipped as built-in workers.12 The important mental model comes from the founder’s own harness write-up:
— Mike Piccolo • How to Build Your Own Agent Harness, iii (May 2026)Pick the workers. Write the missing ones. Compose. The harness is the composition.
In iii, workers connect to an engine over WebSocket and register functions and triggers. The engine routes calls, manages worker connections, exposes modules for HTTP/state/queues/streams, and gives the system a common substrate.13
Three primitives, one interface
Worker
hosts the workAnything that opens a WebSocket to the engine and registers functions and triggers. A queue, a scheduler, an HTTP edge, a browser tab, an agent, a sandbox — each is a worker.
Workers run anywhere — a laptop, a container, a browser tab, a microVM — in any language that can hold a WebSocket.
mat-iii-worker · iii-state · iii-http
Function
is the workA named handler inside a worker: payload in, result out. Function IDs follow a service::name convention, so they stay stable across worker restarts and language boundaries.
Any worker can call any function through the engine — which is what makes a harness job swappable: register the same ID, replace the layer.
math::add · policy::check_permissions
Trigger
starts the workWhat causes a function to run. A trigger has a type — HTTP, cron, queue message, state change, or another function calling trigger — a configuration, and the function ID it invokes.
The same function can sit behind several triggers: an HTTP route for the app, a cron for batch runs, a queue for durability.
POST /run → turn::run
Engine
routes between themThe engine is the coordinator: it accepts worker connections over WebSocket, keeps a live registry of every registered function and trigger, and routes each invocation to whichever worker currently provides that function ID. Nothing talks to anything directly — every arrow in the topology is a worker-to-engine connection.
That changes the harness shape. Instead of one application process owning every concern, the harness can be decomposed:
- A provider worker streams model output.
- A policy worker checks permissions.
- A budget worker records and enforces spend.
- A state layer owns sessions.
- A queue keeps long-running work alive after the initial request.
- A trace layer gives you OpenTelemetry spans across the run.
- A custom worker can replace one layer by registering the same function IDs.
The iii blog post frames this as “build your own agent harness.”14 I think the stronger interpretation is: don’t confuse the harness with the application framework.
The harness is the set of jobs that make an agent safe, durable, observable, and operable. iii’s bet is that those jobs should be workers on a shared bus, not hidden inside one framework object.
The primitive shift
The powerful idea isn’t “use this one agent framework.” It’s seeing queues, streams, state, model providers, policy gates, approval surfaces, browser tabs, and business services as the same kind of thing: workers. If a capability is missing, add or replace a worker. Don’t keep changing the core engine.
Contributor note
This is the Go SDK work I mentioned up top: it sits alongside the Node, Python, and Rust SDKs.15 It’s also why the primitive argument feels concrete to me — once the engine contract is stable, another language becomes an SDK layer, not a rewrite of the engine.
Testing iii inside my repo
The useful thing about my multi-agents-team setup is that the app now has two backend paths.
The default path is still the in-app harness. The chat UI sends a request to the relevant /api/agents-v* route, the route validates credentials, then the local runner streams events back to the browser.
The iii path keeps the same front-end contract but changes where the turn runs:
- The UI sends the same mode, model, provider, message, history, and conversation ID.
- The API route sees
backend: "iii". - The app posts the turn to the iii engine’s HTTP trigger.
- A worker runs the existing agent loop and emits the same
AgentEventshape. - The Next.js app forwards those events back to the same chat UI.
That last point matters. I didn’t want to rewrite the product around iii. I wanted to test whether the harness layer could move while the visible app stayed stable.

The visible product stays the same: the user selects a model, a coordination pattern, and an execution backend. The iii engine path is a harness swap behind the chat UI, not a second product.
The adapter does three pragmatic things:
- It fails closed when no iii engine is configured, with an actionable error and a path back to the in-app harness.
- It accepts live SSE from the engine when the worker streams events over the HTTP response.
- It also supports a queued path where a run gets a
runId, keeps executing on the engine, and the app polls events until completion.
There is also a small policy bridge. My tools are still defined inline in the agent factories. I didn’t want to couple those tools directly to the iii SDK. Instead, the iii worker installs a request-scoped policy checker; tools call a local policyCheck() function; the in-app backend has no checker, so it behaves as before. On the iii path, the check can forward to policy::check_permissions.
iii’s own reference harness makes the same call at a different layer: it ships as roughly fourteen workers in a separate workers repo, and its policy gate fails closed — a five-second timeout counts as a deny.16
That’s the migration pattern I like: keep the app’s internal abstractions stable, move one harness concern at a time.
How I deployed the iii harness on Fly.io
The Next.js app can live happily on Vercel because the in-app harness is just route handlers. The iii backend is different. It’s a long-running engine plus a worker that needs to stay connected to that engine over WebSocket, so I deployed that part separately on Fly.io.17
The high-level shape is:
- Vercel: serves the public
mat.umai-tech.comNext.js app. - Fly.io: runs
mat-iii-engine, a Docker image bundling the iii engine and myiii-worker. - Inside the Fly machine: the worker connects to the local engine bus on
ws://localhost:49134. - Public edge: the engine exposes the worker’s
POST /runandGET /healthtriggers over HTTPS on port3111. - Shared secret: Vercel sends
III_ENGINE_TOKENas a bearer token; the Fly worker checks the same secret before accepting a run.
The important deploy detail: I run the Fly app as one machine. In this version, queued run events and stream state live in that engine process. If Fly sends POST /run to one machine and GET /events to another, the app can lose the run’s event stream. So the Fly config keeps min_machines_running = 1, disables auto-stop, and the deploy should stay single-instance until the state/stream layer is externalized.
The setup is intentionally boring:
# Create the Fly app without deploying yet.
fly launch --no-deploy
# Set the shared secret used by the Next.js app and the iii worker.
fly secrets set III_ENGINE_TOKEN=$(openssl rand -hex 32)
# Keep this single-machine for now: queued events/state are process-local.
fly deploy --ha=false
# If Fly ever scales it up, force it back to one machine.
fly scale count 1That gives me the split I wanted: Vercel keeps serving the product UI, Fly runs the long-lived harness substrate, and the two meet at one explicit seam: POST /run.
This is a phased migration
I wouldn’t describe this as “the app is now an iii app.” The honest description is better: the app has a working in-app harness and an iii backend path for testing production harness concerns without throwing away the original runners.
My harness vs iii
Here is the comparison I wish I had when I started.
Thin app harness vs worker substrate
The original harness wins on approachability. It’s easy to clone, run, read, and modify. It’s the harness I’d show someone who wants to learn how multi-agent patterns actually behave.
iii wins on separation of concerns. Policy shouldn’t be sprinkled through random tool functions. Budget enforcement shouldn’t be a UI label. Long-running agent state shouldn’t depend on a browser tab or a serverless request staying alive. If those are your problems, iii’s worker model starts to make sense.
But this isn’t free.
What feels better with iii
The biggest improvement is that the harness jobs become explicit.
In my first version, “the harness” was spread across routes, runners, event types, provider utilities, local storage, and some UI assumptions. It worked because the app was small and the author was me. That’s not a production architecture principle.
With iii, a policy layer isn’t an afterthought — it’s a function call. A budget layer isn’t a comment — it can be a worker. The approval path doesn’t have to be a React component — it can be another system that writes a decision into state. A Slack approval surface and a console approval surface can both call the same underlying function.
That composability is the interesting part. The claim isn’t “iii has the perfect default harness.” It’s that a harness is easier to evolve when its responsibilities are connected by stable function IDs instead of fused into one framework.
It also makes partial migration realistic. I can keep the Vercel AI SDK where it is useful. I can keep my nine runners. I can keep my event schema. Then I can move durability, policy, budget, and trace concerns into iii one by one.
That’s a good engineering shape — and it’s the old, proven one. Small primitives with one interface compose; big frameworks with many interfaces accrete. When the harness is a set of functions on a bus, “we need budget enforcement now” is a worker you write this week, not a feature request on someone else’s roadmap. That’s the real power of primitives: composability turns the harness from a product you adopt into an architecture you evolve.
What still hurts
The cost is operational complexity.
The in-app harness has one deployable. The iii path introduces another runtime: engine configuration, worker registration, auth token, health checks, event transport, timeout behavior, and failure modes between the app and engine.
That’s not a reason to reject it. It’s a reason to be honest about when you need it.
If I’m teaching multi-agent patterns, I don’t want a distributed system in the way. If I’m running a weekend experiment, I don’t need a budget worker. If the tool is just web search against a user’s own API key, a full policy engine may be more architecture than product.
But if the agent can call internal tools, mutate business state, spend real money, or run for minutes in the background, the “simple” harness starts hiding risk. At that point, the second runtime may be cheaper than the pile of bespoke code you were about to write badly.
The trade-off
A thin harness optimizes for momentum. A thick harness optimizes for control. The mistake is pretending one of those is always the grown-up answer.
Where iii sits in the harness landscape
iii isn’t entering an empty field — every serious agent stack already answers the harness question somewhere. LangGraph checkpoints graph state inside your process.18 Microsoft folded AutoGen and Semantic Kernel into one agent framework with an actor-style runtime.19 OpenAI’s Agents SDK keeps the loop thin and in-process,20 while Temporal ships an official integration that turns that same loop into durable workflow code.21 Anthropic packages its own coding harness — loop, tools, permissions, subagents — as the Claude Agent SDK.22 Pi stakes out the same thin pole even more aggressively: a deliberately minimal harness you adapt to your workflows instead of the other way around.4 OpenHands wraps each coding session in a sandboxed per-session runtime.23 Restate, Inngest, and Hatchet sell durable execution that agent builders increasingly borrow.242526
The differences aren’t feature lists. They’re answers to one question: which layer of your system owns the harness jobs?
Five answers to where the harness jobs live
In-process framework
Thin SDK
Durable-execution substrate
Session platform
Worker bus
iii’s nearest neighbors are the durable-execution substrates. The difference is the unit of composition: Temporal and friends give you durable workflows you write; iii gives you a live function registry on a bus, where each harness job is a worker you can replace — including with one written by another team, in another language. That’s a stronger composition story and a younger ecosystem, and both halves of that sentence matter.
What I would want from iii next
Reading my integration against iii’s current docs and harness write-up, I wouldn’t frame the gaps as “iii lacks primitives.” The primitives are the interesting part. The next step is productized harness ergonomics: fewer custom adapters, clearer production defaults, and stronger contracts around runs, events, and governance.1214
The primitives are strong; the defaults can get sharper
A first-class run contract
Browser-friendly event reads
Worker lifecycle guardrails
Production HA recipes
Governance starter packs
Replay and evals as primitives
That distinction matters. A worker substrate can always say “write a worker.” That’s powerful, but it’s also where early adopters pay the integration tax. The more iii can package the common harness seams, the easier it becomes to recommend for teams that aren’t trying to become harness experts.
The slider, not the religion
The best framing I have found is that the harness is a slider.
At one end: a thin loop around an LLM call. Great for exploration. Low ceremony. Easy to debug. Almost no production guarantees.
At the other end: durable queues, server-side state, policy gates, human approvals, spend caps, worker-level traces, and provider abstraction. More moving parts. More control.
My in-app harness lives closer to the thin end. iii lets the system move toward the thick end without forcing every concern into the same application process.
That’s the part I find compelling. Not “replace your app with iii.” Not “frameworks are dead.” More practical:
Keep the agent logic where it is easiest to reason about. Move the cross-cutting harness jobs to a substrate when those jobs become real.
For multi-agents-team, that means the original harness remains the default. It’s the best path for trying the nine patterns and understanding coordination trade-offs. The iii backend is the experiment for what happens when those same patterns need production properties.
And honestly, that’s how most AI systems should evolve. Start with the thinnest harness that teaches you something. Don’t cargo-cult a platform on day one. But when the run starts touching real systems, stop pretending a route handler and a clever prompt are enough.
References
Footnotes
-
iii, iii homepage. ↩ ↩2
-
Marcus Elwin, multi-agents-team GitHub repo. ↩ ↩2
-
Marcus Elwin, multi-agents-team live demo. ↩ ↩2
-
Earendil, Pi — “Pi is a minimal agent harness. Adapt Pi to your workflows, not the other way around.” ↩ ↩2
-
OpenClaw, openclaw.ai — open-source personal AI assistant that runs on your own machine and connects to WhatsApp, Telegram, and other chat apps. ↩
-
Vercel, AI SDK documentation. ↩
-
Anthropic Engineering, “How we built our multi-agent research system”. ↩
-
Max Gfeller, openharness — a composable agent-harness SDK built on the Vercel AI SDK. ↩
-
iii HQ, iii GitHub repo. ↩
-
iii documentation, “Migrating from Motia”. ↩
-
iii HQ, iii GitHub repo: the engine is licensed under the Elastic License 2.0; the SDKs, CLI, and console are Apache-2.0. ↩
-
iii, documentation. ↩ ↩2
-
iii documentation, “Engine” and “Channels”. ↩
-
Mike Piccolo, “How to build your own agent harness”. ↩ ↩2
-
iii HQ, Go SDK package. ↩
-
iii HQ, workers repo, including the reference harness workers. ↩
-
Fly.io, documentation. ↩
-
Microsoft, Agent Framework GitHub repo. ↩
-
OpenAI, Agents SDK documentation. ↩
-
Temporal, documentation. ↩
-
Anthropic Engineering, “Building agents with the Claude Agent SDK”. ↩
-
All Hands AI, OpenHands GitHub repo. ↩
-
Restate, restate.dev. ↩
-
Inngest, AgentKit documentation. ↩
-
Hatchet, hatchet.run. ↩
Was this helpful?
Let me know what you think!