The Moat Isn't Your Model — It's Your Harness and Data Flywheel
Everyone is building agents. Almost nobody is building what makes agents defensible. We explore why the agentic harness and data flywheel — not the LLM — are where durable competitive advantage lives in 2026 and beyond.
Bloomberg spent ~$10M building its own 50B-parameter model. GPT-4 outperformed it on most financial tasks. And yet the Bloomberg Terminal — $28–32K per user per year, 325K subscribers, near-zero churn — remains untouchable.1 The moat was never the model. It was decades of curated data, 90+ proprietary models, and workflow integration so deep no horizontal player can pry it loose.
That gap — between who has the best model and who has the most defensible product — is the story of AI in 2026.
Since ChatGPT-3.5 we’ve moved through three architectural waves: stacked LLM calls with prompt engineering and RAG, then agentic loops for things like legal review, and now what people are calling the agentic harness — the runtime that wraps a model with tool use, memory, guardrails, and workflow integration. The vocabulary shifted from prompt engineering to context engineering, from agentic loop to agentic harness. Underneath, the same observation: the LLM is one component in a larger system, and the system is what creates competitive advantage.
Models are commoditized. Harnesses are not. The data flywheel — the feedback loop that turns every user interaction into a proprietary training signal — compounds in a way no API swap can replicate. This post is about why that’s where durable advantage now lives, and what it means if you’re building. We’ll look at the model-commoditization curve, the platform giants’ move into every vertical, the architecture of the harness itself, the flywheel in production at Stripe and Bloomberg and NBIM, and why vertical SaaS isn’t just surviving the AI wave — it’s the structural winner.
The Model Layer Is Commoditizing — Fast
Since ChatGPT’s launch in late 2022, we’ve witnessed a tremendous pace of model commoditization and democratized access to intelligence that few predicted would happen this quickly. What started as a single breakthrough has become an industry-wide race to the bottom on pricing, with capabilities that were once exclusive now available to anyone with an API key. As illustrated below we are seeing drastic cost reductions and rapid performance convergence across the board — a textbook case of commoditization playing out in real time:
The Model Commoditization Curve
Frontier model pricing collapse: November 2022 to April 2026
Chart shows output token pricing trends for frontier LLMs from November 2022 to April 2026.
Models tracked: OpenAI (GPT-3.5, GPT-4, GPT-4 Turbo, GPT-4o, GPT-5.2, GPT-5.5), Anthropic (Claude 2, Claude 3 Opus, Claude 3.5 Sonnet, Claude Opus 4, Claude Opus 4.7), Google (Gemini 1.0, Gemini 2.0 Pro, Gemini 3.1 Pro), and DeepSeek (R1).
Price index methodology: Normalized to GPT-4 March 2023 pricing as baseline (index = 100). Each data point represents the average output token price for comparable frontier-tier models at time of release, adjusted for relative performance on standard benchmarks (MMLU, HumanEval, GSM8K, SWE-bench).
Key inflection points: DeepSeek R1 (Dec 2024) triggered a 43% single-quarter price drop by undercutting proprietary models by ~90%; Claude 3.5 Sonnet demonstrated mid-tier models matching flagship performance at lower cost; GPT-4 Turbo introduced tiered pricing; April 2026 saw intense competition with GPT-5.5 ($5/$30 per million tokens) and Claude Opus 4.7 ($5/$25) both achieving near-parity pricing at frontier performance levels.
Sources: OpenAI API Pricing (GPT-5.5 launch April 23, 2026), Anthropic API Pricing (Opus 4.7 launch April 16, 2026), Google AI Pricing, DeepSeek Pricing Docs, BenchLM.ai LLM Pricing Trends, Epoch AI Price Performance Analysis (2025-2026), Menlo Ventures Enterprise API Market Share Report (Mid-2025).
The numbers tell a relentless story. According to Epoch AI, the price to achieve GPT-4-level performance is falling roughly 40x per year — with some benchmarks showing drops as steep as 900x annually2. GPT-4 output token pricing has plummeted over 90% since March 2023. DeepSeek R1 matched OpenAI’s o1 reasoning performance for a training cost of just ~$5.6M versus the estimated $100M+ for comparable proprietary models. Open-source models now deliver approximately 90% of proprietary capability at roughly 17x lower cost, with the quality gap shrinking from 15–20 points to just 9 points between October 2024 and mid-2025. Parity is projected by Q2 2026.
The enterprise response has been decisive: multi-model deployment based on user intent and problem complexity is now standard operating procedure. 37% of enterprises use five or more models in production, and 69% of companies using Google models simultaneously use OpenAI3 — treating models as interchangeable components rather than strategic differentiators. Enterprise API spending doubled from $3.5B to $8.4B in just six months, even as per-token prices collapsed. Volume is exploding while unit economics race toward zero.
It’s not just inference cost that’s commoditizing — reasoning capability itself is following the same curve. OpenAI’s o3-mini democratized “PhD-level reasoning” at roughly 15x cheaper than o1. Gartner projects a 90% cost reduction by 2030 for inference on trillion-parameter models4. The provider landscape is fragmenting rapidly: Anthropic now commands 32% of enterprise API spend, surpassing OpenAI, which fell from 50% to 25%5.
Even the leaders building these models acknowledge the shift. Sam Altman of OpenAI admits, “We will maintain less of a lead than we did in previous years.”6 Satya Nadella of Microsoft observes, “AI is turning into a commodity we just can’t get enough of.”7 Nandan Nilekani of Infosys states plainly, “The models will become more commoditized and the value will switch to the application layer.”8
— Jared Spataro •MicrosoftThe LLM is the CPU. As powerful CPUs became commodities, the value shifted to the overall system.
If the model is the CPU, the critical question becomes: what is the actual computer? What serves as the motherboard, the memory, the storage, the I/O controllers? And more importantly — how do you architect and build it to create durable competitive advantage? When both horizontal LLM providers and open-source models are racing toward parity, the moat has to live somewhere else. The next two sections argue that the harness and the flywheel are where that “somewhere else” is.
The Platform Giants Are Coming for Every Vertical
The last twelve months have seen an unprecedented horizontal-to-vertical land grab. OpenAI, Anthropic, and Google aren’t just building better models — they’re launching dedicated products that compete directly with vertical SaaS incumbents across legal, design, coding, healthcare, finance, and commerce.
The pattern is clear: acquire or build specialized tools, integrate them into the platform, and ride the distribution advantage. When Anthropic launched its first Claude Legal plugin in February 2026, the market reacted immediately — Thomson Reuters fell 16%, RELX fell 14%, and Wolters Kluwer fell 13% in a single session, wiping out an estimated $285 billion in market value from software and legal technology companies. A week later, when Claude Design launched (April 17, 2026), Figma’s stock fell 7% as investors realized Anthropic was gunning for their core market. Then in May 2026 Anthropic escalated: Claude for Legal went open on GitHub with 12 practice-area plugins, 80+ specialized agents, ~20 MCP connectors, and a Managed Agents API9 — picking up 882 stars and 165 forks in 24 hours. The Feb release was a probing shot. The May release was a vertical product, free, model-native, and distributed to every Claude user on day one.
OpenAI is taking a different approach through acquisitions: Windsurf IDE ($3B, 2025) for coding, Torch Health ($100M, January 2026) for healthcare, and Hiro Finance (April 2026) for personal finance. Google is embedding Gemini directly into Workspace and partnering with enterprises like Citi for “Citi Sky,” an AI-powered wealth advisor. And it isn’t just the US labs — ByteDance’s TRAE has shipped its own coding harness, framing itself explicitly as “SOTA Model (Wild Horse) + Harness (Control System)”10 and going head-to-head with Cursor, Claude Code, and Windsurf. The vertical land grab is now global.
The matrix below scores each platform’s incursion into each vertical from 0 to 100, using market reaction (stock drops, market-cap shifts), strategic investment, product maturity, and distribution advantage as inputs. Click any cell to see the underlying play and sources:
Platform Giants: Verticalization Risk Matrix
How OpenAI, Anthropic, and Google are entering vertical markets (April 2026)
| Vertical | OpenAI $852B valuation | Anthropic Claude Opus 4.7 | Google Gemini 3.1 Pro |
|---|---|---|---|
| Legal | |||
| Design | |||
| Coding | |||
| Healthcare | |||
| Finance | |||
| Commerce |
Methodology: Risk scores (0-100) assess the threat level each platform play poses to incumbent vertical SaaS providers, based on: market reaction (stock drops, market cap impact), strategic investment size, product maturity, distribution advantage, and timing. Click any cell to view detailed sources and analysis. Scores reflect verticalization risk as of April 2026.
Key takeaways from the verticalization risk matrix:
- Design is high-risk (Anthropic: 85/100) — Claude Design, Code to Canvas, and Figma plugins go directly at a $50B+ market where Figma’s main moat is workflow rather than data depth. Anthropic’s CPO Mike Krieger stepped down from Figma’s board three days before launch.
- Coding is the most contested category (OpenAI: 80, Anthropic: 75, Google: 65) — Windsurf IDE, Claude Code, and Gemini Code Assist all compete for the same developer mindshare. Cursor at $50B valuation shows what a harness moat looks like in the same space.
- Legal is high-risk but well-defended (Anthropic: 72/100) — May 2026’s Claude for Legal shipped 12 plugins, 80+ agents, and 20 MCP connectors as one open-source release. But the incumbents moved faster: Harvey and Legora are both past $100M+ ARR with custom legal embeddings, iManage/LexisNexis integrations, and firmwide rollouts at Bird & Bird, Cleary, Linklaters, BCLP, and HSBC.
- Healthcare and Finance are acquisition plays (OpenAI: 70–75) — Torch Health and Hiro Finance buy clinical and personal-finance distribution OpenAI can’t grow organically; Google counters with Gemini Healthcare’s air-gapped enterprise deployments.
But the risk scores above aren’t destiny. The companies most exposed to this land grab are the ones whose product is a thin layer over a foundation model — generic chat UIs, surface-level prompt wrappers, or features a platform can replicate in a weekend launch. Conversely, vertical players that have built genuine stickiness through workflow integration, exclusive datasets, regulated data pipelines, embedded financial infrastructure, and compounding data flywheels are structurally insulated. A platform giant can ship a “legal mode” or “finance mode,” but it cannot ship decades of jurisdiction-specific eval suites, customer-curated ground truth, or the EHR/CRM/checkout integrations that took years to negotiate. The further your moat sits from the model layer, the lower your effective risk — regardless of what the matrix says about your vertical.
Legal is also the cleanest test case for the thesis. Two forces are pulling on the category at once: Anthropic from above with Claude for Legal (free, distributed to every Claude user), and Mike OSS from below — an AGPL-v3 open-source clone built in two weeks by a former Latham solicitor that testers say reaches roughly 80% feature parity with Harvey and Legora at zero license cost.11 If the moat were the model, both pressures would be terminal. But the category leaders are still winning, because they’ve been building the parts of the moat that neither a plugin nor an OSS clone can replicate. Harvey crossed $100M+ ARR with 700+ orgs including 45 AmLaw 100 firms and HSBC, and trained its own legal embeddings (voyage-law-2-harvey) on 20B+ tokens of legal text, paired with iManage and LexisNexis integrations.
Legora crossed $100M+ ARR at a $5.6B valuation, with firmwide rollouts at Bird & Bird, Cleary Gottlieb, Linklaters, and BCLP — collaborative multi-lawyer architecture, firm-specific playbooks, 18-month enterprise sales cycles already in motion. The model commoditizes. The embeddings, the workflows, the firm-specific data, the integrations, the enterprise procurement scars — those don’t. Same lesson as the radar: the moat has to live somewhere the platform and the open clone can’t reach. Harvey and Legora are betting their futures on living there.
The Message Is Clear
Horizontal players want your vertical. But wanting it and owning it are very different things. The risk scores above are ceilings, not floors — companies with deep harnesses and live data flywheels operate well below them. The next section explains why.
Why the Harness Creates the Moat, Not the Model
The dominant narrative in 2023–2024 was that the model was the product — bigger weights, better data, smarter architectures. Anything wrapped around it was “just engineering.” Two years later that framing has flipped, and the most useful frame for what changed comes out of Stanford NLP.
Since 2020, Khattab and collaborators had been building compound LM systems — ColBERT-QA, Baleen, Hindsight — that consistently outperformed monolithic models on the same hardware. That research line produced DSPy in October 202312: treat prompts as compilable artifacts and pipelines as optimizable programs tuned against a target metric. Four months later, the BAIR essay “The Shift from Models to Compound AI Systems” — co-authored by Zaharia and Khattab — gave the movement its name.13
— Matei Zaharia et al. •Berkeley BAIR Lab, Feb 2024State-of-the-art AI results are increasingly from compound systems with multiple components, not monolithic models.
Addy Osmani put the market consequence of that shift more bluntly a year later: the industry is moving from LLM APIs (which return completions) to Harness APIs (which return a runtime).14
— Addy Osmani •Google Chrome, May 2026Agent = Model + Harness. If you’re not the model, you’re the harness. A decent model with a great harness consistently beats a great model with a bad harness.
A simpler model I like to use to describe this before the deep dive:
What do we mean by an agent + harness?
- Instructions
- Model
- Tools
- Memory
Everything around the model — the instructions it follows, the tools it can call, and the memory it keeps between turns.
A simplified view
Everything around the model — the instructions it follows, the tools it can call, and the memory it keeps between turns.
The rest of this section unpacks what each of those wrappers actually does in production, and why their depth — not the underlying model — is where the moat lives.
What Exactly Is the Agentic Harness?
The harness is everything that surrounds the LLM in production:
- Tool integration — browsers, terminals, APIs, file systems
- Context & state management — memory, session persistence
- Evaluation frameworks — continuous output verification against business rules
- Observability pipelines — hallucination, drift, failure monitoring
- Security guardrails — deterministic validation of probabilistic outputs
A useful architectural distinction: scaffolding is what happens pre-execution — system prompt compilation, tool schema registration, subagent setup. The harness is the runtime orchestration layer that wraps the reasoning loop, dispatching tools, managing context, and enforcing safety. The OpenDev paper (2026) formalized this with a five-layer defense-in-depth safety model.15 TRAE’s harness guide frames the runtime more vividly: a REPL container wrapping the non-deterministic LLM “brain” with deterministic Read → Eval → Print → Loop boundaries — context assembly, tool dispatch, observation re-injection, repeat.10
The central architectural move is decoupling probabilistic reasoning from deterministic execution. The LLM (“Cognitive Engine”) drafts a probabilistic plan; a “Reasoning Orchestrator” validates that plan against business rules, access controls, and compliance requirements before anything executes. In the diagram below the LLM sits as one numbered step inside the harness — it’s the only step the harness doesn’t own.
The Pattern Has Had Several Names
“Agentic harness” is the 2026 label for a pattern the ML world has been naming and re-naming for a decade. Worth tracing the lineage so it’s clear this isn’t a new invention — just the current vocabulary:
- 2015 — “ML system” (Sculley et al., NeurIPS). Hidden Technical Debt in Machine Learning Systems16 popularized the now-canonical diagram: a tiny “ML Code” box surrounded by a much larger sprawl of configuration, data collection, feature extraction, monitoring, serving infrastructure. Same insight, different model class: the model is small, the system around it is large.
- 2018–2022 — “MLOps stack.” Uber’s Michelangelo, Google’s TFX, Databricks’ end-to-end pipelines. Feature stores, model registries, serving, drift monitoring. The infrastructure got named and productized, but the architectural shape stayed the same.
- Feb 2024 — “Compound AI Systems” (Zaharia et al., BAIR). Pulled the pattern into the LLM era: state-of-the-art results increasingly come from systems with multiple components, not monolithic models.13
- 2023–2024 — “Cognitive architecture.” Coined by Flo Crivello, popularized by Harrison Chase / LangChain17 and formalized in Sumers et al., Cognitive Architectures for Language Agents (arXiv 2309.02427). Reframed the same wrapping-layer as how the agent thinks: memory, planning, action selection.
- 2024 — “Agentic loop / agentic workflow.” The vocabulary that took hold once tool-using agents started showing up in legal review, coding, customer support.
- 2025–2026 — “Agentic harness.” Mitchell Hashimoto, Addy Osmani, Aparna Dhinakaran, TRAE.10 Same pattern, sharper framing: the harness is what wraps the LLM and makes it deployable.
Same shape, six labels. What’s new in 2026 isn’t the architecture — it’s that the LLM moved from being the system to being one component of the system, and the rest of the components (evals, memory, tool dispatch, guardrails) finally got first-class treatment in production.
The diagram below is the 2026 canonical depiction — the same wrapping-layer pattern Sculley drew around a tiny ML model in 2015, redrawn around a probabilistic LLM core. Seven numbered steps in the main flow, three supporting layers (memory, guardrails, observability), and one component you don’t own: the Cognitive Engine. Click any node to see what it actually does and why depth there creates defensibility.
Click any component to explore how it creates defensibility
A production-grade harness has to deliver on four pillars — TRAE’s R.E.S.T. framework10 is a useful checklist:
- Reliability — fault recovery from checkpoints, idempotent write operations, behavioral consistency under identical inputs
- Efficiency — strict budgets for tokens/API calls/compute, low-latency first response, high throughput for batch work
- Security — least-privilege tool permissions, sandboxed execution for untrusted code, I/O filtering against prompt injection and PII leaks
- Traceability — end-to-end call chains, explainable decisions with attribution, auditable state at any historical point
Without these, you don’t have a harness — you have a chatbot wrapped in JSON.
Between 2025 and 2026 the case for harness engineering moved from theory to empirical fact:
Three signals that settled the debate
- Same model, different harness, wildly different rank. Claude Opus 4.6 placed #33 in Terminal Bench 2.0 inside its native Claude Code harness — and #5 in a different one.18 Princeton HAL confirmed it across 21,730 rollouts: the optimal scaffold flips by model family.19
- Independent teams converged on the same architecture. Cursor, Claude Code, Windsurf, Codex, and Arize’s Alyx — different companies, different domains — landed on near-identical harness designs.20 When the shape repeats, the shape is the discipline.
- Capital is pricing it in. Cursor reached $50B on the same underlying models as its competitors. Y Combinator’s Garry Tan calls evals “the real moat for AI startups.” Tidemark Capital is blunter: “The model is replaceable. Your orchestration layer isn’t.”21
The model is the rented car; the harness is the road network. A faster car gets you nowhere if the roads don’t reach your destination. The same gap shows up across five moat dimensions — durability, switching cost, data defensibility, domain depth, and compounding — when you put model-centric and harness-centric strategies side by side:
Model Moat vs. Harness + Flywheel Moat
Model-Centric Moat
Harness + Flywheel Moat
Competitors replicate within months; open-source catches up fast
Accumulates domain logic, evals, and workflows over years
API swap is trivial — change one line of code
Ripping out = ripping out the workflow itself
Trained on public/purchased data; no proprietary loop
Proprietary feedback loop; data created through product usage
General-purpose; surface-level across all verticals
Deep, jurisdiction-specific compliance and domain expertise
Improves with scale but gains are shared industry-wide
Every interaction makes the system smarter — gap widens daily
I’ve worked through three model paradigms — classical ML, then deep learning, now agentic systems. The vocabulary keeps changing; the underlying lesson hasn’t. The model is the engine; the harness is what makes it useful in production. Without one, even a frontier model is a wild horse — fast, expensive, and headed somewhere you didn’t ask it to go. The companies investing in the harness now are the ones whose products will still be hard to replace when the next 40x price drop lands.
The Data Flywheel: Your Only Durable Advantage
General internet-scale data is abundant but increasingly reused and shared — every model trains on roughly the same corpus. And the supply is now effectively unbounded: Hugging Face’s Synthetic Data Playbook and earlier work like Cosmopedia have shown you can generate trillions of high-quality synthetic tokens for pre-training at near-zero marginal cost.22 If anything, that reinforces the thesis — commodity data flowing into commodity models. What moves the needle is specialized, high-signal data generated in real time inside closed-loop systems that only your product can create.
The five-stage loop below is the mechanic: every interaction turns into proprietary data, which trains better models, which make a better product, which drives more usage, which produces more data. Twelve to twenty-four months in, the gap between you and a new entrant becomes uncrossable:
The Data Flywheel
Inner loop = what feeds the flywheel. Outer ring = what compounds at scale. Click any node for details.
Click a stage to see details
Flywheel Building Blocks
- HITL workflows — every correction a domain expert makes (a lawyer flagging a clause, a radiologist re-labeling a finding, a fraud analyst overturning a decision) is gold-standard labeled data no scraper can replicate. Crucially, this is signal you generate as a byproduct of doing the work — not data you have to pay to acquire.
- Workflow integration — AI running inside the EHR, the CRM, or the checkout flow sees the full context: who the user is, what step they’re on, what just happened upstream. A platform plug-in sees a fragment of that, and only sometimes. Owning the surface means owning the context, and owning the context is what makes the inferences both better and defensible.
- Low-latency feedback — the shorter the loop between observation, correction, and re-training, the faster your edge compounds. A team that ships eval improvements weekly will pull away from one that ships quarterly, even with identical models. Stripe’s quarterly compounding on payments fraud is the clearest live example.
- Proprietary generation — the most valuable training data is the kind that only exists because your product exists. Klarna’s agentic protocol queries, Revolut’s cross-surface event sequences, Bloomberg’s analyst annotations — none of these can be reconstructed from scraping the web. They’re synthesized by users interacting with the workflow your product owns.
Network Effects on Top of the Flywheel
The bullets above describe what feeds the flywheel. What the flywheel produces — once you have enough users on it — is a second compounding layer that no single-tenant deployment can match. Three effects stack on top of each other:
- Cross-cohort learning. A novel card-testing pattern caught at one Stripe merchant trains the fraud model for the other ~50,000 merchants on the network minutes later. A redlining edge case a Cleary lawyer flags inside Legora becomes a precedent the platform applies for every other AmLaw firm on the platform. The marginal user uplifts every other user — the classic network-effect shape, but mediated through the model rather than through social graph or marketplace liquidity.
- Reusable infrastructure across personas. The same harness — auth, audit trail, eval suite, retrieval layer — that serves general counsel can be re-pointed at compliance officers, paralegals, or in-house contract managers without rebuilding the substrate. Procore (the construction project-management platform) did this by extending from general contractors to subcontractors to architects on the same data backbone. Toast (restaurant POS and payments) did it by stacking POS → payroll → small-business loans on top of the same payment graph. Each new persona is incremental margin, not incremental rebuild.
- Per-segment tailoring → stickiness. With enough usage, you can fine-tune or fork the system per customer segment — sometimes per individual customer — and ship behavior they can’t get from a generic platform. Harvey’s firm-specific embeddings, Legora’s per-firm playbooks, Bloomberg’s 90+ proprietary models tuned for different desk types. The longer a customer is on the platform, the more bespoke the product becomes for them, and the higher the cost of switching to anything off-the-shelf. Stickiness isn’t a feature you sell — it’s a side effect of compounding.
The combined effect: as adoption grows, the product gets more differentiated, not less. That’s the inverse of platform commoditization. It’s also why vertical SaaS at 1,000 customers is a structurally different business than vertical SaaS at 100 — the flywheel doesn’t just spin faster, it produces different outputs at scale.
TCO Reduction Through the Flywheel
The flywheel also creates a powerful cost advantage. Organizations start with expensive 70B+ parameter models for complex tasks, then collect successful interactions and curate ground-truth datasets. These feed fine-tuning of smaller models (1B–8B parameters). The result: a customized 1B model on proprietary data achieves ~96% accuracy of a 70B model on the same task — at a fraction of the inference cost.
The Modern Data Moat Has Evolved
Static data moats are vulnerable — LLMs can generate synthetic approximations. True defensibility comes from owning the software interfaces where data is naturally and continuously generated. Capital markets are pricing this in: AI-native vertical SaaS is seeing a 65% funding surge toward “invisible AI” — embedded intelligence that lives inside existing workflows, not replacement chatbot interfaces.23
What does this look like in practice? The strongest evidence comes from two industries where data density is highest: finance and retail.
Where the Thesis Already Plays Out: Finance and Retail
Finance and retail are the cleanest proof grounds for the harness + flywheel thesis. Both have dense proprietary data, high stakes, deep workflow integration — and live deployments at scale. Three takeaways across the two industries:
Across 2025–2026 the biggest names in finance shipped their own foundation models — Bloomberg, Stripe, Revolut. The interesting fact isn’t that they built them. It’s that none of them are betting on the model as the moat. The data feeding the model is, and that data only exists because of pre-existing network effects and workflow lock-in:
Bloomberg
2023BloombergGPT (50B params)
$10M training cost on 369B tokens of proprietary financial data. GPT-4 outperformed it on most financial tasks.
Per-user/year pricing with near-zero churn. Decades of curated data, 90+ proprietary models, and embedded workflow the LLM is a retention feature inside.
Stripe
May 2025Payments Foundation Model
Single transformer trained on tens of billions of payment events. ~50K new transactions per minute become training signal.
Annual payment volume through Stripe in 2024. The architecture is buildable by anyone with a GPU budget; the corpus is not.
Revolut
April 2026PRAGMA (encoder family)
Trained on ~40B banking events / 207B tokens from ~25M users. Powers AIR, the in-app assistant now rolling out to 13M UK customers.
Transactions, app navigation, trading, push interactions into one user-level embedding. Cross-surface fusion only works if you own every surface.
Same pattern, three verticals. Foundation models are outputs of these network effects — they're not what creates them. The moat is upstream of the model.
The shape of the moat is consistent across the three cases12425: a model that’s expensive but replicable, sitting on top of a corpus that isn’t. And the corpus is a byproduct — of decades of Terminal workflow, of 1.3% of global GDP flowing through Stripe, of every interaction a Revolut user has with the app. That’s the live version of the flywheel argument from the previous section.
Retail tells the same story with different actors and a similar cadence — the flywheels here aren’t running on hype cycles, they’re running on transactions. Amazon Rufus crossed 300M+ users with interactions up 210% YoY, projecting $12B in incremental annualized sales on top of the recommendation engine that already drives 35% of total sales (~$70B/year). Shopify runs $1.1T GMV (>12% of US e-commerce) and uses SimGym to generate AI shopper personas from billions of real transactions, while Shopify Catalog runs AI-inferred categories across billions of products. Klarna’s Agentic Product Protocol covers 100M+ items, 400M prices, 12 markets through Stripe SPTs — every agent query feeds back into fraud, pricing, and recommendation models that no competitor can see. McKinsey projects agentic commerce will hit $3–5T globally by 203026; the companies positioned to capture it are the ones with the same shape of moat, not the ones with the best foundation model.
The clearest signal that this is working past the demo stage comes from the most risk-averse buyers on earth. NBIM, Norway’s $1.7T sovereign wealth fund, deployed Claude across its investment teams and reported ~20% productivity gains — equivalent to 213,000 analyst hours per year. AIG rebuilt its underwriting review workflow around an LLM harness and saw cycle times go 5× faster, with accuracy improving from 75% to 90%+. Neither of these is a pilot or a vendor case study. The gains came from harness engineering — workflow integration, evals, human-in-the-loop verification — not from the underlying model. The broader inflection backs them up: 72% of enterprises have moved from AI trials to production, with 40% projected to have task-specific agents by end of 2026 (vs less than 5% in 2025).
— Two Sigma •2026 OutlookThe next year won’t be about LLMs making trades. It will be about AI becoming the operating system for how quant research and investing actually work.
Why Vertical SaaS Is Still a Play
Despite the platform giants’ land grab, vertical SaaS isn’t just surviving — it’s thriving. The reasons are structural, not sentimental. The radar below scores four contenders — AI-native vertical SaaS, OpenAI/Anthropic, Mistral, and the generic thin-wrapper startup — across the six dimensions that actually decide who wins a vertical: data depth, workflow lock-in, regulatory depth, domain expertise, switching cost, and distribution reach. Hover any point for the reasoning behind the score:
Where Each Contender Actually Wins
Six dimensions, four contenders. Scores are estimated 1–10 based on current product depth, customer evidence, and the patterns covered above.
Contenders
Hover any point on the chart for the score and reasoning. Hover an axis label for the dimension definition.
Horizontal platforms (OpenAI, Anthropic, Mistral) dominate Distribution Reach but score thin on Workflow Lock-in, Regulatory Depth, and Switching Cost. Thin AI wrappers lose almost everywhere — there's nowhere defensible to stand. AI-native vertical SaaS occupies the one position the platforms structurally cannot reach: deep data, deep workflow, deep regulation, high switching cost. That's where the moat lives.
A quick orientation on the names that follow: Procore runs construction project management for general contractors and subs; Toast is the dominant restaurant POS, payments, and payroll platform; ServiceTitan does the same for home-services contractors (HVAC, plumbing, electrical); Veeva is the life-sciences cloud powering pharma CRM and clinical trials. None of them sell “AI” as a product. All of them sit on a decade of workflow data their customers can’t get anywhere else.
The data is the moat, not the feature. A general AI legal assistant sees ~15% churn — it’s too broad to be indispensable. A vertical SaaS tool for medical malpractice discovery commands 300% higher pricing with less than 3% churn. The difference is data depth and workflow specificity.
Vertical players own the operations layer, not just the software. Leaving Procore means ripping out field coordination, compliance documentation, and bid management simultaneously. Toast and ServiceTitan don’t just sell software — they process payments, payroll, and merchant accounts. LLMs can’t settle payments, underwrite loans, or interact with banking infrastructure. The AI output is a byproduct of deep integration — and the integration is what creates the switching cost. Revenue flows through transaction volume, making these businesses immune to seat-based compression.
Capital and history both agree. In 2025 verticals captured 53% of deal volume and 30% of capital ($186B) — 51% excluding mega-rounds like OpenAI and Anthropic. AI-native vertical SaaS is seeing a 65% increase in capital flow into 2026, with Series A medians at $22M vs $15M traditional (47% larger rounds). And the historical parallel is intact: despite AWS, Salesforce, and Azure dominating horizontal cloud, vertical SaaS produced Veeva (~$37B peak), Toast, Procore, and ServiceTitan. The top 20 public vertical SaaS companies hold ~$300B in combined market cap.
— George Kurtz •CrowdStrike CEOAs cloud was maturing, I heard a lot about the hyperscalers actually providing all the security services. Well, that didn’t happen.
Cognitive Debt: The New Technical Debt
Everything to this point has been about what happens when you do this right — the harness compounds, the flywheel turns, the moat deepens. This section is about what happens when you don’t. The downside of skipping harness engineering isn’t just slower compounding; it’s a category of debt that accumulates in your product and your team in ways that are genuinely hard to pay off later.
Deploy agents without a harness and you don’t just get worse output — you accumulate cognitive debt, the AI-era successor to technical debt. Same dynamic: short-term shortcuts compound into long-term liabilities. It compounds silently and manifests in three places that map directly back to the harness components from earlier:
- Chronic context loss across interactions — the Memory & Context layer of the harness diagram, but missing. Sessions don’t carry state, retrieval is shallow, the agent rediscovers the same things every turn. Symptoms: users repeating themselves, agents forgetting prior decisions, every conversation starting from zero. The fix is the harness’s tiered memory system, not a longer context window.
- Uncontrolled and hallucinated tool usage — the Reasoning Orchestrator is absent, so the LLM’s probabilistic plans execute directly against real systems. Agents call APIs they shouldn’t, write data they shouldn’t, retry destructive operations without idempotency. The R.E.S.T. framework’s Reliability and Security pillars exist precisely to prevent this, and METR’s testing shows even strong models fail here at high rates without a proper orchestration layer in front of them.
- Unreliable agent behavior that erodes user trust — the Verification Loop is missing, so the same class of failure keeps shipping. No eval suite means no regression detection, which means every deployment is a roll of the dice. This is the loop NGP Capital and Y Combinator both call the actual moat — and the absence of it is what the Princeton HAL data flagged across 21,730 rollouts.
— Randall HuntIf you don’t engineer the harness, you don’t get compounding leverage; you get compounding cognitive debt.
The damage extends to human capital too. Junior employees are deskilled by unharnessed AI — they learn to accept outputs without understanding them, never build the judgment that lets them catch when the model is steering them wrong. A well-engineered harness forces AI to show its work, cite sources, and pause for human-in-the-loop verification (the same HITL loop that feeds your data flywheel, incidentally — the upside and the safety mechanism are the same system). The harness doesn’t just protect your product; it protects your team’s ability to think. And once cognitive debt is established at either layer, paying it down looks a lot like a rewrite — which is the part of the analogy with technical debt that hurts most.
The harness, the flywheel, the vertical SaaS moat, the avoidance of cognitive debt — they’re not four separate strategies. They’re the same architecture, viewed from four angles. The next section pulls them into a single playbook.
The Playbook: Building for the Long Game
Everything in this post distills to five principles for builders. Each one is the practical compression of an argument from earlier — the harness, the flywheel, the network effects, the vertical-SaaS moat, the cognitive-debt avoidance. Together they’re not a list of nice-to-haves; they’re the architecture for a product that gets harder to compete with every quarter you operate it.
Own the workflow, not the output. The model produces tokens; the workflow produces switching cost. A general AI legal assistant churns at ~15%; a workflow-embedded vertical tool churns at less than 3% and charges 300% more for the same underlying reasoning capacity. Pick a domain with dense workflow data and regulatory complexity — logistics, manufacturing, real estate, healthcare — and integrate where ripping you out means ripping out the work itself. The output is the byproduct. The integration is the moat.
Build the harness first. Decouple probabilistic reasoning from deterministic execution from day one. That means: a real orchestration layer with policy gateways, a memory system tiered for both session and long-term context, an eval suite that gates every deployment, observability that catches drift before users do, and tool dispatch with idempotency and least-privilege scoping. The R.E.S.T. pillars — Reliability, Efficiency, Security, Traceability — are the minimum bar. Build them model-agnostic so when the next 40x price drop lands, you swap the model and keep everything that creates value.
Spin the flywheel from day one. Every interaction has to produce proprietary data — corrections from domain experts, edge cases caught by HITL, traces from production. Wire the data collection into the first version of the product, not the third. The companies that win at 24 months are the ones who started capturing signal at month one. NBIM saw 213,000 analyst hours of productivity gain not because Claude got smarter, but because the harness around it captured every correction the analysts made and the flywheel turned them into a better-tuned system over time.
Compete on integration, not intelligence. Let model labs race on benchmarks. Your fight is happening on a different axis: how deeply you integrate with the systems your customers already use (EHRs, iManage, payment rails, ledger systems), how well you handle the regulatory specificity of each jurisdiction (AML fines were up 417% in H1 202527 — regulatory complexity is structural advantage for data-rich incumbents), and how much of the operations stack you cover. Cursor reaches $50B on the same models as everyone else. Harvey and Legora both crossed $100M+ ARR while a model-provider plugin shipped against them in the same quarter. That’s the bet.
Price for outcomes, not seats. The value your product creates lives in the labor line of your customer’s P&L — analyst hours replaced at NBIM, underwriting cycles compressed 5× at AIG, fraud recovered to the tune of $917M at Stripe BFCM. Price against that, not against logins. Outcome-based pricing aligns your revenue with the flywheel: more usage means better outcomes means more revenue means more usage. Seat pricing decouples the two and caps your upside at exactly the point you most need it uncapped.
The Builder's Playbook
Five principles, with what to do, what to avoid, and how it's already playing out in the wild. Copy the checklist below before your next planning meeting.
Own the workflow, not the output
The model produces tokens. The workflow produces switching cost.
Do. Embed where ripping you out means ripping out the work — EHR, CRM, checkout, document workflow.
Don't. Ship a chat UI on top of someone else's system of record and call it a product.
Own the workflow, not the output
Procore in construction. Toast in restaurants. Veeva in life sciences. 300% pricing, under 3% churn.
Conclusion
The model layer is following classic commoditization: initial scarcity, rapid proliferation, price collapse, value migration. We’ve seen this movie before with CPUs, cloud compute, and storage. The plot doesn’t change — only the actors.
We opened with a question. If the model is the CPU, what is the actual computer? After running through the harness, the flywheel, finance, retail, and vertical SaaS, the answer takes a clearer shape:
- The motherboard is the harness — the deterministic runtime that wires the CPU into everything else, dispatches tools, manages context, and enforces the policy gateway between probabilistic reasoning and the real world.
- The memory is the data flywheel — short-term context, long-term episodic stores, and the proprietary behavioral graph that compounds with every interaction.
- The storage is the workflow integration layer — the EHR connections, the iManage hooks, the payment rails, the firm-specific playbooks that accumulate over years and can’t be ripped out without ripping out the work itself.
- The I/O controllers are the evals, guardrails, and observability — what turns a stochastic model into a system you can actually put in front of a regulated buyer.
And the architecture question — how do you build it for durable competitive advantage? — the post’s answer is the playbook above: own the workflow, build the harness first, spin the flywheel from day one, compete on integration, price for outcomes. None of those steps is about the model.
Platforms entering verticals face the same constraint: they ship the CPU, sometimes the motherboard, occasionally a reference design. They cannot ship decades of memory, storage, and I/O specific to your customers. They’ll capture the easy middle, but the edges — where the real money lives — belong to the companies that build the rest of the computer.
The real competition isn’t who has the best model. It’s who builds the system that gets smarter every day.
The best AI companies won’t have the best models. They’ll have systems that get smarter every day — because the harness captures what the model can’t.
References
Footnotes
-
Bloomberg, “BloombergGPT: A Large Language Model for Finance,” 2023; Bloomberg Terminal pricing and subscriber data, public filings and industry analysis 2024–2025; comparative benchmarks vs. GPT-4 on financial tasks ↩ ↩2
-
Epoch AI, “AI Price Performance Trends,” 2025-2026 ↩
-
Enterprise AI Adoption Survey, 2025 ↩
-
Gartner, “AI Infrastructure Cost Projections,” March 2025 ↩
-
Menlo Ventures, “Enterprise API Market Share Analysis,” Mid-2025 ↩
-
Sam Altman, OpenAI CEO, Public Statement, 2025 ↩
-
Satya Nadella, Microsoft CEO, Earnings Call, 2025 ↩
-
Nandan Nilekani, Infosys Co-Founder, Industry Conference, 2025 ↩
-
Anthropic, “Claude for Legal” repository (github.com/anthropics/claude-for-legal), May 12, 2026; 12 practice-area plugins, 80+ agents, ~20 MCP connectors, Managed Agents API. Coverage: Artificial Lawyer, LawSites, TechCrunch, Legal IT Insider, May 2026 ↩
-
TRAE (@Trae_ai), “AI Agent = SOTA Model (Wild Horse) + Harness (Control System) = An Elite Performer,” X, April 23, 2026 ↩ ↩2 ↩3 ↩4
-
Will Chen, “Mike — open-source legal AI platform” (mikeoss.com / github.com/willchen96/mike), May 2026; AGPL v3 license, BYO-API-key (Claude/Gemini/OpenAI). Coverage: Legal Futures, Legal IT Insider, Artificial Lawyer, Hacker News, May 2026 — feature-parity estimates from tester reports vs. Harvey/Legora ↩
-
Omar Khattab et al., “DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines,” arXiv:2310.03714, ICLR 2024 — Stanford NLP ↩
-
Matei Zaharia et al., “The Shift from Models to Compound AI Systems,” Berkeley BAIR Lab, February 2024 ↩ ↩2
-
Addy Osmani, “Agent = Model + Harness,” X / personal blog, May 2026 ↩
-
OpenDev Team, “OpenDev: A Terminal-Native Interactive Coding Agent,” arXiv:2603.05344, 2026 ↩
-
D. Sculley et al., “Hidden Technical Debt in Machine Learning Systems,” NeurIPS 2015 (papers.nips.cc/paper/5656) ↩
-
Harrison Chase, “What is a ‘cognitive architecture’?” LangChain Blog, 2024; term attributed to Flo Crivello. See also Sumers et al., “Cognitive Architectures for Language Agents,” arXiv:2309.02427, 2023 ↩
-
Terminal Bench 2.0 Leaderboard, 2026 — Claude Opus 4.6 cross-harness comparison ↩
-
Princeton HAL Leaderboard, 21,730-rollout study across 9 models, 2025-2026 ↩
-
Aparna Dhinakaran (Arize AI), “Cursor, Claude Code, Windsurf, and Codex Are All Harnesses,” X, April 22, 2026 ↩
-
Boris Cherny (Anthropic) vs. Jerry Liu (LlamaIndex), public debate; METR coding agent evaluation, 2025-2026 ↩
-
Hugging Face FineWeb team, “The Synthetic Data Playbook: Generating Trillions of the Finest Tokens” (huggingface.co/spaces/HuggingFaceFW/finephrase); see also Cosmopedia (huggingface.co/blog/cosmopedia), 25B synthetic tokens for pre-training via Mixtral-8x7B ↩
-
Capital flow analysis, AI-native vertical SaaS funding surge toward “invisible AI” (embedded intelligence), 2025–2026 ↩
-
Stripe, “Payments Foundation Model” announcement, May 2025; TechCrunch coverage and Cognitive Revolution interview with Emily Sands on transformer-based PFM trained on tens of billions of transactions, ~50,000 new transactions per minute, $1.4T annual volume ↩
-
Linas et al., “PRAGMA: Revolut Foundation Model,” arXiv:2604.08649, 2026; in-app AIR assistant rollout to 13M UK customers, April 9 2026 — encoder transformer family trained on ~40B banking events / 207B tokens from ~25M users ↩
-
McKinsey, “Agentic Commerce: Global Value Projection by 2030,” 2025-2026 ↩
-
AML/financial compliance fines analysis, H1 2025 — 417% YoY increase ↩
Was this helpful?
Let me know what you think!