Why most AI pilots fail — and what the exception does differently

The numbers are grim, but the cause is encouraging: pilots rarely die on the model, and almost always on integration, problem selection and the leap to production.

24 June 2026 · 10 min read · Production AI

In short

“95% of AI pilots fail” is the most-cited — and most-misread — AI statistic of 2025. It comes from a preliminary MIT study and is about return on custom AI, not “AI doesn’t work”. Generic tools like ChatGPT actually reach 80%+ adoption. The gap is between pilot and production — in integration, not in the model. That’s bad news if you’re running an experiment, and good news if you build production-first.

What “95% of AI pilots fail” actually means

In late 2025 one figure travelled the world: 95% of AI pilots fail. It comes from The GenAI Divide: State of AI in Business 2025, a preliminary report (v0.1, July 2025) from researchers affiliated with MIT (Project NANDA). The precise finding: 95% of organisations saw no measurable return on custom or enterprise GenAI, and only ~5% of those pilots reached production with demonstrable P&L impact — despite some $30–40 billion in spend.

Read the number carefully, because it is routinely misquoted. It does not mean “AI doesn’t work” and it is not evidence of a bubble. The denominator is custom and enterprise initiatives; generic tools like ChatGPT and Copilot reach 80%+ adoption with broad productivity gains the report deliberately excludes from its P&L measure. The researchers are explicit: the dividing line is set by approach — integration and learning — not by model quality or regulation.

POC purgatory: the gap between pilot and production

The real problem isn’t building a pilot, but the leap that follows. That same MIT data shows an unforgiving funnel for custom GenAI:

60% evaluate custom GenAI
20% reach the pilot stage
5% reach production
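
To make the drop-off concrete, the funnel figures quoted in this article (60%, 20% and 5%) can be turned into stage-to-stage conversion rates. This is a minimal sketch, not the report's own calculation: it assumes all three percentages share the same denominator (all organisations surveyed), and the names `funnel` and `stage_conversion` are illustrative.

```python
# Funnel shares as quoted in this article, in percent of organisations.
# Assumption: all three figures use the same base (all organisations surveyed).
funnel = [
    ("evaluate custom GenAI", 60),
    ("reach the pilot stage", 20),
    ("reach production", 5),
]


def stage_conversion(stages):
    """Return (from_stage, to_stage, rate) for each adjacent pair of stages."""
    return [
        (from_name, to_name, to_share / from_share)
        for (from_name, from_share), (to_name, to_share) in zip(stages, stages[1:])
    ]


conversions = stage_conversion(funnel)
for from_name, to_name, rate in conversions:
    print(f"{from_name} -> {to_name}: {rate:.0%}")
```

Under that assumption, roughly one in three evaluations becomes a pilot and one in four pilots reaches production. Both steps are steep, which fits the point that the losses continue after the pilot is built.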

Gartner sees the same pattern on a different yardstick: in one survey, only 48% of AI projects on average reach production at all, and the road from prototype to production takes around eight months. Gartner also predicted that at least 30% of GenAI projects would be abandoned after the proof of concept by the end of 2025; later Gartner figures suggest the real share comfortably exceeded that threshold (over 50%). Projects die between pilot and production.

Why pilots die: the ‘learning gap’

So why do they die there? The MIT report is emphatic: “The core barrier to scaling is not infrastructure, regulation, or talent. It is learning.” Most GenAI systems don’t retain feedback, adapt to context, or improve over time. As one interviewee put it: it doesn’t learn from our feedback and repeats the same mistakes. For complex, high-stakes work users therefore prefer humans by 9 to 1.

A striking contrast shows that the technology itself works. Only ~40% of companies bought an official AI subscription, yet at over 90% of the companies surveyed, employees use personal AI tools for work every day: the so-called “shadow AI economy”. Informal use often delivers more than the formal initiatives. The uncomfortable conclusion: it isn’t the models that stall, it’s how organisations procure and integrate them.

It’s not the model, it’s the organisation

RAND examined this from the other side, interviewing 65 data scientists and engineers (RR-A2680-1, 2024). The result: 84% pointed to leadership-driven causes as the primary reason AI projects fail. Top of the list is misunderstanding or miscommunicating the problem the AI is meant to solve — which, per RAND, causes more failures than any other factor.

RAND’s five leading causes, in order, are predominantly organisational — not technical:

The problem is misunderstood or miscommunicated.
There is too little data to train the model.
Teams chase the latest technology instead of a real user problem.
Data and deployment infrastructure falls short.
AI is applied to a problem it can’t (yet) handle.

The often-quoted line that “more than 80% of AI projects fail — about twice the rate of non-AI IT projects” is, incidentally, an industry estimate RAND cites, not one it measured itself. Treat it as direction, not exact law. The throughline holds: failure is mostly about problem selection, data and workflow — exactly what an embedded engineer addresses — not about model intelligence.

The cost-and-value trap

Gartner names four recurring reasons a proof of concept collapses: poor data quality, inadequate risk controls, escalating costs and unclear business value. Cost in particular is underestimated: per Gartner, an ambitious effort that aims to transform the business model runs from $5 million to $20 million, and the costs are less predictable than for other technologies.

On top of that sits a classic misallocation. Per MIT, around 70% of AI budgets went to sales and marketing (the visible demos), while the higher return sat in duller back-office automation. Start where the noise is loudest rather than where the return is, and you build a pilot that was never worth scaling.

Hype versus reality: agentic AI and ‘agent washing’

The next wave, agentic AI, risks repeating the same mistake on a bigger scale. Gartner predicts that over 40% of agentic AI projects will be cancelled before the end of 2027, again because of escalating costs, unclear value and inadequate risk controls. Gartner also estimates that of the thousands of “agentic” vendors only around 130 are real; the rest engage in “agent washing”: rebranding existing assistants, RPA and chatbots without genuine agentic capability.

At the same time, adoption is not success. Gartner expects 80%+ of enterprises to use GenAI in 2026 — but “calling an API” is not the same as capturing return. To be fair: Gartner is also optimistic longer term (towards 2028 it expects 15% of day-to-day decisions to be made autonomously and 33% of enterprise software to be agentic). The technology marches on. The question is whether your effort is the exception that reaches production.

What the exception does differently

The ~5% that do make it consistently do a few things differently. It isn’t luck; it’s a method:

Production-first, not pilot-first: a production deployment with measurable KPIs from day one, not an experiment that ‘might’ go live later.
Embedded to get the problem right: the biggest failure cause (RAND) is misunderstanding the problem. Sitting next to the users removes that error at the source.
Integration into the real workflow, with a feedback loop that lets the system learn — exactly the ‘learning gap’ MIT identifies as the breaking point.
Foundations over model: Gartner found successful organisations invest up to 4x more (as a % of revenue) in data quality, governance, AI-ready people and change management.
Start where the return is (often the back office), not where the demo looks best.

That is precisely the Forward Deployed model I work by: embedded in your business, production-first, with human oversight built into the design — no report, no endless pilot, but a working system within 90 days. Not because 90 days is a gimmick, but because most pilots don’t die for lack of time; they die because they were never set up as production systems.

Honestly: what these numbers do and don’t prove

An article about misquoted numbers shouldn’t cherry-pick its own. So here are the caveats:

The MIT report is preliminary (v0.1) and self-published, with a modest, partly self-selected sample — and it advocates agentic AI with memory as the fix, exactly what Project NANDA itself builds. Read it as direction, not a final verdict.
Almost all Gartner figures are predictions, not measured outcomes — and Gartner is simultaneously bearish near-term and bullish longer-term.
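‘Buy beats build’ (MIT’s finding that externally sourced or partnered tools reached production more often than fully in-house builds) is correlation within a small sample; the report itself warns it doesn’t prove causation.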
Success is measured over different windows. A 90-day engagement proves some things immediately; other returns compound only afterwards. I’d rather be honest about that than over-promise.

Where to start — for healthcare and finance

For healthcare and financial-services organisations — where decisions carry weight and users prefer humans by 9 to 1 — all of this applies double. Don’t start with a broad “AI programme”, but with one sharply scoped, measurable problem in the back office where the return is. Get the data and governance in place first (GDPR, and sector rules like the EU AI Act, MDR or DORA where they apply), integrate into the existing workflow, and build human oversight into every decision that matters. One working system that reaches production is worth more than five pilots stuck in POC purgatory.

Have an AI pilot that keeps stalling, or want to start one that actually reaches production? In a Discovery I map out, in 2–4 weeks, which problem is worth solving, whether the data and workflow are ready, and what the path to production looks like — including an honest “no AI here yet” list.

Frequently asked questions

Is it true that 95% of AI projects fail?

The figure comes from a preliminary MIT study (Project NANDA, July 2025) and means something more specific than ‘AI doesn’t work’: 95% of organisations saw no measurable return (P&L impact) on custom or enterprise GenAI. Generic tools like ChatGPT and Copilot, by contrast, hit 80%+ adoption. The problem isn’t the technology — it’s scaling to production.

Why do most AI pilots fail?

Rarely on the model. According to RAND (2024) the leading causes are organisational: the problem is misunderstood or miscommunicated, there is too little or poor-quality data, teams chase the latest technology instead of a real problem, the infrastructure to deploy is missing, or AI is applied to a problem it can’t (yet) handle. MIT adds a ‘learning gap’: systems that don’t learn from feedback and so stall.

What is ‘POC purgatory’?

The stage where a proof of concept gets stuck and never reaches production. In the MIT study, 60% of organisations evaluate custom GenAI, 20% reach a pilot, but only 5% reach production. The leap from pilot to production — not building the pilot — is where it goes wrong.

Is it better to build or buy AI?

MIT found that externally sourced or partnered tools made up ~66% of successful deployments versus ~33% for fully in-house builds — and that partnership pilots were ~2x more likely to reach production. Important: that’s correlation within a small sample, not proof of cause. The lesson isn’t ‘always buy’, it’s: pair external delivery capability with internal context.

How do you stop an AI pilot from stalling?

Build production-first (measurable KPIs from day one), pick one sharply scoped problem, get the data and deployment foundations in place first, integrate into the real workflow with a feedback loop, and keep humans in control of decisions that matter. In short: treat it as a production system, not an experiment.
