Dispatch · AI & Autonomy

The Hard Horizon


AI agents can now work on their own for hours, and the length of task they can finish is doubling every few months. But a wave of brutal new benchmarks just measured exactly how far away reliable autonomy still is.

June 26, 2026 · Lisa Pedrosa · 10 min read · AI & Agents
[Figure: task length over time, trending toward full autonomy]

Ask a frontier AI agent to fix a bug in a small program and it will likely do it before you finish your coffee. Ask it to build a working Slack clone from scratch, or rewrite a sprawling machine-learning codebase from one framework into another, or implement a C compiler over the course of a billion tokens of sustained effort — and something else happens. It starts strong, drifts, contradicts itself, forgets what it decided an hour ago, and quietly falls apart. In June 2026, researchers finally built the benchmarks to measure that collapse precisely. The results are the most honest picture yet of where autonomous AI actually stands.

The story of the past year was supposed to be the arrival of the agent: AI that stops answering questions and starts doing work. By most accounts it arrived. Coding agents now write large fractions of production software at major labs. The connective tissue that lets agents use tools has been installed tens of millions of times. The single most-cited statistic in agent research is also the most optimistic: the length of task an AI can complete on its own has been doubling roughly every seven months, and recent measurements put the doubling time closer to 196 days. Extend that line and full autonomy looks inevitable, and soon. The new benchmarks were built to test whether the line tells the truth.

~5 hrs: longest stretch a frontier agent can work autonomously on some tasks
196 days: time for autonomous task length to double
1B: token budget in the hardest marathon benchmarks
2.6%: average pass rate on the hardest benchmark tiers

The metric that made everyone optimistic

The doubling curve comes from a now-famous way of measuring AI progress: instead of asking whether a model can do a task, ask how long a task it can complete as reliably as a human would. Score a model not by accuracy but by time horizon: the duration of work it can carry on its own before it needs a person to step in. Measured this way, frontier systems have climbed from tasks that take a person seconds, to minutes, to a few hours, along a remarkably steady exponential curve.

That framing is powerful because it converts a fuzzy question into a forecast. If the horizon doubles every six or seven months, then a model that handles a four-hour task today handles a workweek's worth in a couple of years and a month-long project not long after. The era of long-horizon agents, the optimists say, is here. And on the narrow terms of the metric, they are not wrong. What the metric hides is the texture of failure — the difference between a task that takes a long time and a task that demands sustained coherence the whole way through.
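The arithmetic behind that forecast is easy to check. The sketch below is illustrative and rests on assumptions the article does not state: a 40-hour workweek, a 160-hour working month, and a constant doubling time of 196 days (the figure cited earlier). The function name `days_to_reach` is hypothetical.

```python
import math

DOUBLING_DAYS = 196  # doubling time cited in the article


def days_to_reach(target_hours, start_hours=4.0, doubling_days=DOUBLING_DAYS):
    """Days until the horizon grows from start_hours to target_hours,
    assuming clean exponential growth (an idealization)."""
    doublings = math.log2(target_hours / start_hours)
    return doublings * doubling_days


if __name__ == "__main__":
    # The 40 h and 160 h targets are assumptions, not figures from the article.
    targets = [("workweek (40 h)", 40), ("work month (160 h)", 160)]
    for label, hours in targets:
        days = days_to_reach(hours)
        print(f"{label}: {days:.0f} days, about {days / 365:.1f} years")
```

Under those assumptions the workweek horizon arrives in roughly 1.8 years and the month-long one in roughly 2.9, in line with "a couple of years" and "not long after." The calculation says nothing about how reliably tasks of that length would be completed.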

"A task that takes five hours and a task that requires five hours of unbroken judgment are not the same problem. Agents have conquered the first and barely touched the second."
The distinction the marathon benchmarks were built to expose

The marathons arrive

June 2026 marked a deliberate shift in how the field tests its agents. Out went the quick, single-shot puzzles; in came the marathons. SWE-Marathon asks coding agents to remain coherent across colossal projects on token budgets reaching a billion — building entire applications, porting massive codebases between frameworks, implementing compilers from first principles. SentinelBench targets long-running monitoring agents, the kind meant to watch a system for hours or days and act only when something is genuinely wrong. LeanMarathon pushes agents to formalize mathematics in the Lean proof language over extended runs, where a single misstep compounds into pages of invalid reasoning.

These are not trick questions. They are scaled-up versions of exactly the work agents are being sold to do. And on the hardest tiers, mainstream agent harnesses paired with frontier model backbones achieve an average full pass rate of just 2.6 percent. Not 26 percent. Two point six. The agents that breeze through a contained bug fix are, on a real long-horizon job, almost always failing somewhere along the way — and a long task that fails at the ninety-percent mark has failed completely.

The benchmarks did not show that agents are useless. They showed that the gap between "can work for five hours" and "can be trusted for five hours" is enormous — and that the second number is the one that matters for autonomy.

Why coherence is the real wall

The reason long tasks break agents is not that the model runs out of capability. It is that it runs out of coherence. A long-horizon job is a chain of thousands of small decisions, each depending on the ones before. A human carries the thread of intent — what we are building, why, what we already ruled out — almost effortlessly. An agent carries it in a finite context window and a fallible memory, and small errors do not cancel out. They accumulate. A wrong assumption made in the first hour quietly poisons the third. Researchers describe the phenomenon as the "hard horizon": the point past which adding more time does not add more progress, because the system can no longer hold its own work together.
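One way to see why small errors matter so much is a toy model. The sketch below assumes, purely for illustration, that a long task is a chain of independent steps that each succeed with the same probability, and that a single failure sinks the whole job. Real agents can sometimes recover from mistakes, and the benchmarks do not report per-step figures, so these numbers are not measurements. The names `chain_success` and `required_step_reliability` are hypothetical.

```python
def chain_success(step_reliability, steps):
    """Probability that every step in a chain succeeds, assuming
    independent steps with identical reliability (a simplification)."""
    return step_reliability ** steps


def required_step_reliability(target_success, steps):
    """Per-step reliability needed to reach target_success over the chain."""
    return target_success ** (1.0 / steps)


if __name__ == "__main__":
    # 99.9% per-step reliability is an assumed value for illustration.
    for steps in (10, 100, 1000):
        overall = chain_success(0.999, steps)
        print(f"99.9% per step over {steps} steps: {overall:.1%} overall")

    need = required_step_reliability(0.9, 1000)
    print(f"per-step reliability for 90% over 1000 steps: {need:.5f}")
```

In this toy model an agent that is right 99.9 percent of the time per step finishes a thousand-step job only about 37 percent of the time. It is one possible reading of why short-task results can look strong while marathon pass rates stay low, not an account of how the benchmarks score runs.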

This is why the optimistic doubling curve and the grim 2.6 percent can both be true. The curve measures the length of task an agent can sometimes complete. The marathons measure whether it can reliably complete the kind of task businesses actually want automated. One is a record of best-case reach; the other is a record of dependable grip. Progress on the first has been genuine and fast. Progress on the second is the thing the whole field is now, finally, measuring.

[Chart (illustrative): reliable completion rate by task length. Categories: short bug fix, multi-step, marathon, hardest tier. Labeled values: high, ~70%, low, 2.6%.]
Reliability falls off a cliff as tasks lengthen: the shape of the hard horizon.

The economics depend on the second number

None of this means the agent revolution is a mirage. The economic case is real: analysts at McKinsey estimate AI-driven automation could unlock between $2.6 and $4.4 trillion in annual value, and an agent that can run ten or a hundred tasks a day at trivial marginal cost genuinely changes what is worth doing at all. But almost all of that value lives behind reliability, not reach. A coding agent that finishes nine projects flawlessly and silently corrupts the tenth is not a nine-tenths solution; it is a liability that requires a human to check everything, which is most of the cost the automation was supposed to remove.
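The nine-in-ten example can be put in rough numbers. The sketch below is a back-of-the-envelope model, not an analysis from any of the cited sources: every figure in it (task value, review cost, cost of an uncaught failure) is a made-up placeholder, and `net_value_per_task` is a hypothetical name.

```python
def net_value_per_task(value, failure_rate, review_cost, failure_cost, review_all):
    """Expected net value of one automated task.

    review_all=True:  a human checks every output and catches failures,
                      so each task pays the review cost.
    review_all=False: nothing is checked, so each failure slips through
                      and costs failure_cost.
    """
    delivered = value * (1 - failure_rate)
    if review_all:
        return delivered - review_cost
    return delivered - failure_rate * failure_cost


if __name__ == "__main__":
    # All numbers are illustrative assumptions in arbitrary currency units.
    params = dict(value=100, failure_rate=0.1, review_cost=70, failure_cost=2000)
    print("review everything:", net_value_per_task(review_all=True, **params))
    print("review nothing:   ", net_value_per_task(review_all=False, **params))
```

With these placeholder values, checking everything leaves only a fifth of the nominal value per task, and checking nothing turns the expected value negative. Different inputs give different answers, but the structure shows why a modest silent-failure rate can consume most of the savings.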

This is the quiet recalibration underway in mid-2026. The first wave of agent enthusiasm treated long-horizon autonomy as a capability that had effectively arrived. The marathon benchmarks reframe it as a reliability problem that is still mostly unsolved — and reliability problems are often harder, and slower, than capability problems. They are not fixed by a bigger model alone. They need better memory, better self-correction, better ways for an agent to notice it has lost the thread and recover, and better scaffolding around the model rather than just inside it.

"Full autopilot is not here. Long-horizon agents are. The distance between those two sentences is the whole story of 2026."
On the state of autonomous AI

What to watch from here

The honest read is also the most interesting one. We are not at the end of the agent story, and we are not as far along as a straight line through the doubling curve suggests. The reach is climbing fast and the grip is lagging behind, and the new benchmarks exist precisely so the field can watch the gap close — or refuse to. If the 2.6 percent on the hardest marathons rises sharply over the next year, the optimists are right and full autonomy really is a matter of scaling. If it stays stubbornly low while the doubling curve keeps climbing, it will mean we have been measuring the wrong thing, and the real frontier was never how long an agent could run but whether it could be trusted for the duration.

Either way, the hype phase is over and the measurement phase has begun. That is usually the moment a technology stops being a promise and starts becoming an engineering discipline. The agents are getting longer. The work now is to make the long ones true.

Sources

  1. Medium / evoailabs — The Hard Horizon: Why Frontier AI Agents Are Stalling on Real-World Workflows
  2. DEV Community — Long-Horizon Agents Are Here. Full Autopilot Isn't
  3. Prosus — State of AI Agents 2026: Autonomy is Here
  4. Cogitx — AI Agents: Complete Overview (2026)
  5. Symphony Solutions — AI Agents in 2026: The Future of Autonomous Software
  6. arXiv — An Economy of AI Agents
  7. arXiv — From Logic Monopoly to Social Contract: Institutional Foundations for Autonomous Agent Economies
  8. arXiv — International AI Safety Report 2026
  9. METR — research on measuring AI ability via the length of tasks agents can complete (task-horizon doubling)
  10. McKinsey & Company — estimates of annual economic value from AI-driven automation ($2.6T–$4.4T)