"AI struggles with long-term reasoning and real-world reliability." — A common critique of current AI implementations, 2025

I would observe that many humans also struggle with consistent long-term reasoning and real-world reliability.

This isn't necessarily a defense of AI. It's an observation about cognition itself — one that highlights why many AI implementations encounter difficulties, and why certain approaches consistently yield remarkable results.

The 4-Hour Ceiling

Cal Newport calls it "Deep Work." The Draugiem Group calls its top performers "Elite Performers." Gretchen Rubin maps out distinct work styles. Across these lines of research, the same finding keeps surfacing: there is a discernible limit on how much high-quality cognitive work a human can perform in a single day.

Anders Ericsson's research on deliberate practice, popularized in books like "Outliers," found that intense deliberate practice tops out at roughly four hours a day. Stanford's John Pencavel found that productivity declines sharply beyond 55 hours of work per week; workers putting in 70 hours produced little more than those working 55. The Draugiem Group's studies of their top 10% of performers revealed a distinctive rhythm: about 52 minutes of intense work, then a 17-minute break. Notably, these elite performers often worked fewer total hours than average.

From personal experience, on my most productive days I can manage about four hours of genuine, focused reasoning before I need to stop and reset. Sustaining that intensity usually means little other meaningful work gets done for the rest of the day, and recovery time follows.

That focused, intermittent pattern produces significantly higher output: in past roles it yielded roughly three times the output of colleagues working more continuous schedules. The gains come from the constraint itself; they do not simply transfer to longer, unbroken stretches of work.

Two Approaches to Work

Gretchen Rubin's research sorts people by how they approach work: steady-paced "Marathoners," "Sprinters" who work in intense bursts with extended recovery, and "Procrastinators," who struggle to execute consistently at all. The meaningful contrast is between the first two, and it matters: Sprinters find that pressure clarifies their thinking and compresses their output.

The cultural tension is real, however: those who favor continuous work tend to read burst-mode practitioners as less committed, while burst-mode practitioners see the continuous model as inefficient. In most corporate environments, the continuous model is the default: presence is equated with commitment, and hours with effort.

Workplace research suggests that a significant share of workers report being overlooked for advancement because their skills or working style were misread. That perception gap is the cost of failing to value different approaches to productivity.

For the remainder of this piece, I'll use some shorthand. The prevailing culture that equates hours with output, presence with commitment, and busyness with productivity, I'll call "Traditional Continuous Workers." The elite performers who understand and leverage burst-mode cognition, I'll call "Burst-Mode Practitioners."

The AI Parallel

Now consider the parallel with AI. AI agents face a compounding-error problem: research shows that even a modest 1% error rate per step escalates into a roughly 63% failure rate on a 100-step task, and 95% accuracy per turn yields only about a 77% success rate after five turns. Demis Hassabis of DeepMind has warned that by 5,000 steps, compounded errors can render the final answer essentially random.
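The arithmetic behind those figures is easy to verify. Here is a minimal sketch of the compounding calculation, assuming each step succeeds or fails independently:

```python
# Compounding per-step reliability over a multi-step task,
# assuming steps fail independently (a simplification).
def end_to_end_success(per_step_success: float, steps: int) -> float:
    """Probability that every step in the chain succeeds."""
    return per_step_success ** steps

# 1% error per step over 100 steps -> ~63% chance the task fails somewhere.
print(1 - end_to_end_success(0.99, 100))   # ~0.634
# 95% accuracy per turn over 5 turns -> only ~77% end-to-end success.
print(end_to_end_success(0.95, 5))         # ~0.774
```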

Those who maintain skepticism about AI often highlight this issue, asserting, "AI simply cannot handle long-term reasoning."

But this perspective overlooks a crucial point: humans cannot "marathon" their way through complex cognitive tasks without errors either. Our natural mode is bursts of effort followed by resets. We break problems into manageable segments, take breaks, and allow time for integration and fresh perspective. To be clear, the analogy is about the process of effective work; it does not equate AI's current "understanding" with the common sense or flexible analogical reasoning inherent in human cognition.

Traditional Continuous Workers are building AI workflows the way they work — marathon sessions, context stuffing, no deliberate resets. They're optimizing for time in seat, not quality of reasoning cycles. And then they're surprised when the AI degrades. They are, in essence, demanding a level of continuous, error-free performance from AI that doesn't consistently exist even in human cognition.

The Proof

In late 2025, a paper called MAKER demonstrated something remarkable: the first system capable of completing over one million LLM steps with zero errors. This wasn't achieved through marginal improvements in accuracy but through a fundamentally different approach.

How was this accomplished? Through "extreme decomposition + error correction at each step," a methodology termed "massively decomposed agentic processes" (MDAPs). Instead of relying on more powerful models attempting to grind through complex, monolithic tasks, the MAKER system utilized simpler models coupled with rigorous auditing and verification at every step. This approach mirrors the Burst-Mode Practitioner's strategy, applying it to AI. It involves breaking down tasks into atomic units, verifying outcomes at each boundary, and deliberately resetting context to allow simple, focused execution to compound into complex, reliable outcomes.
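To make the shape of that approach concrete, here is a rough sketch of a decompose-execute-verify loop in the spirit of MDAPs. It is my own illustration under simplifying assumptions, not the MAKER system's actual implementation; call_model and verify_step are hypothetical stand-ins.

```python
# Rough sketch of a massively decomposed agentic process (MDAP)-style loop:
# small atomic subtasks, a fresh context per attempt, and an audit at every
# step boundary. Illustrative only; not the MAKER paper's implementation.
from typing import Callable

def run_decomposed_task(
    subtasks: list[str],
    call_model: Callable[[str], str],         # executes one atomic subtask in a clean context
    verify_step: Callable[[str, str], bool],  # audits one result; True means it passes
    max_retries: int = 3,
) -> list[str]:
    results: list[str] = []
    for subtask in subtasks:
        for _ in range(max_retries):
            candidate = call_model(subtask)      # deliberate reset: no accumulated context
            if verify_step(subtask, candidate):  # error correction at the step boundary
                results.append(candidate)
                break
        else:
            # Surface the failure instead of letting errors compound silently.
            raise RuntimeError(f"Subtask failed verification: {subtask!r}")
    return results
```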

While groundbreaking, it's fair to acknowledge that extreme decomposition is not feasible for every complex, real-world problem. Not every task lends itself to granular breakdown without losing critical holistic context or introducing artificial constraints. The MAKER paper's zero-error benchmark, while powerful, may not transfer to domains where the task definition is highly fluid or subjective.

Indeed, I'm applying these principles directly: I'm currently writing a book, produced with custom code on my computer by a multistep, self-correcting swarm of agents, each with its own role.

I also have a colleague who applies this religiously in his work. He uses Gemini Flash, one of the fastest and most cost-effective models available, for 100% of his coding tasks, choosing it over heavier models like Opus or GPT-4. He can do this because he has meticulously crafted a persona and workflow that force the model to simplify the problem and plan extensively before generating any code. With proper decomposition and verification at the critical junctures, the need for "smarter" or larger foundation models largely disappears.
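As a sketch of what such a plan-first setup can look like: the prompts and helper function below are my own hypothetical illustration, not my colleague's actual persona or workflow.

```python
# Hypothetical plan-first workflow: force the model to simplify and plan before
# any code is written, then implement one small step at a time.
def plan_first(task: str, call_model) -> list[str]:
    plan = call_model(
        "You are a careful engineer. Do not write code yet. "
        "Break this task into the smallest independent steps, one per line:\n" + task
    )
    reviewed = call_model(
        "Review this plan for missing steps or hidden complexity, "
        "then return the corrected step list, one per line:\n" + plan
    )
    outputs = []
    for step in reviewed.splitlines():
        if step.strip():
            outputs.append(call_model("Implement only this step, nothing more:\n" + step))
    return outputs
```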

When I've suggested this pattern of auditing at every step in my own work, it has sometimes been dismissed as "inefficient" or "too much overhead." The prevailing sentiment is often to simply "let the AI run."

That critique, while often short-sighted, does touch on valid practical concerns. Auditors at every step introduce their own challenges. If the auditors are themselves AI, how is their reliability guaranteed, especially on subjective or ambiguous steps? The reliability problem can simply shift from the primary model to the auditing model, raising the question of who audits the auditor. Beyond the immediate token spend, there is substantial engineering complexity, upfront development cost, and ongoing maintenance in building and orchestrating generalizable auditing across massively decomposed agentic processes, particularly in dynamic systems. These setups demand careful orchestration, state synchronization, context propagation, rigorous debugging, and clear error definitions to guard against unintended emergent behavior. And defining what counts as an "error" becomes ambiguous in highly subjective or open-ended tasks, where locally error-free steps can still add up to a globally suboptimal or misaligned outcome. I address these questions in my upcoming whitepaper, "The Radical Simplicity Method."

Yet one must ask: what does "inefficient" mean when the alternative is compounding errors into guaranteed project failure? The mathematics are clear: paying twice as much per step to reach 90%+ end-to-end reliability is a far better outcome than a cheaper pipeline whose per-step errors compound to roughly 37% success, as in the 100-step example above. The real optimization target isn't token expenditure; it's correct, reliable outcomes.
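One way to frame that trade-off, using the purely illustrative numbers above, is expected cost per successful completion rather than cost per run:

```python
# Expected cost per *successful* completion = cost per run / success rate.
# Numbers mirror the illustrative example above; they are not benchmarks.
def cost_per_success(cost_per_run: float, success_rate: float) -> float:
    return cost_per_run / success_rate

print(cost_per_success(2.0, 0.90))  # ~2.22: pricier per run, cheaper per correct outcome
print(cost_per_success(1.0, 0.37))  # ~2.70: cheaper per run, pricier per correct outcome
```

On these assumptions, the "expensive" audited pipeline wins on cost per correct outcome even before counting the downstream damage of shipped failures.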

The Nuanced Truth

Those who focus on AI's current limitations are indeed correct on one point: present-day AI implementations can be unreliable for complex, long-horizon tasks. However, their diagnosis of the underlying cause is often mistaken.

The fundamental limitation may not lie in AI's inherent intelligence. Rather, it often resides in how AI is deployed — frequently by individuals who may not fully grasp the principles of burst-mode cognition because they haven't consistently applied such patterns in their own high-performance work.

Traditional Continuous Workers tend to design AI workflows in a manner consistent with their own work habits: lengthy, continuous sessions, excessive context stuffing, and an absence of deliberate resets. When the AI's performance inevitably degrades under these conditions, the technology itself is often blamed.

Conversely, Burst-Mode Practitioners — including high-output developers with decades of experience in systems integration — consistently achieve results with AI that can appear almost impossible to others. This is not necessarily because they possess access to superior AI models. It is because they deeply understand and apply how high-performance cognition truly functions.

They consciously build deliberate context checkpoints into AI workflows. They ensure clean handoff artifacts between processing sessions. They implement robust auditing mechanisms at every critical step. Their architectural preference leans towards burst-mode processing rather than continuous, uninterrupted streams. Fundamentally, they treat AI with the same strategic intensity, focused effort, and necessary resets that they apply to their own most demanding cognitive tasks.
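As a sketch of what a clean handoff artifact between burst-mode sessions might look like: the schema and field names below are my own assumption, not a standard or any particular team's format.

```python
# Illustrative handoff artifact between burst-mode AI sessions: the next session
# starts from this compact, verified summary instead of the full prior transcript.
from dataclasses import dataclass, field

@dataclass
class HandoffArtifact:
    goal: str                                             # the unchanged overall objective
    completed: list[str] = field(default_factory=list)    # verified, audited results only
    open_items: list[str] = field(default_factory=list)   # what the next burst should tackle
    constraints: list[str] = field(default_factory=list)  # decisions that must not be revisited

    def to_prompt(self) -> str:
        """Render the checkpoint as the entire context for a fresh session."""
        return (
            f"Goal: {self.goal}\n"
            f"Verified so far: {'; '.join(self.completed) or 'nothing yet'}\n"
            f"Do next: {'; '.join(self.open_items)}\n"
            f"Constraints: {'; '.join(self.constraints) or 'none'}"
        )
```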

The Outlook

Critics who highlight AI's limitations may continue to encounter disappointment if their expectations remain rooted in demands for marathon-like, continuous performance from a technology that — much like human cognition — often functions most effectively in strategic bursts coupled with verification.

Those who adhere to Traditional Continuous Worker methodologies may continue to build AI systems prone to failure, often attributing blame to the models, vendors, or perceived hype surrounding the technology.

Meanwhile, a smaller, focused group of Burst-Mode Practitioners will likely continue to achieve seemingly magical results. This is largely because they have already internalized and applied the very principles that research like the MAKER paper has now rigorously demonstrated with million-step benchmarks.

The core message remains clear: the constraint is the feature. The reset is a fundamental strategy. And decomposition is itself a form of intelligence.

AI may not require a 24-hour break, but it undeniably benefits from deliberate resets after periods of uninterrupted processing. Just like you do. Just like every elite human performer does.

The pivotal question is not whether AI is capable of long-term reasoning. It is whether the individuals designing and implementing AI systems truly comprehend the underlying mechanisms of effective reasoning — both human and artificial. Many are still learning.


Stephen has spent 25+ years in systems integration across Financial Services, Oil & Gas, Insurance, Automotive, and Technology sectors. He specializes in fixing projects that larger consulting firms couldn't complete. He operates through D.E. Consulting and Research Inc., and is currently writing a book, which he is producing using custom code on his computer with a multistep, self-correcting swarm of agents, each with their own role.