
Analysis: Capabilities, Limitations, and Premature Patterns of AI Agents


Related articles: AI Agent Design Guide | Building Blocks for AI Agents | 11 Multi-Agent Orchestration Patterns

Single and Multi-Agent Systems Based on LLMs


1. Introduction

1.1 Analysis Objective

This analysis identifies, classifies, and evaluates patterns related to AI agents based on Large Language Models (LLMs) that appear promising but whose feasibility remains structurally limited with current architectures. The goal is to provide a critical, falsifiable, and non-speculative framework for evaluating the real capabilities of agentic systems.

1.2 Scope

Included:

Strictly excluded:

1.3 Operational Definitions

AI Agent: LLM-based system capable of perceiving a state, producing local reasoning (token generation), and triggering actions via explicit orchestration (code, API, tools).

Multi-Agent System: Set of agents coordinated by an explicit protocol. Multi-agent does not imply any emergent collective intelligence by default; any appearance of superior coordination comes from the orchestrator or communication protocol.

Scaling: Increase in model parameters, training data, or inference compute.
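
To make the “AI Agent” definition above concrete, here is a minimal sketch of the perceive-reason-act loop it describes. All names are hypothetical: `llm` stands for any text-generation callable, and `perceive`/`act` for ordinary code supplied by the orchestrator; note that the stop rule lives in code, not in the model.

```python
from typing import Callable, Dict

def run_agent(
    llm: Callable[[str], str],        # any text-generation callable (assumption)
    perceive: Callable[[], str],      # returns a textual description of the current state
    act: Callable[[str], str],        # executes the proposed action, returns an observation
    max_steps: int = 5,
) -> Dict[int, str]:
    """Perceive-reason-act loop: the LLM only generates text, the loop is plain code."""
    trace: Dict[int, str] = {}
    for step in range(max_steps):
        state = perceive()                                  # perception: ordinary code
        proposal = llm(f"State:\n{state}\nNext action?")    # local reasoning: token generation
        observation = act(proposal)                         # action: explicit orchestration
        trace[step] = f"{proposal} -> {observation}"
        if "DONE" in observation:                           # stop rule decided by code, not the model
            break
    return trace
```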

1.4 Summary of Key Findings

Critical Observations

  1. Scaling does not solve structural limitations: Increasing model size improves factual knowledge and linguistic fluency but does not correct the absence of autonomous planning, causal reasoning, or deep logical understanding (Kambhampati et al., 2024; Valmeekam et al., 2025).
  2. Multi-agent systems fail in 41-87% of cases: The MAST study (Cemri et al., 2025) identifies 14 distinct failure modes, 79% of which stem from specification and coordination problems, not technical infrastructure limitations.
  3. Self-correction without external feedback is illusory: Agents cannot detect their own errors without an external deterministic verifier (compiler, test, oracle). Self-criticism increases confidence without improving accuracy (Huang et al., 2024; Stechly et al., 2024).
  4. Multi-agent does not outperform single-agent on most benchmarks: Performance gains are marginal and often inferior to simple approaches like best-of-N sampling (Kapoor et al., 2024; Wang et al., 2024).
  5. “Emergent capabilities” are metric artifacts: Apparent qualitative jumps during scaling result from non-linear metric choices, not real cognitive phase transitions (Schaeffer et al., 2023).

Viable vs. Premature Patterns

| Category | Viability | Examples |
|---|---|---|
| Functional | Tasks where feedback is deterministic | Unit tests, code translation, SQL |
| Fragile | Dependency on specific prompts | RAG, self-consistency, ReAct |
| Premature | Promise without robustness proof | Universal agent, autonomous planning |
| Structurally impossible | Contradiction with architecture | Self-verification, intrinsic causal reasoning |

1.5 Design Recommendations

  1. Prefer closed loops: Agentic success requires an external deterministic verifier (compiler, simulator, automated test).
  2. Limit scope per agent: Effective agents operate in narrow, well-defined domains, not as “generalists”.
  3. Treat multi-agent as orchestration, not collective intelligence: The real “locus of decision” is the orchestrator (often Python code), not the agents themselves.
  4. Assume 85-90% as reliability ceiling: For the last 10%, invest in human supervision rather than model augmentation.
  5. Document hidden dependencies: Any “autonomous” system must make explicit its implicit human dependencies (prompt engineering, data selection, validation).

2. SUMMARY TABLES OF ANALYZED PATTERNS

Legend of Categories

| Code | Meaning |
|---|---|
| A | Demonstrated and robust capability |
| B | Conditional / fragile capability |
| C | Identified structural limitation |
| D | Documented failure |
| E | Premature promise / overinterpretation |

2.1 Planning and Reasoning Patterns

| # | Pattern | Type | Category | Verdict |
|---|---|---|---|---|
| 1 | Autonomous planning | Single | C/D | Structurally impossible |
| 2 | Self-verification | Single | C | Structurally impossible |
| 3 | Causal reasoning | Single | C | Premature |
| 4 | Chain-of-Thought | Single | B | Fragile |
| 5 | Automatic backtracking | Single | D | Structurally impossible |
| 6 | Iterative reflection | Single | B/C | Fragile |
| 7 | Self-Consistency | Single | B | Achievable with reservations |
| 8 | Tree-of-Thought | Single | B | Fragile |

2.2 Multi-Agent Patterns

| # | Pattern | Type | Category | Verdict |
|---|---|---|---|---|
| 9 | Multi-agent debate | Multi | D/E | Premature |
| 10 | Emergent collective intelligence | Multi | E | Illusion |
| 11 | Role specialization | Multi | B/C | Fragile |
| 12 | Autonomous coordination | Multi | D | Premature |
| 13 | Self-organization | Multi | D | Structurally impossible |
| 14 | Cross-verification | Multi | C/D | Fragile |
| 15 | Multi-agent consensus | Multi | D | Illusion |

2.3 Memory and Context Patterns

| # | Pattern | Type | Category | Verdict |
|---|---|---|---|---|
| 16 | Autonomous long-term memory | Single/Multi | C | Premature |
| 17 | RAG (Retrieval-Augmented) | Single | A/B | Achievable with reservations |
| 18 | Extended context (>100k tokens) | Single | B/C | Fragile |
| 19 | Autonomous memory update | Single | D | Premature |

2.4 Tools and Execution Patterns

| # | Pattern | Type | Category | Verdict |
|---|---|---|---|---|
| 20 | Simple tool-use | Single | A | Achievable |
| 21 | Sequential tool-use (>3 tools) | Single | B/C | Fragile |
| 22 | Self-debugging with compiler | Single | A/B | Achievable |
| 23 | Code-as-Policy | Single | A | Achievable |
| 24 | Unknown tool usage | Single | D | Premature |

2.5 Scaling Patterns

| # | Pattern | Type | Category | Verdict |
|---|---|---|---|---|
| 25 | Scaling improves reasoning | - | E | Overinterpretation |
| 26 | Emergence through scaling | - | E | Metric artifact |
| 27 | Universal agent through scaling | - | E | Illusion |
| 28 | Reliability through redundancy | Multi | D | Fragile |

3. DETAILED ANALYSIS OF CRITICAL PATTERNS

3.1 Pattern: Autonomous Planning

Complete Analysis Grid

| Field | Content |
|---|---|
| 1. Pattern name | Autonomous Planning |
| 2. Type | Single-agent |
| 3. Perceived implicit promise | An LLM agent can decompose a complex objective into sub-steps, schedule these steps, and execute them autonomously until achieving the objective. |
| 4. Underlying technical hypothesis | The model has internalized, through training on human text, sufficient representations of causality and sequential logic to generate valid plans. |
| 5. Necessary conditions | (a) Ability to predict action effects, (b) Ability to backtrack when blocked, (c) Maintaining a coherent world model, (d) Distinction between current state and target state. |
| 6. What actually works | The model can generate action sequences that look like valid plans on domains frequent in training data. The textual form of a plan is often correct. |
| 7. Structural limitations (LLM) | LLMs are autoregressive systems with constant time per token. They cannot perform search in a state space. Token generation is not conditioned on logical validity verification. |
| 8. Systemic limitations | No internal mechanism for plan coherence verification. No explicit representation of action preconditions and effects. |
| 9. Typical failure modes | Invalid plans (impossible actions in current state), incomplete plans (forgotten sub-objectives), no recovery on step failure, circular dependencies between steps. |
| 10. Hidden dependencies | The prompt often contains examples of valid plans (few-shot). Humans implicitly validate plan feasibility. Tested domains are overrepresented in data. |
| 11. Robustness test | Minimal prompt: fails. Without human intervention: fails. With noise/ambiguity: fails. Long duration: fails. Multi-instances same model: not applicable. |
| 12. Verdict | Structurally impossible — Autonomous planning contradicts the fundamental architecture of autoregressive LLMs. |
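
This verdict is consistent with the LLM-Modulo framing cited in section 7 (Kambhampati et al., 2024): the model may propose candidate plans, but validity has to come from an external checker. A minimal sketch under that assumption; `validate_plan` stands in for a real deterministic plan validator and `llm` for any text-generation callable, both hypothetical.

```python
from typing import Callable, List, Optional, Tuple

def llm_modulo_plan(
    llm: Callable[[str], str],                               # text generator (assumption)
    validate_plan: Callable[[List[str]], Tuple[bool, str]],  # external, deterministic validator
    goal: str,
    max_rounds: int = 5,
) -> Optional[List[str]]:
    feedback = ""
    for _ in range(max_rounds):
        raw = llm(f"Goal: {goal}\nPrevious validator feedback: {feedback}\nOne action per line:")
        plan = [line.strip() for line in raw.splitlines() if line.strip()]
        ok, feedback = validate_plan(plan)   # validity comes from the verifier, never from the model
        if ok:
            return plan
    return None                              # no valid plan found: fail explicitly
```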

Reference Publications


3.2 Pattern: Multi-Agent Systems with Debate

Complete Analysis Grid

| Field | Content |
|---|---|
| 1. Pattern name | Multi-Agent Debate / Discussion |
| 2. Type | Multi-agent |
| 3. Perceived implicit promise | Multiple LLM agents, by confronting their responses, mutually correct their errors and converge toward a more accurate answer than a single agent. |
| 4. Underlying technical hypothesis | Response diversity + an arbitration mechanism allows filtering individual errors (similar to bagging in ML or majority voting). |
| 5. Necessary conditions | (a) Independence of errors between agents, (b) Capacity for constructive criticism, (c) Ability to distinguish a valid argument from a persuasive one, (d) Absence of shared systematic bias. |
| 6. What actually works | Debate improves results only when the correct answer is already “accessible” via training data (distributed memorization). |
| 7. Structural limitations (LLM) | All agents use the same model or similar models (homogeneity). Errors are correlated, not independent. Conformity bias pushes agents to align with the first response. |
| 8. Systemic limitations | No mechanism for objective truth. The most “persuasive” agent (verbose, confident) wins, not the most “correct”. Absence of ground truth prevents convergence toward truth. |
| 9. Typical failure modes | Consensus on a false answer (echo chamber). Error amplification through mutual validation. Infinite discussion loops without convergence. A “dominant” agent imposing its answer. |
| 10. Hidden dependencies | The orchestrator defines debate rules (who speaks when, end criteria). The initial prompt frames the debate. A human often selects the final answer. |
| 11. Robustness test | Minimal prompt: fails. Without human intervention: fails. With noise/ambiguity: fails severely. Long duration: degradation. Multi-instances same model: exacerbates problems. |
| 12. Verdict | Premature / Illusion — Multi-agent debate does not provide real collective intelligence. |
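
Given key finding 4 (gains from debate are often inferior to best-of-N sampling), any debate setup should at least be benchmarked against the much cheaper single-model baseline sketched below; `llm` is a placeholder for any sampling text-generation callable.

```python
from collections import Counter
from typing import Callable

def best_of_n(llm: Callable[[str], str], question: str, n: int = 5) -> str:
    """Sample n independent answers and keep the most frequent one (majority vote).
    No inter-agent talk: the baseline a debate setup must beat to justify its cost."""
    answers = [llm(question).strip() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```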

Reference Publications


3.3 Pattern: Self-Correction / Self-Refinement

Complete Analysis Grid

| Field | Content |
|---|---|
| 1. Pattern name | Self-Correction (Self-Refinement, Reflexion) |
| 2. Type | Single-agent |
| 3. Perceived implicit promise | An agent can detect its own errors, criticize them, and correct them iteratively until producing a valid answer. |
| 4. Underlying technical hypothesis | The model possesses a “meta-cognitive capacity” allowing it to evaluate the quality of its own outputs. |
| 5. Necessary conditions | (a) Error detection capability, (b) Cause diagnosis capability, (c) Appropriate correction generation capability, (d) Reliable stopping criterion. |
| 6. What actually works | Self-correction works if and only if an external verifier provides usable feedback (e.g., compiler error message, automated test result). |
| 7. Structural limitations (LLM) | The model uses the same weights to generate and to critique. Confirmation bias pushes it to validate its own answer. No distinct representation of “production” vs “evaluation”. |
| 8. Systemic limitations | Without an external signal, the model has no way to distinguish a correct answer from an incorrect but plausible one. The generated “critique” is itself subject to the same biases. |
| 9. Typical failure modes | Validation of a false answer as correct. Changing a correct answer to an incorrect one (sycophancy). Infinite “correction” loops without improvement. Critique without corrective action. |
| 10. Hidden dependencies | The prompt structure induces the “form” of self-criticism. Few-shot examples show how to critique. Humans often validate the final result. |
| 11. Robustness test | Minimal prompt: fails. Without human intervention: fails massively. With noise: degraded performance. Long duration: drift. Multi-instances same model: not applicable. |
| 12. Verdict | Structurally impossible without external verifier — Pure self-verification is an illusion. |
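
Field 6 is the operative condition: self-correction only works when an external verifier supplies the error signal. A minimal self-debugging sketch where that verifier is a test suite run out of process; the file layout, the `llm` callable, and the prompt format are assumptions, not a specific framework's API.

```python
import subprocess
from pathlib import Path
from typing import Callable, Optional

def self_debug(
    llm: Callable[[str], str],      # text generator (assumption)
    task: str,
    test_file: str,                 # e.g. "tests/test_task.py", assumed to exercise candidate.py
    max_attempts: int = 3,
) -> Optional[str]:
    feedback = ""
    for _ in range(max_attempts):
        code = llm(f"Task: {task}\nVerifier feedback:\n{feedback}\nReturn only Python code.")
        Path("candidate.py").write_text(code)
        # External, deterministic verifier: the test runner, not the model's self-critique.
        result = subprocess.run(
            ["python", "-m", "pytest", test_file, "-q"],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            return code
        feedback = (result.stdout + result.stderr)[-2000:]   # feed real errors back to the model
    return None
```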

Reference Publications


3.4 Pattern: Emergent Capabilities through Scaling

Complete Analysis Grid

| Field | Content |
|---|---|
| 1. Pattern name | Emergence through Scaling |
| 2. Type | Architecture (not agent-specific) |
| 3. Perceived implicit promise | By sufficiently increasing model size (parameters, data, compute), qualitatively new capabilities “emerge” discontinuously. |
| 4. Underlying technical hypothesis | Critical complexity thresholds exist beyond which the model acquires reasoning, planning, or understanding capabilities that were previously absent. |
| 5. Necessary conditions | (a) Real existence of phase transitions, (b) Observation independence from chosen metrics, (c) Robustness of emerged capabilities. |
| 6. What actually works | Scaling improves fluency, factual coverage, stylistic coherence, and reduction of trivial hallucinations. Performance on existing benchmarks increases. |
| 7. Structural limitations (LLM) | The observed “jumps” are artifacts of non-linear metrics (e.g., “pass/fail” vs continuous probability). Reasoning capabilities measured by specific benchmarks do not generalize. |
| 8. Systemic limitations | Benchmark contamination (presence in training data) creates an illusion of capability. Scaling does not modify the fundamental architecture (autoregressive, no world model). |
| 9. Typical failure modes | Regression on simple variants of “mastered” problems. Fragility to lexical perturbations. Success on benchmark, failure in real conditions. |
| 10. Hidden dependencies | The benchmark is selected to show the “jump”. Metrics are chosen post-hoc. Comparisons ignore cost (compute, data). |
| 11. Robustness test | Minimal prompt: variable. Without human intervention: partial. With perturbations: frequent failure. Long duration: stable on factual recall. Multi-instances same model: not applicable. |
| 12. Verdict | Overinterpretation / Metric artifact — “Emergent” capabilities are statistical illusions. |

Reference Publications


3.5 Pattern: Autonomous Multi-Agent Coordination

Complete Analysis Grid

| Field | Content |
|---|---|
| 1. Pattern name | Autonomous Multi-Agent Coordination |
| 2. Type | Multi-agent |
| 3. Perceived implicit promise | Multiple agents can coordinate autonomously, distribute tasks, and merge their results without rigid external orchestration. |
| 4. Underlying technical hypothesis | Agents develop implicit communication protocols and coordination mechanisms through natural language message exchange. |
| 5. Necessary conditions | (a) Mutual understanding of roles, (b) Unambiguous communication protocol, (c) Conflict detection and resolution, (d) Shared state synchronization. |
| 6. What actually works | Coordination works when the orchestrator (external code) explicitly defines flows, roles, and transition criteria. Success depends on the script, not the agents. |
| 7. Structural limitations (LLM) | Absence of reliable Theory of Mind. No explicit representation of other agents’ states. Natural language communication is inherently ambiguous. |
| 8. Systemic limitations | 80% of inter-agent exchanges are redundant (Zhang et al., 2024). Information passing between agents degrades the signal (information bottleneck). No mechanism for “shared truth”. |
| 9. Typical failure modes | Deadlock (each agent waits for the other). Work duplication. Unresolved resource conflicts. Loss of critical information during transfers. Infinite loops. |
| 10. Hidden dependencies | The Python/JavaScript orchestrator defines the real flow. Prompts rigidly specify roles. A human supervises blockages. |
| 11. Robustness test | Minimal prompt: chaos. Without human intervention: blocking or loop. With noise/ambiguity: collapse. Long duration: semantic drift. Multi-instances same model: amplified biases. |
| 12. Verdict | Premature / Structurally limited — Real coordination is in the orchestrator, not in the agents. |

Reference Publications


4. CROSS-CUTTING SYNTHESIS OF LIMITATIONS

4.1 Structural Limitations (Inherent to LLMs)

These limitations derive directly from the architecture of autoregressive language models and cannot be resolved by scaling or prompt engineering.

4.1.1 Absence of World Model

| Aspect | Observation |
|---|---|
| Nature | LLMs do not maintain an explicit representation of world state. Each token is predicted conditionally on the previous context, without an underlying causal model. |
| Consequence | Inability to predict action effects, simulate future states, or reason counterfactually. |
| Implication for agents | The “planning” observed is textual pattern completion of plans, not valid plan generation. |
| Publications | Lopez-Paz et al. (2024), Kambhampati et al. (2024) |

4.1.2 Reasoning as Pattern-Matching

| Aspect | Observation |
|---|---|
| Nature | What appears as “reasoning” is probabilistic interpolation between patterns seen during training. |
| Consequence | Failure on out-of-distribution problems, simple lexical variations, new compositions of known concepts. |
| Implication for agents | “From scratch” reasoning is absent. The model recognizes typical solutions but does not derive them. |
| Publications | Mittal et al. (2024), Dziri et al. (2023) |

4.1.3 Constant Time per Token

| Aspect | Observation |
|---|---|
| Nature | An LLM takes essentially constant time to generate each token, regardless of the logical complexity required. |
| Consequence | Impossibility of solving problems whose complexity varies (e.g., combinatorial search, logical verification). |
| Implication for agents | NP-complete or semi-decidable problems cannot be solved by token generation. |
| Publications | Kambhampati et al. (2024), Yedidia et al. (2024) |

4.1.4 Reversal Curse

| Aspect | Observation |
|---|---|
| Nature | If the model learned “A is the father of B”, it does not automatically deduce “B is the son of A”. |
| Consequence | Logical relations are not represented bidirectionally. |
| Implication for agents | Symmetric or inverse reasoning requires explicit presence in the data. |
| Publications | Hui et al. (2024), Levy et al. (2024) |

4.2 Systemic Limitations (Inherent to Agentic Architectures)

4.2.1 Error Cascade

| Aspect | Observation |
|---|---|
| Nature | In a multi-agent or multi-step system, a minor error from one component propagates and amplifies. |
| Frequency | Critical — identified as a major cause of failure in 80%+ of multi-agent systems. |
| Implication | The reliability of a chain of N steps is approximately (step_reliability)^N. With 90% per step, 10 steps give ~35% reliability. |
| Publications | Lin et al. (2024), Cemri et al. (2025) |
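
The compounding described in the table can be checked directly; the second figure shows that even 99% per-step reliability erodes noticeably over ten steps.

```python
def chain_reliability(per_step: float, steps: int) -> float:
    """End-to-end reliability of a chain of independent steps: per_step ** steps."""
    return per_step ** steps

print(round(chain_reliability(0.90, 10), 3))   # 0.349 -> ten 90%-reliable steps give ~35%
print(round(chain_reliability(0.99, 10), 3))   # 0.904 -> even 99% per step loses ~10% over ten steps
```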

4.2.2 Responsibility Dilution

| Aspect | Observation |
|---|---|
| Nature | In large multi-agent systems, no agent is “responsible” for the final result, creating waiting loops. |
| Consequence | Blockages, non-decisions, infinite responsibility passing. |
| Publications | Li et al. (2024) |

4.2.3 Error Homogeneity

| Aspect | Observation |
|---|---|
| Nature | If all agents use the same base model (or similar models), their errors are correlated. |
| Consequence | Majority voting or cross-verification does not correct shared systematic biases. |
| Publications | Schwartz et al. (2024), Pärnamaa et al. (2024) |

4.2.4 Exponential Cost

| Aspect | Observation |
|---|---|
| Nature | Multi-agent architectures consume 5x to 500x more tokens for marginal gains (<5%). |
| Consequence | Economic non-viability for most use cases. |
| Publications | Bansal et al. (2024), Zhou et al. (2024) |

4.3 Summary Table: The Glass Ceiling of Scaling

| Capability | Scaling Impact | Identified Ceiling |
|---|---|---|
| Factual knowledge | Significant improvement | Limited by data exhaustion |
| Linguistic fluency | Improvement | Nearly saturated |
| Stylistic coherence | Improvement | Nearly saturated |
| Logical reasoning | Marginal improvement | ~85-90% on controlled benchmarks |
| Autonomous planning | No structural improvement | Architectural ceiling |
| Causality | No improvement | Absent from architecture |
| Robustness to perturbations | No improvement | Intrinsic fragility |
| Self-verification | No improvement | Impossible by design |

5. LIST OF RECURRENT ILLUSIONS

This section lists patterns that are regularly presented as acquired capabilities but which, upon analysis, are illusions or overinterpretations.

5.1 Illusion: The Agent “Understands” the Task

| Aspect | Reality |
|---|---|
| Appearance | The agent produces a coherent and relevant response. |
| Actual mechanism | Pattern-matching on similar tasks seen during training. |
| Falsification test | Slightly modify the formulation or entity names -> collapse. |
| Reference | Valmeekam et al. (2024) — PlanBench: variable renaming. |
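
The falsification test in the last rows can be scripted: re-ask the same task with entity names swapped and compare the two answers. A robust capability should give an equivalent answer up to the renaming; pattern-matching typically does not. In this sketch, `llm` and the rename map are placeholders.

```python
from typing import Callable, Dict

def perturbation_probe(llm: Callable[[str], str], task: str, renames: Dict[str, str]) -> Dict[str, str]:
    """Re-ask the same task with entity names swapped; compare answers modulo the renaming."""
    perturbed = task
    for old, new in renames.items():          # e.g. {"Alice": "Zorblax", "Paris": "CityA"}
        perturbed = perturbed.replace(old, new)
    return {"original": llm(task), "perturbed": llm(perturbed)}
```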

5.2 Illusion: Multi-Agent is More Intelligent than Single-Agent

| Aspect | Reality |
|---|---|
| Appearance | The multi-agent system solves complex problems. |
| Actual mechanism | The orchestrator (Python/JS code) defines the real logic. Agents are text generators in a predefined workflow. |
| Falsification test | Replace LLM calls with templates -> similar results on structured tasks. |
| Reference | Zhang et al. (2025) — 90% of the success is attributable to the orchestrator. |

5.3 Illusion: The Agent Learns from Its Mistakes

| Aspect | Reality |
|---|---|
| Appearance | After several attempts, the agent produces a correct answer. |
| Actual mechanism | External feedback (compiler error, test result) guides correction. Without feedback, no learning. |
| Falsification test | Remove external feedback -> no convergence. |
| Reference | Huang et al. (2024), Shinn et al. (2023). |

5.4 Illusion: Debate Improves Accuracy

| Aspect | Reality |
|---|---|
| Appearance | After discussion between agents, the final answer is better. |
| Actual mechanism | If the correct answer is in training data, debate can “surface” it. Otherwise, consensus on an error. |
| Falsification test | Test on truly new problems -> no improvement. |
| Reference | Liang et al. (2023), Du et al. (2024). |

5.5 Illusion: Capabilities Emerge with Scaling

| Aspect | Reality |
|---|---|
| Appearance | From a certain size, the model suddenly “acquires” a capability. |
| Actual mechanism | Artifact of metric choice (binary vs continuous). Continuous curves show gradual improvement, no jump. |
| Falsification test | Use continuous metrics -> the “jump” disappears. |
| Reference | Schaeffer et al. (2023). |

5.6 Illusion: The Agent is Autonomous

| Aspect | Reality |
|---|---|
| Appearance | The agent accomplishes a task “end-to-end”. |
| Actual mechanism | The prompt engineer optimized the instructions. Failure cases are filtered out of demos. A human validates behind the scenes. |
| Falsification test | Deploy without supervision -> 41-87% failure rate (Cemri et al., 2025). |
| Reference | Horton (2023), Luo et al. (2024). |

5.7 Illusion: RAG “Understands” Documents

| Aspect | Reality |
|---|---|
| Appearance | The agent responds correctly by citing sources. |
| Actual mechanism | Vector similarity + conditioned generation. No logical understanding of the document. |
| Falsification test | Insert contradictory information -> the agent cites both without reconciling them. |
| Reference | Pradeep et al. (2024), Liu et al. (2024). |

5.8 Illusion: The Agent Plans

| Aspect | Reality |
|---|---|
| Appearance | The agent produces a sequence of steps that looks like a plan. |
| Actual mechanism | Text completion in plan format. No validity verification, no simulation. |
| Falsification test | Request a plan for an invented domain -> a “coherent” but unfeasible plan. |
| Reference | Kambhampati et al. (2024). |

6. REALISTIC DESIGN PRINCIPLES

6.1 Principle 1: Mandatory Closed Loop

An agent can only improve its performance if an external and deterministic verifier provides usable feedback.

Implementation:
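
One possible shape for the loop, abstracting the planning and self-debugging sketches of section 3; `verify` stands for whatever deterministic checker the domain offers (compiler, test suite, simulator) and `llm` for any text-generation callable, both assumptions rather than a prescribed API.

```python
from typing import Callable, Optional, Tuple

Verifier = Callable[[str], Tuple[bool, str]]   # deterministic: (passed, machine-usable feedback)

def closed_loop(llm: Callable[[str], str], task: str, verify: Verifier,
                max_rounds: int = 4) -> Optional[str]:
    feedback = ""
    for _ in range(max_rounds):
        candidate = llm(f"{task}\nPrevious verifier feedback:\n{feedback}")
        ok, feedback = verify(candidate)       # the only source of truth in the loop
        if ok:
            return candidate
    return None                                # surface the failure instead of guessing
```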

Viable examples:

6.2 Principle 2: Limited and Explicit Scope

Each agent must operate in a narrow, well-defined domain where its patterns are overrepresented in training data.

Implementation:

Anti-pattern to avoid:

6.3 Principle 3: Explicit Orchestration

In a multi-agent system, all coordination logic must be in the orchestrator (code), not in prompts.

Implementation:
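
As an illustration, a sketch in which every coordination decision (order, handoffs, stop rule, retry) is ordinary code and the agents are plain text generators invoked at fixed points; the role names and prompts are hypothetical.

```python
from typing import Callable, Dict

def run_pipeline(agents: Dict[str, Callable[[str], str]], document: str) -> str:
    """All coordination lives here, in plain code: order, handoffs, and stop rules.
    The agents are just text generators called at fixed points (role names are placeholders)."""
    summary = agents["summarizer"](document)
    draft = agents["writer"](f"Write a report from this summary:\n{summary}")
    review = agents["reviewer"](f"List concrete defects in this report:\n{draft}")
    if "no defects" in review.lower():          # transition criterion decided by code, not by debate
        return draft
    return agents["writer"](f"Revise the report.\nDefects:\n{review}\nDraft:\n{draft}")
```

Replacing any of these LLM calls with a template should not change the control flow, which is one way to check where the system's real logic sits (cf. illusion 5.2).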

Corollary:

6.4 Principle 4: Human Supervision Beyond 85%

To achieve reliability above 85-90%, invest in human supervision, not model augmentation.

Implementation:
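
A sketch of one way to apply the principle: accept automatically only what an external check passes with margin, and queue everything else for human review. A `verify` function returning a confidence score is an assumption about the surrounding system, not a standard API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Reviewed:
    output: str
    needs_human: bool

def route_with_escalation(
    llm: Callable[[str], str],
    verify: Callable[[str], Tuple[bool, float]],   # hypothetical: (passed, confidence score)
    tasks: List[str],
    threshold: float = 0.9,
) -> List[Reviewed]:
    """Auto-accept only what the external check passes with high confidence;
    everything else goes to the human review queue (the 'last 10%')."""
    results = []
    for task in tasks:
        out = llm(task)
        ok, score = verify(out)
        results.append(Reviewed(output=out, needs_human=not ok or score < threshold))
    return results
```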

Economic reality:

6.5 Principle 5: Documentation of Hidden Dependencies

Any system presented as “autonomous” must make explicit its implicit human dependencies.

Mandatory checklist:

6.6 Principle 6: Realistic Metrics

Measure performance in real conditions, not on contaminated benchmarks.

Implementation:
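
A sketch of what realistic measurement can look like: task-level success graded externally on held-out tasks, with cost and latency reported alongside accuracy; the `agent` and `grade` callables are placeholders.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class RunReport:
    success_rate: float
    avg_latency_s: float
    total_tokens: int    # whatever token accounting the surrounding stack exposes (assumption)

def evaluate(agent: Callable[[str], Tuple[str, int]],   # returns (answer, tokens_used)
             grade: Callable[[str, str], bool],         # external, task-level grader
             tasks: List[Tuple[str, str]]) -> RunReport:
    """Task-level success on held-out tasks, with cost attached to the accuracy number."""
    wins, tokens, start = 0, 0, time.time()
    for prompt, reference in tasks:
        answer, used = agent(prompt)
        tokens += used
        wins += int(grade(answer, reference))
    n = max(len(tasks), 1)
    return RunReport(success_rate=wins / n,
                     avg_latency_s=(time.time() - start) / n,
                     total_tokens=tokens)
```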

Metrics to avoid:

6.7 Summary Table: What Works vs. What Doesn’t Work

| Works | Doesn’t Work |
|---|---|
| Code generation with automated tests | Code generation without validation |
| Translation between formal languages | Logical reasoning “from scratch” |
| Completion in a narrow domain | Universal agent |
| RAG with verifiable sources | RAG without relevance verification |
| Explicitly coded orchestration | Emergent coordination |
| Self-debugging with compiler | Self-correction without feedback |
| Structured extraction to defined schema | “Deep” document understanding |

7. REFERENCE BIBLIOGRAPHIC CORPUS

7.1 Key Publications (Top 20)

| # | Reference | Category | Main Contribution |
|---|---|---|---|
| 1 | Kambhampati et al. (2024) - “LLMs Can’t Plan” | C | Formal proof of LLMs’ inability to plan autonomously. LLM-Modulo framework. |
| 2 | Cemri et al. (2025) - “Why Do Multi-Agent LLM Systems Fail?” | D | MAST taxonomy: 14 failure modes, 1600+ annotated traces, 41-87% failure rate. |
| 3 | Valmeekam et al. (2023-2025) - PlanBench Series | D | Benchmark showing performance collapse with lexical perturbations. |
| 4 | Schaeffer et al. (2023) - “Emergence or Metrics?” | C | Demonstration that “emergent” capabilities are metric artifacts. |
| 5 | Huang et al. (2024) - “Self-Correction Fallacy” | C | Proof that self-correction without external feedback is illusory. |
| 6 | Dziri et al. (2023) - “Faith and Fate” | C | Multi-step reasoning proceeds by probabilistic search, not calculation, and collapses on composition. |
| 7 | Madaan et al. (2023) - “Self-Refine” | B | Success conditions for self-improvement: external feedback required. |
| 8 | Shinn et al. (2023) - “Reflexion” | B | Improvement only with a deterministic evaluator. |
| 9 | Liang et al. (2023) - Multi-agent Debate | D | Debate only improves if the solution is memorized. |
| 10 | Park et al. (2023) - “Generative Agents” | A/B | Viable memory/planning architecture for narrative simulation, not problem solving. |
| 11 | Yao et al. (2023) - “ReAct” | A | Effective reasoning-action coupling, but sensitive to feedback noise. |
| 12 | Schick et al. (2023) - “Toolformer” | A | Demonstration that tool use is possible but remains text completion. |
| 13 | Wu et al. (2023) - “AutoGen” | B | Coordination for scripted tasks, failure on semantically unexpected events. |
| 14 | Zhou et al. (2024) - “Code-as-Policy” | B | Superiority of deterministic approaches over pure LLM reasoning. |
| 15 | Stechly et al. (2024) - “Backtracking Failure” | D | Impossibility of systematic backtracking. |
| 16 | Gandhi et al. (2024) - “Theory of Mind Gap” | C | Agents fail to model other agents’ beliefs. |
| 17 | Liu et al. (2023) - “Lost in the Middle” | C | Long-context performance degrades; information in the middle of the context is lost. |
| 18 | Toyer et al. (2024) - “Tensor Trust” | D | Defenses are easily bypassed via semantic jailbreaks. |
| 19 | Greshake et al. (2024) - “Indirect Injection” | C | Vulnerability to hidden instructions in third-party content. |
| 20 | Kaplan et al. (2024) - “Revised Scaling Laws” | B | The marginal cost of reasoning improvement becomes prohibitive. |

7.2 Detailed Source Classification

Category A: Demonstrated and Robust Capability

| Reference | What is demonstrated | Validity conditions |
|---|---|---|
| Yao et al. (2023) - ReAct | Thought-action coupling on web tasks | Usable feedback, limited domain |
| Schick et al. (2023) - Toolformer | Autonomous learning of API usage | Well-documented APIs, simple tasks |
| Rozière et al. (2023) - CodeLlama | High-quality code completion | Popular languages, local context |
| Zheng et al. (2024) - Unit tests | Test generation with Pytest loop | Deterministic framework feedback |

Category B: Conditional / Fragile Capability

| Reference | What works partially | Fragility point |
|---|---|---|
| Madaan et al. (2023) - Self-Refine | Iterative improvement with feedback | Without external feedback: failure |
| Shinn et al. (2023) - Reflexion | Learning through reflection | Requires deterministic evaluator |
| Park et al. (2023) - Generative Agents | Coherent social simulation | Fails on problem solving |
| Wu et al. (2023) - AutoGen | Scripted multi-agent coordination | Fails on unexpected events |
| Zhou et al. (2024) - Code-as-Policy | Plan execution in code | Limited to codifiable domains |

Category C: Identified Structural Limitation

| Reference | Identified limitation | Implication |
|---|---|---|
| Kambhampati et al. (2024) | LLMs cannot plan | External verifier necessary |
| Schaeffer et al. (2023) | Emergence is a metric artifact | No real qualitative jump |
| Huang et al. (2024) | Self-correction is illusory | External feedback mandatory |
| Dziri et al. (2023) | Reasoning = probabilistic search | No logical calculation |
| Liu et al. (2023) | Lost in the Middle | Long context degrades performance |
| Gandhi et al. (2024) | No Theory of Mind | Inter-agent coordination limited |
| Greshake et al. (2024) | Indirect injection | Structural vulnerability |

Category D: Documented Failure

| Reference | Documented failure | Failure rate |
|---|---|---|
| Cemri et al. (2025) - MAST | Multi-agent systems | 41-87% |
| Valmeekam et al. (2023-2025) | Planning with perturbations | ~90% drop |
| Stechly et al. (2024) | Backtracking | Systematic |
| Liang et al. (2023) | Debate on new problems | No improvement |
| Toyer et al. (2024) | Security defenses | Bypassable |

Category E: Premature Promise / Overinterpretation

| Reference | Promise | Reality |
|---|---|---|
| Zhu et al. (2025) | Emergent capabilities | Metric optimization |
| Gudibande et al. (2024) | Distillation preserves logic | Copies style, not logic |
| Talebirad et al. (2024) | Emergent cooperation | Repetition of politeness patterns |
| Marcus et al. (2024) | Scaling bridges neuro-symbolic gap | Architectural limits |

7.3 Publications <-> Patterns Mapping

| Pattern | Reference publications |
|---|---|
| Autonomous planning | Kambhampati (2024), Valmeekam (2023-2025), Stechly (2024) |
| Self-correction | Huang (2024), Madaan (2023), Shinn (2023), Liu (2024) |
| Multi-agent debate | Liang (2023), Du (2024), Gandhi (2024) |
| Multi-agent coordination | Cemri (2025), Zhang (2025), Li (2024), Nguyen (2024) |
| Emergent capabilities | Schaeffer (2023), Zhu (2025), Gudibande (2024) |
| Tool-use | Yao (2023), Schick (2023), Patil (2023), Qin (2024) |
| Long-term memory | Liu (2023) |
| Agent security | Greshake (2024), Toyer (2024) |

7.4 Identified Evidence Gaps

The following domains lack robust empirical evidence despite frequent claims:

| Domain | Common claim | Evidence status |
|---|---|---|
| Causal reasoning | “The model understands causal relations” | No positive evidence |
| Authentic creativity | “The model generates truly new ideas” | Not falsifiable with current metrics |
| Deep understanding | “The model understands text meaning” | Operationally indistinguishable from pattern-matching |
| Continuous learning | “The agent improves with experience” | Accumulation, not generalization |
| Self-awareness | “The model knows what it doesn’t know” | Imperfect calibration, not metacognition |

8. APPENDICES

8.1 Pattern Evaluation Checklist

For any presented agentic pattern, apply this checklist:

[] 1. Does the pattern remain valid if the prompt is reduced to essentials?
[] 2. Does it work without implicit human intervention?
[] 3. Does it resist noise or ambiguity in input?
[] 4. Does it hold over time (beyond a short session)?
[] 5. Does it remain valid when multiple instances of the same model interact?
[] 6. Are results reproducible with other models of the same class?
[] 7. Is the benchmark used free from contamination?
[] 8. Do metrics capture real task success?
[] 9. Is total cost (tokens, latency, supervision) viable?
[] 10. Are human dependencies explicitly documented?

VERDICT:
- 10/10 -> Demonstrated robust capability (rare)
- 7-9/10 -> Conditional capability (document conditions)
- 4-6/10 -> Fragile pattern (don't promise in production)
- 0-3/10 -> Illusion or premature promise
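
For teams that track this rubric in tooling, a direct transcription of the verdict bands above:

```python
def checklist_verdict(score: int) -> str:
    """Map the 10-point checklist score to the verdict bands defined above."""
    if score == 10:
        return "Demonstrated robust capability (rare)"
    if score >= 7:
        return "Conditional capability (document conditions)"
    if score >= 4:
        return "Fragile pattern (don't promise in production)"
    return "Illusion or premature promise"
```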

8.2 Technical Glossary

| Term | Operational definition |
|---|---|
| Agent | Software system combining an LLM with a perception-reasoning-action loop |
| Orchestrator | Code (non-LLM) that defines workflow and coordinates agents |
| Scaling | Increase in parameters, data, or compute |
| Emergence | Appearance of qualitatively new capabilities (subject to controversy) |
| Pattern-matching | Identification of similarities with training data |
| World model | Explicit internal representation of world state and dynamics |
| Feedback loop | Cycle where action output is used to modify the next action |
| Ground truth | Correct reference value for evaluating a prediction |
| Contamination | Presence of test data in training data |
| Sycophancy | Tendency to modify one’s answer to satisfy the interlocutor |

8.3 Complete Bibliographic References

Academic Publications

  1. Kambhampati, S., Valmeekam, K., Guan, L., et al. (2024). “LLMs Can’t Plan, But Can Help Planning in LLM-Modulo Frameworks.” Proceedings of ICML 2024, PMLR 235.
  2. Cemri, M., Pan, M. Z., Yang, S., et al. (2025). “Why Do Multi-Agent LLM Systems Fail?” arXiv preprint arXiv:2503.13657.
  3. Valmeekam, K., Marquez, M., Sreedharan, S., & Kambhampati, S. (2023). “On the Planning Abilities of Large Language Models—A Critical Investigation.” NeurIPS 2023.
  4. Schaeffer, R., Miranda, B., & Koyejo, S. (2023). “Are Emergent Abilities of Large Language Models a Mirage?” NeurIPS 2023.
  5. Huang, J., Shao, Z., et al. (2024). “Large Language Models Cannot Self-Correct Reasoning Yet.” ICLR 2024.
  6. Dziri, N., Lu, X., Sclar, M., et al. (2023). “Faith and Fate: Limits of Transformers on Compositionality.” NeurIPS 2023.
  7. Madaan, A., Tandon, N., et al. (2023). “Self-Refine: Iterative Refinement with Self-Feedback.” NeurIPS 2023.
  8. Shinn, N., Cassano, F., et al. (2023). “Reflexion: Language Agents with Verbal Reinforcement Learning.” NeurIPS 2023.
  9. Yao, S., Zhao, J., et al. (2023). “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR 2023.
  10. Park, J. S., O’Brien, J., et al. (2023). “Generative Agents: Interactive Simulacra of Human Behavior.” UIST 2023.
  11. Liang, T., He, Z., et al. (2023). “Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate.” arXiv preprint arXiv:2305.19118.
  12. Gandhi, K., et al. (2024). “Understanding Social Reasoning in Language Models with Language Models.” NeurIPS 2024.
  13. Liu, N. F., Lin, K., et al. (2023). “Lost in the Middle: How Language Models Use Long Contexts.” TACL 2024.
  14. Greshake, K., et al. (2024). “Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection.” AISec 2023.
  15. Stechly, K., Marquez, M., & Kambhampati, S. (2024). “GPT-4 Doesn’t Know It’s Wrong: An Analysis of Iterative Prompting for Reasoning Problems.” NeurIPS FM4DM Workshop.

9. CONCLUSION

This meta-analysis establishes a critical framework for evaluating the real capabilities of LLM-based AI agents. The main findings are:

  1. Limitations are structural, not circumstantial: The autoregressive architecture of LLMs imposes performance ceilings that scaling alone cannot exceed.
  2. Multi-agent is not a solution to single-agent limitations: Multi-agent systems inherit their components’ limitations and add their own failure modes (coordination, error cascade).
  3. Autonomy is a carefully maintained illusion: “Autonomous” systems depend on optimized prompts, coded orchestrators, and implicit human supervision.
  4. Real successes are in deterministic feedback domains: Code with tests, SQL with validation, robotics with simulator — closed loops work.
  5. Reliability beyond 85-90% requires humans: For critical cases, human supervision remains more effective and economical than architectural augmentation.

This analysis does not aim to discourage AI agent development, but to establish realistic expectations and robust design principles. AI agents are powerful tools when used within their validity domains, with clear awareness of their limitations.




End of document


