
Single and Multi-Agent Systems Based on LLMs
1. Introduction
1.1 Analysis Objective
This analysis identifies, classifies, and evaluates patterns related to AI agents built on large language models (LLMs) that appear promising but whose feasibility remains structurally limited with current architectures. The goal is to provide a critical, falsifiable, and non-speculative framework for evaluating the real capabilities of agentic systems.
1.2 Scope
Included:
- LLM-based agents (GPT-4, Claude, Llama, etc.)
- Single-agent and multi-agent architectures
- Orchestration, memory, tools, and supervision systems
- Observable and reproducible cases documented in the literature
Strictly excluded:
- Speculative Artificial General Intelligence (AGI)
- Marketing claims without empirical data
- Anthropomorphic descriptions of capabilities
- Untestable hypotheses
1.3 Operational Definitions
AI Agent: LLM-based system capable of perceiving a state, producing local reasoning (token generation), and triggering actions via explicit orchestration (code, API, tools).
Multi-Agent System: Set of agents coordinated by an explicit protocol. Multi-agent does not imply any emergent collective intelligence by default; any appearance of superior coordination comes from the orchestrator or communication protocol.
Scaling: Increase in model parameters, training data, or inference compute.
1.4 Summary of Key Findings
Critical Observations
- Scaling does not solve structural limitations: Increasing model size improves factual knowledge and linguistic fluency but does not correct the absence of autonomous planning, causal reasoning, or deep logical understanding (Kambhampati et al., 2024; Valmeekam et al., 2025).
- Multi-agent systems fail in 41-87% of cases: The MAST study (Cemri et al., 2025) identifies 14 distinct failure modes, 79% of which stem from specification and coordination problems, not technical infrastructure limitations.
- Self-correction without external feedback is illusory: Agents cannot detect their own errors without an external deterministic verifier (compiler, test, oracle). Self-criticism increases confidence without improving accuracy (Huang et al., 2024; Stechly et al., 2024).
- Multi-agent does not outperform single-agent on most benchmarks: Performance gains are marginal and often inferior to simple approaches like best-of-N sampling (Kapoor et al., 2024; Wang et al., 2024).
- “Emergent capabilities” are metric artifacts: Apparent qualitative jumps during scaling result from non-linear metric choices, not real cognitive phase transitions (Schaeffer et al., 2023).
Viable vs. Premature Patterns
| Category | Viability | Examples |
|---|---|---|
| Functional | Tasks where feedback is deterministic | Unit tests, code translation, SQL |
| Fragile | Dependency on specific prompts | RAG, self-consistency, ReAct |
| Premature | Promise without robustness proof | Universal agent, autonomous planning |
| Structurally impossible | Contradiction with architecture | Self-verification, intrinsic causal reasoning |
1.5 Design Recommendations
- Prefer closed loops: Agentic success requires an external deterministic verifier (compiler, simulator, automated test).
- Limit scope per agent: Performant agents operate in narrow, well-defined domains, not as “generalists”.
- Treat multi-agent as orchestration, not collective intelligence: The real “locus of decision” is the orchestrator (often Python code), not the agents themselves.
- Assume 85-90% as reliability ceiling: For the last 10%, invest in human supervision rather than model augmentation.
- Document hidden dependencies: Any “autonomous” system must make explicit its implicit human dependencies (prompt engineering, data selection, validation).
2. SUMMARY TABLE OF ANALYZED PATTERNS
Legend of Categories
| Code | Meaning |
|---|---|
| A | Demonstrated and robust capability |
| B | Conditional / fragile capability |
| C | Identified structural limitation |
| D | Documented failure |
| E | Premature promise / overinterpretation |
2.1 Planning and Reasoning Patterns
| # | Pattern | Type | Category | Verdict |
|---|---|---|---|---|
| 1 | Autonomous planning | Single | C/D | Structurally impossible |
| 2 | Self-verification | Single | C | Structurally impossible |
| 3 | Causal reasoning | Single | C | Premature |
| 4 | Chain-of-Thought | Single | B | Fragile |
| 5 | Automatic backtracking | Single | D | Structurally impossible |
| 6 | Iterative reflection | Single | B/C | Fragile |
| 7 | Self-Consistency | Single | B | Achievable with reservations |
| 8 | Tree-of-Thought | Single | B | Fragile |
2.2 Multi-Agent Patterns
| # | Pattern | Type | Category | Verdict |
|---|---|---|---|---|
| 9 | Multi-agent debate | Multi | D/E | Premature |
| 10 | Emergent collective intelligence | Multi | E | Illusion |
| 11 | Role specialization | Multi | B/C | Fragile |
| 12 | Autonomous coordination | Multi | D | Premature |
| 13 | Self-organization | Multi | D | Structurally impossible |
| 14 | Cross-verification | Multi | C/D | Fragile |
| 15 | Multi-agent consensus | Multi | D | Illusion |
2.3 Memory and Context Patterns
| # | Pattern | Type | Category | Verdict |
|---|---|---|---|---|
| 16 | Autonomous long-term memory | Single/Multi | C | Premature |
| 17 | RAG (Retrieval-Augmented) | Single | A/B | Achievable with reservations |
| 18 | Extended context (>100k tokens) | Single | B/C | Fragile |
| 19 | Autonomous memory update | Single | D | Premature |
2.4 Tool-Use Patterns
| # | Pattern | Type | Category | Verdict |
|---|---|---|---|---|
| 20 | Simple tool-use | Single | A | Achievable |
| 21 | Sequential tool-use (>3 tools) | Single | B/C | Fragile |
| 22 | Self-debugging with compiler | Single | A/B | Achievable |
| 23 | Code-as-Policy | Single | A | Achievable |
| 24 | Unknown tool usage | Single | D | Premature |
2.5 Scaling Patterns
| # | Pattern | Type | Category | Verdict |
|---|---|---|---|---|
| 25 | Scaling improves reasoning | - | E | Overinterpretation |
| 26 | Emergence through scaling | - | E | Metric artifact |
| 27 | Universal agent through scaling | - | E | Illusion |
| 28 | Reliability through redundancy | Multi | D | Fragile |
3. DETAILED ANALYSIS OF CRITICAL PATTERNS
3.1 Pattern: Autonomous Planning
Complete Analysis Grid
| Field | Content |
|---|---|
| 1. Pattern name | Autonomous Planning |
| 2. Type | Single-agent |
| 3. Perceived implicit promise | An LLM agent can decompose a complex objective into sub-steps, schedule these steps, and execute them autonomously until achieving the objective. |
| 4. Underlying technical hypothesis | The model has internalized, through training on human text, sufficient representations of causality and sequential logic to generate valid plans. |
| 5. Necessary conditions | (a) Ability to predict action effects, (b) Ability to backtrack when blocked, (c) Maintaining a coherent world model, (d) Distinction between current state and target state. |
| 6. What actually works | The model can generate action sequences that look like valid plans on domains frequent in training data. The textual form of a plan is often correct. |
| 7. Structural limitations (LLM) | LLMs are autoregressive systems with constant time per token. They cannot perform search in a state space. Token generation is not conditioned on logical validity verification. |
| 8. Systemic limitations | No internal mechanism for plan coherence verification. No explicit representation of action preconditions and effects. |
| 9. Typical failure modes | Invalid plans (impossible actions in current state), incomplete plans (forgotten sub-objectives), no recovery on step failure, circular dependencies between steps. |
| 10. Hidden dependencies | The prompt often contains examples of valid plans (few-shot). Humans implicitly validate plan feasibility. Tested domains are overrepresented in data. |
| 11. Robustness test | Minimal prompt: fails. Without human intervention: fails. With noise/ambiguity: fails. Long duration: fails. Multi-instances same model: not applicable. |
| 12. Verdict | Structurally impossible — Autonomous planning contradicts the fundamental architecture of autoregressive LLMs. |
Reference Publications
- Kambhampati et al. (2024) — “LLMs Can’t Plan, But Can Help Planning in LLM-Modulo Frameworks” — ICML 2024. Formally demonstrates that autoregressive LLMs cannot plan by themselves.
- Valmeekam et al. (2023) — PlanBench series. Shows that planning performance collapses with simple variable renamings.
- Stechly et al. (2024) — Analysis of backtracking failure.
3.2 Pattern: Multi-Agent Systems with Debate
Complete Analysis Grid
| Field | Content |
|---|---|
| 1. Pattern name | Multi-Agent Debate / Discussion |
| 2. Type | Multi-agent |
| 3. Perceived implicit promise | Multiple LLM agents, by confronting their responses, mutually correct their errors and converge toward a more accurate answer than a single agent. |
| 4. Underlying technical hypothesis | Response diversity + an arbitration mechanism allows filtering individual errors (similar to bagging in ML or majority voting). |
| 5. Necessary conditions | (a) Independence of errors between agents, (b) Capacity for constructive criticism, (c) Ability to distinguish a valid argument from a persuasive one, (d) Absence of shared systematic bias. |
| 6. What actually works | Debate improves results only when the correct answer is already “accessible” via training data (distributed memorization). |
| 7. Structural limitations (LLM) | All agents use the same model or similar models (homogeneity). Errors are correlated, not independent. Conformity bias pushes agents to align with the first response. |
| 8. Systemic limitations | No mechanism for objective truth. The most “persuasive” agent (verbose, confident) wins, not the most “correct”. Absence of ground truth prevents convergence toward truth. |
| 9. Typical failure modes | Consensus on a false answer (echo chamber). Error amplification through mutual validation. Infinite discussion loops without convergence. “Dominant” agent imposing its answer. |
| 10. Hidden dependencies | The orchestrator defines debate rules (who speaks when, end criteria). The initial prompt frames the debate. A human often selects the final answer. |
| 11. Robustness test | Minimal prompt: fails. Without human intervention: fails. With noise/ambiguity: fails severely. Long duration: degradation. Multi-instances same model: exacerbates problems. |
| 12. Verdict | Premature / Illusion — Multi-agent debate does not provide real collective intelligence. |
Reference Publications
- Liang et al. (2023) — Multi-agent debate: improvement only when the correct solution is already memorized.
- Du et al. (2024) — Multi-agent debate.
- Gandhi et al. (2024) — “Theory of Mind Gap”: agents fail to model other agents’ beliefs.
3.3 Pattern: Self-Correction / Self-Refinement
Complete Analysis Grid
| Field | Content |
|---|---|
| 1. Pattern name | Self-Correction (Self-Refinement, Reflexion) |
| 2. Type | Single-agent |
| 3. Perceived implicit promise | An agent can detect its own errors, criticize them, and correct them iteratively until producing a valid answer. |
| 4. Underlying technical hypothesis | The model possesses a “meta-cognitive capacity” allowing it to evaluate the quality of its own outputs. |
| 5. Necessary conditions | (a) Error detection capability, (b) Cause diagnosis capability, (c) Appropriate correction generation capability, (d) Reliable stopping criterion. |
| 6. What actually works | Self-correction works if and only if an external verifier provides usable feedback (e.g., compiler error message, automated test result). |
| 7. Structural limitations (LLM) | The model uses the same weights to generate and to critique. Confirmation bias pushes to validate its own answer. No distinct representation of “production” vs “evaluation”. |
| 8. Systemic limitations | Without external signal, the model has no way to distinguish a correct answer from an incorrect but plausible one. The generated “critique” is itself subject to the same biases. |
| 9. Typical failure modes | Validation of a false answer as correct. Changing a correct answer to an incorrect one (sycophancy). Infinite “correction” loops without improvement. Critique without corrective action. |
| 10. Hidden dependencies | The prompt structure induces the “form” of self-criticism. Few-shot examples show how to critique. Humans often validate the final result. |
| 11. Robustness test | Minimal prompt: fails. Without human intervention: fails massively. With noise: degraded performance. Long duration: drift. Multi-instances same model: not applicable. |
| 12. Verdict | Structurally impossible without external verifier — Pure self-verification is an illusion. |
Reference Publications
- Huang et al. (2024) — “Large Language Models Cannot Self-Correct Reasoning Yet” — ICLR 2024. Self-correction without external feedback is illusory.
- Madaan et al. (2023) — “Self-Refine”: iterative improvement works only with external feedback.
- Shinn et al. (2023) — “Reflexion”: improvement only with a deterministic evaluator.
3.4 Pattern: Emergent Capabilities through Scaling
Complete Analysis Grid
| Field | Content |
|---|---|
| 1. Pattern name | Emergence through Scaling |
| 2. Type | Architecture (not agent-specific) |
| 3. Perceived implicit promise | By sufficiently increasing model size (parameters, data, compute), qualitatively new capabilities “emerge” discontinuously. |
| 4. Underlying technical hypothesis | Critical complexity thresholds exist beyond which the model acquires reasoning, planning, or understanding capabilities that were previously absent. |
| 5. Necessary conditions | (a) Real existence of phase transitions, (b) Observation independence from chosen metrics, (c) Robustness of emerged capabilities. |
| 6. What actually works | Scaling improves fluency, factual coverage, stylistic coherence, and reduction of trivial hallucinations. Performance on existing benchmarks increases. |
| 7. Structural limitations (LLM) | The observed “jumps” are artifacts of non-linear metrics (e.g., “pass/fail” vs continuous probability). Reasoning capabilities measured by specific benchmarks do not generalize. |
| 8. Systemic limitations | Benchmark contamination (presence in training data) creates an illusion of capability. Scaling does not modify the fundamental architecture (autoregressive, no world model). |
| 9. Typical failure modes | Regression on simple variants of “mastered” problems. Fragility to lexical perturbations. Success on benchmark, failure in real conditions. |
| 10. Hidden dependencies | The benchmark is selected to show the “jump”. Metrics are chosen post-hoc. Comparisons ignore cost (compute, data). |
| 11. Robustness test | Minimal prompt: variable. Without human intervention: partial. With perturbations: frequent failure. Long duration: stable on factual recall. Multi-instances same model: not applicable. |
| 12. Verdict | Overinterpretation / Metric artifact — “Emergent” capabilities are statistical illusions. |
Reference Publications
- Schaeffer et al. (2023) — “Are Emergent Abilities of Large Language Models a Mirage?” — NeurIPS 2023. Shows that apparent capability jumps are artifacts of non-linear metrics.
3.5 Pattern: Autonomous Multi-Agent Coordination
Complete Analysis Grid
| Field | Content |
|---|
| 1. Pattern name | Autonomous Multi-Agent Coordination |
| 2. Type | Multi-agent |
| 3. Perceived implicit promise | Multiple agents can coordinate autonomously, distribute tasks, and merge their results without rigid external orchestration. |
| 4. Underlying technical hypothesis | Agents develop implicit communication protocols and coordination mechanisms through natural language message exchange. |
| 5. Necessary conditions | (a) Mutual understanding of roles, (b) Unambiguous communication protocol, (c) Conflict detection and resolution, (d) Shared state synchronization. |
| 6. What actually works | Coordination works when the orchestrator (external code) explicitly defines flows, roles, and transition criteria. Success depends on the script, not the agents. |
| 7. Structural limitations (LLM) | Absence of reliable Theory of Mind. No explicit representation of other agents’ states. Natural language communication inherently ambiguous. |
| 8. Systemic limitations | 80% of inter-agent exchanges are redundant (Zhang et al., 2024). Information passing between agents degrades the signal (Information Bottleneck). No mechanism for “shared truth”. |
| 9. Typical failure modes | Deadlock (each agent waits for the other). Work duplication. Unresolved resource conflicts. Loss of critical information during transfers. Infinite loops. |
| 10. Hidden dependencies | The Python/JavaScript orchestrator defines the real flow. Prompts rigidly specify roles. A human supervises blockages. |
| 11. Robustness test | Minimal prompt: chaos. Without human intervention: blocking or loop. With noise/ambiguity: collapse. Long duration: semantic drift. Multi-instances same model: amplified biases. |
| 12. Verdict | Premature / Structurally limited — Real coordination is in the orchestrator, not in the agents. |
Reference Publications
- Cemri et al. (2025) — “Why Do Multi-Agent LLM Systems Fail?” — MAST taxonomy: 14 failure modes, 41-87% failure rate.
- Zhang et al. (2025) — Attributes roughly 90% of observed success to the orchestrator rather than the agents.
- Li et al. (2024) — Responsibility dilution in large multi-agent systems.
4. CROSS-CUTTING SYNTHESIS OF LIMITATIONS
4.1 Structural Limitations (Inherent to LLMs)
These limitations derive directly from the architecture of autoregressive language models and cannot be resolved by scaling or prompt engineering.
4.1.1 Absence of World Model
| Aspect | Observation |
|---|---|
| Nature | LLMs do not maintain an explicit representation of world state. Each token is predicted conditionally on the previous context, without an underlying causal model. |
| Consequence | Inability to predict action effects, simulate future states, or reason counterfactually. |
| Implication for agents | The observed “planning” is textual completion of plan-like patterns, not generation of valid plans. |
| Publications | Lopez-Paz et al. (2024), Kambhampati et al. (2024) |
4.1.2 Reasoning as Pattern-Matching
| Aspect | Observation |
|---|---|
| Nature | What appears as “reasoning” is probabilistic interpolation between patterns seen during training. |
| Consequence | Failure on out-of-distribution problems, simple lexical variations, new compositions of known concepts. |
| Implication for agents | “From scratch” reasoning is absent. The model recognizes typical solutions but does not derive them. |
| Publications | Mittal et al. (2024), Dziri et al. (2023) |
4.1.3 Constant Time per Token
| Aspect | Observation |
|---|---|
| Nature | An LLM takes essentially constant time to generate each token, regardless of the logical complexity required. |
| Consequence | Impossibility of solving problems whose complexity varies (e.g., combinatorial search, logical verification). |
| Implication for agents | NP-complete or semi-decidable problems cannot be solved by token generation. |
| Publications | Kambhampati et al. (2024), Yedidia et al. (2024) |
4.1.4 Reversal Curse
| Aspect | Observation |
|---|---|
| Nature | If the model learned “A is the father of B”, it does not automatically deduce “B is the son of A”. |
| Consequence | Logical relations are not represented bidirectionally. |
| Implication for agents | Symmetric or inverse reasoning requires explicit presence in the data. |
| Publications | Hui et al. (2024), Levy et al. (2024) |
4.2 Systemic Limitations (Inherent to Agentic Architectures)
4.2.1 Error Cascade
| Aspect | Observation |
|---|
| Nature | In a multi-agent or multi-step system, a minor error from one component propagates and amplifies. |
| Frequency | Critical — identified as major cause of failure in 80%+ of multi-agent systems. |
| Implication | The reliability of a chain of N steps is approximately (step_reliability)^N. With 90% per step, 10 steps give ~35% reliability. |
| Publications | Lin et al. (2024), Cemri et al. (2025) |
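As a sanity check, the figure above can be reproduced with a few lines of Python. This is a minimal sketch assuming independent, identically reliable steps; the 90% / 10-step case gives 0.9^10 ≈ 0.35.

```python
# Illustrative calculation of end-to-end reliability for a chain of N steps,
# assuming independent failures and identical per-step reliability.
def chain_reliability(per_step: float, n_steps: int) -> float:
    return per_step ** n_steps

for r in (0.99, 0.95, 0.90):
    print(f"per-step={r:.2f} -> 10 steps: {chain_reliability(r, 10):.2%}")
# per-step=0.90 -> 10 steps: ~34.87%
```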
4.2.2 Responsibility Dilution
| Aspect | Observation |
|---|---|
| Nature | In large multi-agent systems, no agent is “responsible” for the final result, creating waiting loops. |
| Consequence | Blockages, non-decisions, infinite responsibility passing. |
| Publications | Li et al. (2024) |
4.2.3 Error Homogeneity
| Aspect | Observation |
|---|---|
| Nature | If all agents use the same base model (or similar models), their errors are correlated. |
| Consequence | Majority voting or cross-verification does not correct shared systematic biases. |
| Publications | Schwartz et al. (2024), Pärnamaa et al. (2024) |
4.2.4 Exponential Cost
| Aspect | Observation |
|---|---|
| Nature | Multi-agent architectures consume 5x to 500x more tokens for marginal gains (<5%). |
| Consequence | Economic non-viability for most use cases. |
| Publications | Bansal et al. (2024), Zhou et al. (2024) |
4.3 Summary Table: The Glass Ceiling of Scaling
| Capability | Scaling Impact | Identified Ceiling |
|---|---|---|
| Factual knowledge | Significant improvement | Limited by data exhaustion |
| Linguistic fluency | Improvement | Nearly saturated |
| Stylistic coherence | Improvement | Nearly saturated |
| Logical reasoning | Marginal improvement | ~85-90% on controlled benchmarks |
| Autonomous planning | No structural improvement | Architectural ceiling |
| Causality | No improvement | Absent from architecture |
| Robustness to perturbations | No improvement | Intrinsic fragility |
| Self-verification | No improvement | Impossible by design |
5. LIST OF RECURRENT ILLUSIONS
This section lists patterns that are regularly presented as acquired capabilities but which, upon analysis, are illusions or overinterpretations.
5.1 Illusion: The Agent “Understands” the Task
| Aspect | Reality |
|---|---|
| Appearance | The agent produces a coherent and relevant response. |
| Actual mechanism | Pattern-matching on similar tasks seen during training. |
| Falsification test | Slightly modify the formulation or entity names -> collapse. |
| Reference | Valmeekam et al. (2024) — PlanBench: variable renaming. |
5.2 Illusion: Multi-Agent is More Intelligent than Single-Agent
| Aspect | Reality |
|---|---|
| Appearance | The multi-agent system solves complex problems. |
| Actual mechanism | The orchestrator (Python/JS code) defines the real logic. Agents are text generators in a predefined workflow. |
| Falsification test | Replace LLM calls with templates -> similar results on structured tasks. |
| Reference | Zhang et al. (2025) — 90% of success in the orchestrator. |
5.3 Illusion: The Agent Learns from Its Mistakes
| Aspect | Reality |
|---|---|
| Appearance | After several attempts, the agent produces a correct answer. |
| Actual mechanism | External feedback (compiler error, test result) guides correction. Without feedback, no learning. |
| Falsification test | Remove external feedback -> no convergence. |
| Reference | Huang et al. (2024), Shinn et al. (2023). |
5.4 Illusion: Debate Improves Accuracy
| Aspect | Reality |
|---|---|
| Appearance | After discussion between agents, the final answer is better. |
| Actual mechanism | If the correct answer is in training data, debate can “surface” it. Otherwise, consensus on an error. |
| Falsification test | Test on truly new problems -> no improvement. |
| Reference | Liang et al. (2023), Du et al. (2024). |
5.5 Illusion: Capabilities Emerge with Scaling
| Aspect | Reality |
|---|---|
| Appearance | From a certain size, the model suddenly “acquires” a capability. |
| Actual mechanism | Artifact of metric choice (binary vs continuous). Continuous curves show gradual improvement, no jump. |
| Falsification test | Use continuous metrics -> “jump” disappears. |
| Reference | Schaeffer et al. (2023). |
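The mechanism behind this falsification test can be illustrated with a small, purely synthetic simulation (all numbers and the logistic curve are illustrative assumptions, not measurements): if per-token accuracy improves smoothly with scale, an all-or-nothing exact-match metric over a 30-token answer still shows an apparent “jump”.

```python
import math

# Hypothetical smooth per-token accuracy as a function of (log) model scale.
def per_token_accuracy(log_scale: float) -> float:
    return 1 / (1 + math.exp(-(log_scale - 10)))  # smooth logistic curve, illustrative only

SEQ_LEN = 30  # exact match requires all 30 tokens to be correct

for log_scale in range(6, 15):
    p = per_token_accuracy(log_scale)
    exact_match = p ** SEQ_LEN  # nonlinear metric: stays near 0, then rises sharply
    print(f"scale={log_scale:2d}  per-token={p:.3f}  exact-match={exact_match:.3f}")
```

The continuous per-token curve rises gradually; only the binary exact-match metric produces the appearance of a discontinuous “emergence”.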
5.6 Illusion: The Agent is Autonomous
| Aspect | Reality |
|---|---|
| Appearance | The agent accomplishes a task “end-to-end”. |
| Actual mechanism | The prompt engineer optimized the instructions. Failure cases are filtered in demos. A human validates behind the scenes. |
| Falsification test | Deploy without supervision -> 41-87% failure rate (Cemri et al., 2025). |
| Reference | Horton (2023), Luo et al. (2024). |
5.7 Illusion: RAG “Understands” Documents
| Aspect | Reality |
|---|---|
| Appearance | The agent responds correctly by citing sources. |
| Actual mechanism | Vector similarity + conditioned generation. No logical understanding of the document. |
| Falsification test | Insert contradictory passages -> the agent cites both without reconciling them. |
| Reference | Pradeep et al. (2024), Liu et al. (2024). |
5.8 Illusion: The Agent Plans
| Aspect | Reality |
|---|---|
| Appearance | The agent produces a sequence of steps that looks like a plan. |
| Actual mechanism | Text completion in plan format. No validity verification, no simulation. |
| Falsification test | Request a plan for an invented domain -> “coherent” but unfeasible plan. |
| Reference | Kambhampati et al. (2024). |
6. REALISTIC DESIGN PRINCIPLES
6.1 Principle 1: Mandatory Closed Loop
An agent can only improve its performance if an external and deterministic verifier provides usable feedback.
Implementation:
- Always couple the agent with a compiler, interpreter, automated test, or simulator.
- Feedback must be binary (success/failure) or contain structured error information.
- Never rely on the agent’s self-evaluation.
Viable examples:
- Code + Automated unit tests
- SQL + Schema validation
- Robot + Physical simulator
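A minimal sketch of such a closed loop, assuming a hypothetical llm_generate call (to be replaced by any provider SDK) and using pytest as the external deterministic verifier:

```python
import pathlib
import subprocess
import tempfile

# Hypothetical model call: replace with your provider's SDK. The name and
# signature are assumptions made for this sketch, not any specific API.
def llm_generate(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def run_tests(code: str, tests: str) -> tuple[bool, str]:
    """Deterministic verifier: write the candidate code and its tests, then run pytest."""
    workdir = pathlib.Path(tempfile.mkdtemp())
    (workdir / "solution.py").write_text(code)
    (workdir / "test_solution.py").write_text(tests)
    result = subprocess.run(
        ["pytest", "-q", "test_solution.py"],
        cwd=workdir, capture_output=True, text=True, timeout=120,
    )
    return result.returncode == 0, result.stdout + result.stderr

def closed_loop(task: str, tests: str, max_iters: int = 3) -> str | None:
    feedback = ""
    for _ in range(max_iters):
        code = llm_generate(f"{task}\n\nPrevious verifier output:\n{feedback}")
        ok, feedback = run_tests(code, tests)
        if ok:        # the success signal comes from pytest, never from the model
            return code
    return None       # after max_iters, escalate to a human rather than trusting self-evaluation
```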
6.2 Principle 2: Limited and Explicit Scope
Each agent must operate in a narrow, well-defined domain where its patterns are overrepresented in training data.
Implementation:
- Explicitly define the boundaries of the action domain.
- Refuse out-of-domain tasks rather than attempting generalization.
- Use multiple specialized agents rather than one “generalist” agent.
Anti-pattern to avoid:
- “Universal agent” that does everything
- Vague prompt like “Solve this problem”
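A minimal sketch of explicit scope gating; the intent labels and the classify_intent helper are hypothetical placeholders:

```python
# Scope gating: the agent is only invoked for tasks matching a whitelist of
# supported intents; everything else is refused rather than attempted.
SUPPORTED_INTENTS = {"sql_generation", "code_translation", "schema_extraction"}

def classify_intent(task: str) -> str:
    # In practice this could be a rule-based router or a constrained classifier.
    task = task.lower()
    if "sql" in task:
        return "sql_generation"
    if "translate" in task and "code" in task:
        return "code_translation"
    return "unknown"

def handle(task: str) -> str:
    intent = classify_intent(task)
    if intent not in SUPPORTED_INTENTS:
        return "REFUSED: task is outside this agent's declared scope."
    return f"DISPATCH: {intent}"  # call the specialised agent for this intent
```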
6.3 Principle 3: Explicit Orchestration
In a multi-agent system, all coordination logic must be in the orchestrator (code), not in prompts.
Implementation:
- The workflow is defined in code (Python, etc.), not in natural language.
- Transitions between states are deterministic.
- Agents are “functions” called by the orchestrator, not autonomous entities.
Corollary:
- “Multi-agent” is often a disguised workflow. Assume this reality rather than masking it.
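A minimal sketch of this principle, with placeholder agent functions standing in for single constrained LLM calls; the state fields and agent names are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class TicketState:
    text: str
    category: str | None = None
    draft: str | None = None

def classify_agent(state: TicketState) -> TicketState:
    state.category = "billing"   # placeholder for one constrained LLM call
    return state

def draft_agent(state: TicketState) -> TicketState:
    state.draft = f"[{state.category}] draft reply to: {state.text[:40]}"  # placeholder LLM call
    return state

def orchestrate(ticket: str) -> TicketState:
    state = TicketState(text=ticket)
    state = classify_agent(state)   # step 1: the order is fixed in code, not negotiated by agents
    if state.category is None:
        raise ValueError("classification failed -> escalate to a human")
    return draft_agent(state)       # step 2: deterministic transition
```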
6.4 Principle 4: Human Supervision Beyond 85%
To achieve reliability above 85-90%, invest in human supervision, not model augmentation.
Implementation:
- Define checkpoints where a human validates critical decisions.
- Provide an “escalation” mode to a human operator.
- Measure true cost (human + compute) and not just API cost.
Economic reality:
- The cost of human supervision for the last 10% is often lower than the cost of a complex architecture that fails unpredictably.
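A minimal sketch of such a checkpoint, assuming a hypothetical request_human_review routing function and a confidence score produced upstream:

```python
CONFIDENCE_THRESHOLD = 0.85   # beyond this band, human review is cheaper than more architecture

def request_human_review(item: dict) -> dict:
    # Hypothetical routing to a review queue (ticketing system, labeling UI, ...).
    raise NotImplementedError("route to a human review queue")

def finalize(item: dict) -> dict:
    verified = item.get("verifier_passed", False)      # signal from an external verifier
    confident = item.get("confidence", 0.0) >= CONFIDENCE_THRESHOLD
    if verified and confident:
        return item                                    # automated path for the "easy" 85-90%
    return request_human_review(item)                  # escalate the remaining cases
```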
6.5 Principle 5: Documentation of Hidden Dependencies
Any system presented as “autonomous” must make explicit its implicit human dependencies.
Mandatory checklist:
- Which prompts were hand-optimized, and by whom?
- Which few-shot examples are embedded in the prompts?
- Where does a human validate or filter outputs (including in demos)?
- What coordination logic lives in the orchestrator code rather than in the agents?
- How were the target domain and the evaluation data selected?
6.6 Principle 6: Realistic Metrics
Measure performance in real conditions, not on contaminated benchmarks.
Implementation:
- Use truly novel test data (post-training cutoff).
- Include costs (tokens, latency, human supervision) in metrics.
- Measure robustness to perturbations, not just nominal performance.
Metrics to avoid:
- Accuracy on public benchmark (contamination)
- “Success rate” without success definition
- Cherry-picked comparisons
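A minimal sketch of a perturbation-robustness harness; agent_solve, checker, and the synonym substitutions are illustrative assumptions:

```python
import random

# Evaluate the same agent on lightly perturbed variants of each task,
# not only on the nominal phrasing.
def perturb(task: str, rng: random.Random) -> str:
    synonyms = {"compute": "calculate", "list": "enumerate", "find": "determine"}
    words = [synonyms.get(w, w) if rng.random() < 0.3 else w for w in task.split()]
    return " ".join(words)

def robustness(agent_solve, tasks, checker, n_variants: int = 5, seed: int = 0) -> float:
    rng = random.Random(seed)
    scores = []
    for task, expected in tasks:
        variants = [task] + [perturb(task, rng) for _ in range(n_variants)]
        scores.append(sum(checker(agent_solve(v), expected) for v in variants) / len(variants))
    return sum(scores) / len(scores)   # average accuracy across nominal and perturbed forms
```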
6.7 Summary Table: What Works vs. What Doesn’t Work
| Works | Doesn’t Work |
|---|---|
| Code generation with automated tests | Code generation without validation |
| Translation between formal languages | Logical reasoning “from scratch” |
| Completion in a narrow domain | Universal agent |
| RAG with verifiable sources | RAG without relevance verification |
| Explicitly coded orchestration | Emergent coordination |
| Self-debugging with compiler | Self-correction without feedback |
| Structured extraction to defined schema | “Deep” document understanding |
7. REFERENCE BIBLIOGRAPHIC CORPUS
7.1 Key Publications (Top 20)
| # | Reference | Category | Main Contribution |
|---|---|---|---|
| 1 | Kambhampati et al. (2024) - “LLMs Can’t Plan” | C | Formal proof of LLMs’ inability to plan autonomously. LLM-Modulo framework. |
| 2 | Cemri et al. (2025) - “Why Do Multi-Agent LLM Systems Fail?” | D | MAST taxonomy: 14 failure modes, 1600+ annotated traces, 41-87% failure rate. |
| 3 | Valmeekam et al. (2023-2025) - PlanBench Series | D | Benchmark showing performance collapse with lexical perturbations. |
| 4 | Schaeffer et al. (2023) - “Emergence or Metrics?” | C | Demonstration that “emergent” capabilities are metric artifacts. |
| 5 | Huang et al. (2024) - “Self-Correction Fallacy” | C | Proof that self-correction without external feedback is illusory. |
| 6 | Dziri et al. (2023) - “Faith and Fate” | C | Multi-step reasoning degrades because transformers rely on probabilistic pattern matching rather than systematic computation. |
| 7 | Madaan et al. (2023) - “Self-Refine” | B | Success conditions for self-improvement: external feedback required. |
| 8 | Shinn et al. (2023) - “Reflexion” | B | Improvement only with deterministic evaluator. |
| 9 | Liang et al. (2023) - Multi-agent Debate | D | Debate only improves if the solution is memorized. |
| 10 | Park et al. (2023) - “Generative Agents” | A/B | Viable memory/planning architecture for narrative simulation, not problem solving. |
| 11 | Yao et al. (2023) - “ReAct” | A | Effective reasoning-action coupling but sensitive to feedback noise. |
| 12 | Schick et al. (2023) - “Toolformer” | A | Demonstration that tool use is possible but remains text completion. |
| 13 | Wu et al. (2023) - “AutoGen” | B | Coordination for scripted tasks, failure on semantic unexpected events. |
| 14 | Zhou et al. (2024) - “Code-as-Policy” | B | Superiority of deterministic approaches over pure LLM reasoning. |
| 15 | Stechly et al. (2024) - “Backtracking Failure” | D | Impossibility of systematic backtracking. |
| 16 | Gandhi et al. (2024) - “Theory of Mind Gap” | C | Agents fail to model other agents’ beliefs. |
| 17 | Liu et al. (2023) - “Lost in the Middle” | C | Access to relevant information degrades when it sits in the middle of a long context. |
| 18 | Toyer et al. (2024) - “Tensor Trust” | D | Ease of bypassing defenses through semantic jailbreak. |
| 19 | Greshake et al. (2024) - “Indirect Injection” | C | Vulnerability to hidden instructions in third-party content. |
| 20 | Kaplan et al. (2024) - “Revised Scaling Laws” | B | The marginal cost of reasoning improvement becomes prohibitive. |
7.2 Detailed Source Classification
Category A: Demonstrated and Robust Capability
| Reference | What is demonstrated | Validity conditions |
|---|---|---|
| Yao et al. (2023) - ReAct | Thought-action coupling on web tasks | Usable feedback, limited domain |
| Schick et al. (2023) - Toolformer | Autonomous learning of API usage | Well-documented APIs, simple tasks |
| Rozière et al. (2023) - CodeLlama | High-quality code completion | Popular languages, local context |
| Zheng et al. (2024) - Unit tests | Test generation with Pytest loop | Deterministic framework feedback |
Category B: Conditional / Fragile Capability
| Reference | What works partially | Fragility point |
|---|---|---|
| Madaan et al. (2023) - Self-Refine | Iterative improvement with feedback | Without external feedback: failure |
| Shinn et al. (2023) - Reflexion | Learning through reflection | Requires deterministic evaluator |
| Park et al. (2023) - Generative Agents | Coherent social simulation | Fails on problem solving |
| Wu et al. (2023) - AutoGen | Scripted multi-agent coordination | Fails on unexpected events |
| Zhou et al. (2024) - Code-as-Policy | Plan execution in code | Limited to codifiable domains |
Category C: Identified Structural Limitation
Category D: Documented Failure
Category E: Premature Promise / Overinterpretation
7.3 Publications <-> Patterns Mapping
| Pattern | Reference publications |
|---|---|
| Autonomous planning | Kambhampati (2024), Valmeekam (2023-2025), Stechly (2024) |
| Self-correction | Huang (2024), Madaan (2023), Shinn (2023), Liu (2024) |
| Multi-agent debate | Liang (2023), Du (2024), Gandhi (2024) |
| Multi-agent coordination | Cemri (2025), Zhang (2025), Li (2024), Nguyen (2024) |
| Emergent capabilities | Schaeffer (2023), Zhu (2025), Gudibande (2024) |
| Tool-use | Yao (2023), Schick (2023), Patil (2023), Qin (2024) |
| Long-term memory | Liu (2023) |
| Agent security | Greshake (2024), Toyer (2024) |
7.4 Identified Evidence Gaps
The following domains lack robust empirical evidence despite frequent claims:
| Domain | Common claim | Evidence status |
|---|---|---|
| Causal reasoning | “The model understands causal relations” | No positive evidence |
| Authentic creativity | “The model generates truly new ideas” | Not falsifiable with current metrics |
| Deep understanding | “The model understands text meaning” | Operationally indistinguishable from pattern-matching |
| Continuous learning | “The agent improves with experience” | Accumulation, not generalization |
| Self-awareness | “The model knows what it doesn’t know” | Imperfect calibration, not metacognition |
8. APPENDICES
8.1 Pattern Evaluation Checklist
For any presented agentic pattern, apply this checklist:
- [ ] 1. Does the pattern remain valid if the prompt is reduced to essentials?
- [ ] 2. Does it work without implicit human intervention?
- [ ] 3. Does it resist noise or ambiguity in input?
- [ ] 4. Does it hold over time (beyond a short session)?
- [ ] 5. Does it remain valid when multiple instances of the same model interact?
- [ ] 6. Are results reproducible with other models of the same class?
- [ ] 7. Is the benchmark used free from contamination?
- [ ] 8. Do metrics capture real task success?
- [ ] 9. Is total cost (tokens, latency, supervision) viable?
- [ ] 10. Are human dependencies explicitly documented?
VERDICT:
- 10/10 -> Demonstrated robust capability (rare)
- 7-9/10 -> Conditional capability (document conditions)
- 4-6/10 -> Fragile pattern (don't promise in production)
- 0-3/10 -> Illusion or premature promise
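The verdict bands above can be expressed directly as a small scoring helper (a sketch mirroring the thresholds in this checklist):

```python
# Map the number of checklist items that pass to the verdict bands above.
def verdict(checks: list[bool]) -> str:
    assert len(checks) == 10, "the checklist has exactly 10 items"
    score = sum(checks)
    if score == 10:
        return "Demonstrated robust capability (rare)"
    if score >= 7:
        return "Conditional capability (document conditions)"
    if score >= 4:
        return "Fragile pattern (don't promise in production)"
    return "Illusion or premature promise"
```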
8.2 Technical Glossary
| Term | Operational definition |
|---|---|
| Agent | Software system combining an LLM with a perception-reasoning-action loop |
| Orchestrator | Code (non-LLM) that defines workflow and coordinates agents |
| Scaling | Increase in parameters, data, or compute |
| Emergence | Appearance of qualitatively new capabilities (subject to controversy) |
| Pattern-matching | Identification of similarities with training data |
| World model | Explicit internal representation of world state and dynamics |
| Feedback loop | Cycle where action output is used to modify the next action |
| Ground truth | Correct reference value for evaluating a prediction |
| Contamination | Presence of test data in training data |
| Sycophancy | Tendency to modify one’s answer to satisfy the interlocutor |
8.3 Complete Bibliographic References
Academic Publications
- Kambhampati, S., Valmeekam, K., Guan, L., et al. (2024). “LLMs Can’t Plan, But Can Help Planning in LLM-Modulo Frameworks.” Proceedings of ICML 2024, 235.
- Cemri, M., Pan, M. Z., Yang, S., et al. (2025). “Why Do Multi-Agent LLM Systems Fail?” arXiv.13657.
- Valmeekam, K., Marquez, M., Sreedharan, S., & Kambhampati, S. (2023). “On the Planning Abilities of Large Language Models—A Critical Investigation.” NeurIPS 2023.
- Schaeffer, R., Miranda, B., & Koyejo, S. (2023). “Are Emergent Abilities of Large Language Models a Mirage?” NeurIPS 2023.
- Huang, J., Shao, Z., et al. (2024). “Large Language Models Cannot Self-Correct Reasoning Yet.” ICLR 2024.
- Dziri, N., Lu, X., Sclar, M., et al. (2023). “Faith and Fate: Limits of Transformers on Compositionality.” NeurIPS 2023.
- Madaan, A., Tandon, N., et al. (2023). “Self-Refine: Iterative Refinement with Self-Feedback.” NeurIPS 2023.
- Shinn, N., Cassano, F., et al. (2023). “Reflexion: Language Agents with Verbal Reinforcement Learning.” NeurIPS 2023.
- Yao, S., Zhao, J., et al. (2023). “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR 2023.
- Park, J. S., O’Brien, J., et al. (2023). “Generative Agents: Interactive Simulacra of Human Behavior.” UIST 2023.
- Liang, T., He, Z., et al. (2023). “Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate.” arXiv.19118.
- Gandhi, K., et al. (2024). “Understanding Social Reasoning in Language Models with Language Models.” NeurIPS 2024.
- Liu, N. F., Lin, K., et al. (2023). “Lost in the Middle: How Language Models Use Long Contexts.” TACL 2024.
- Greshake, K., et al. (2024). “Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection.” AISec 2023.
- Stechly, K., Marquez, M., & Kambhampati, S. (2024). “GPT-4 Doesn’t Know It’s Wrong: An Analysis of Iterative Prompting for Reasoning Problems.” NeurIPS FM4DM Workshop.
9. CONCLUSION
This meta-analysis establishes a critical framework for evaluating the real capabilities of LLM-based AI agents. The main findings are:
- Limitations are structural, not circumstantial: The autoregressive architecture of LLMs imposes performance ceilings that scaling alone cannot exceed.
- Multi-agent is not a solution to single-agent limitations: Multi-agent systems inherit their components’ limitations and add their own failure modes (coordination, error cascade).
- Autonomy is a carefully maintained illusion: “Autonomous” systems depend on optimized prompts, coded orchestrators, and implicit human supervision.
- Real successes are in deterministic feedback domains: Code with tests, SQL with validation, robotics with simulator — closed loops work.
- Reliability beyond 85-90% requires humans: For critical cases, human supervision remains more effective and economical than architectural augmentation.
This analysis does not aim to discourage AI agent development, but to establish realistic expectations and robust design principles. AI agents are powerful tools when used within their validity domains, with clear awareness of their limitations.
End of document