
Single and Multi-Agent Systems Based on LLMs
1. Introduction
1.1 Analysis Objective
This analysis identifies, classifies, and evaluates patterns related to AI agents built on large language models (LLMs) that appear promising but whose feasibility remains structurally limited with current architectures. The goal is to provide a critical, falsifiable, and non-speculative framework for evaluating the real capabilities of agentic systems.
1.2 Scope
Included:
- LLM-based agents (GPT-4, Claude, Llama, etc.)
- Single-agent and multi-agent architectures
- Orchestration, memory, tools, and supervision systems
- Observable and reproducible cases documented in the literature
Strictly excluded:
- Speculative Artificial General Intelligence (AGI)
- Marketing claims without empirical data
- Anthropomorphic descriptions of capabilities
- Untestable hypotheses
1.3 Operational Definitions
AI Agent: LLM-based system capable of perceiving a state, producing local reasoning (token generation), and triggering actions via explicit orchestration (code, API, tools).
Multi-Agent System: Set of agents coordinated by an explicit protocol. Multi-agent does not imply any emergent collective intelligence by default; any appearance of superior coordination comes from the orchestrator or communication protocol.
Scaling: Increase in model parameters, training data, or inference compute.
1.4 Summary of Key Findings
Critical Observations
- Scaling does not solve structural limitations: Increasing model size improves factual knowledge and linguistic fluency but does not correct the absence of autonomous planning, causal reasoning, or deep logical understanding (Kambhampati et al., 2024; Valmeekam et al., 2025).
- Multi-agent systems fail in 41-87% of cases: The MAST study (Cemri et al., 2025) identifies 14 distinct failure modes, 79% of which stem from specification and coordination problems, not technical infrastructure limitations.
- Self-correction without external feedback is illusory: Agents cannot detect their own errors without an external deterministic verifier (compiler, test, oracle). Self-criticism increases confidence without improving accuracy (Huang et al., 2024; Stechly et al., 2024).
- Multi-agent does not outperform single-agent on most benchmarks: Performance gains are marginal and often inferior to simple approaches like best-of-N sampling (Kapoor et al., 2024; Wang et al., 2024).
- “Emergent capabilities” are metric artifacts: Apparent qualitative jumps during scaling result from non-linear metric choices, not real cognitive phase transitions (Schaeffer et al., 2023).
Viable vs. Premature Patterns
| Category | Viability | Examples |
|---|---|---|
| Functional | Tasks where feedback is deterministic | Unit tests, code translation, SQL |
| Fragile | Dependency on specific prompts | RAG, self-consistency, ReAct |
| Premature | Promise without robustness proof | Universal agent, autonomous planning |
| Structurally impossible | Contradiction with architecture | Self-verification, intrinsic causal reasoning |
1.5 Design Recommendations
- Prefer closed loops: Agentic success requires an external deterministic verifier (compiler, simulator, automated test).
- Limit scope per agent: Performant agents operate in narrow, well-defined domains, not as “generalists”.
- Treat multi-agent as orchestration, not collective intelligence: The real “locus of decision” is the orchestrator (often Python code), not the agents themselves.
- Assume 85-90% as reliability ceiling: For the last 10%, invest in human supervision rather than model augmentation.
- Document hidden dependencies: Any “autonomous” system must make explicit its implicit human dependencies (prompt engineering, data selection, validation).
2. SUMMARY TABLE OF ANALYZED PATTERNS
Legend of Categories
| Code | Meaning |
|---|---|
| A | Demonstrated and robust capability |
| B | Conditional / fragile capability |
| C | Identified structural limitation |
| D | Documented failure |
| E | Premature promise / overinterpretation |
2.1 Planning and Reasoning Patterns
| # | Pattern | Type | Category | Verdict |
|---|---|---|---|---|
| 1 | Autonomous planning | Single | C/D | Structurally impossible |
| 2 | Self-verification | Single | C | Structurally impossible |
| 3 | Causal reasoning | Single | C | Premature |
| 4 | Chain-of-Thought | Single | B | Fragile |
| 5 | Automatic backtracking | Single | D | Structurally impossible |
| 6 | Iterative reflection | Single | B/C | Fragile |
| 7 | Self-Consistency | Single | B | Achievable with reservations |
| 8 | Tree-of-Thought | Single | B | Fragile |
2.2 Multi-Agent Patterns
| # | Pattern | Type | Category | Verdict |
|---|---|---|---|---|
| 9 | Multi-agent debate | Multi | D/E | Premature |
| 10 | Emergent collective intelligence | Multi | E | Illusion |
| 11 | Role specialization | Multi | B/C | Fragile |
| 12 | Autonomous coordination | Multi | D | Premature |
| 13 | Self-organization | Multi | D | Structurally impossible |
| 14 | Cross-verification | Multi | C/D | Fragile |
| 15 | Multi-agent consensus | Multi | D | Illusion |
2.3 Memory and Context Patterns
| # | Pattern | Type | Category | Verdict |
|---|---|---|---|---|
| 16 | Autonomous long-term memory | Single/Multi | C | Premature |
| 17 | RAG (Retrieval-Augmented) | Single | A/B | Achievable with reservations |
| 18 | Extended context (>100k tokens) | Single | B/C | Fragile |
| 19 | Autonomous memory update | Single | D | Premature |
2.4 Tool-Use Patterns
| # | Pattern | Type | Category | Verdict |
|---|---|---|---|---|
| 20 | Simple tool-use | Single | A | Achievable |
| 21 | Sequential tool-use (>3 tools) | Single | B/C | Fragile |
| 22 | Self-debugging with compiler | Single | A/B | Achievable |
| 23 | Code-as-Policy | Single | A | Achievable |
| 24 | Unknown tool usage | Single | D | Premature |
2.5 Scaling Patterns
| # | Pattern | Type | Category | Verdict |
|---|---|---|---|---|
| 25 | Scaling improves reasoning | - | E | Overinterpretation |
| 26 | Emergence through scaling | - | E | Metric artifact |
| 27 | Universal agent through scaling | - | E | Illusion |
| 28 | Reliability through redundancy | Multi | D | Fragile |
3. DETAILED ANALYSIS OF CRITICAL PATTERNS
3.1 Pattern: Autonomous Planning
Complete Analysis Grid
| Field | Content |
|---|---|
| 1. Pattern name | Autonomous Planning |
| 2. Type | Single-agent |
| 3. Perceived implicit promise | An LLM agent can decompose a complex objective into sub-steps, schedule these steps, and execute them autonomously until achieving the objective. |
| 4. Underlying technical hypothesis | The model has internalized, through training on human text, sufficient representations of causality and sequential logic to generate valid plans. |
| 5. Necessary conditions | (a) Ability to predict action effects, (b) Ability to backtrack when blocked, (c) Maintaining a coherent world model, (d) Distinction between current state and target state. |
| 6. What actually works | The model can generate action sequences that look like valid plans on domains frequent in training data. The textual form of a plan is often correct. |
| 7. Structural limitations (LLM) | LLMs are autoregressive systems with constant time per token. They cannot perform search in a state space. Token generation is not conditioned on logical validity verification. |
| 8. Systemic limitations | No internal mechanism for plan coherence verification. No explicit representation of action preconditions and effects. |
| 9. Typical failure modes | Invalid plans (impossible actions in current state), incomplete plans (forgotten sub-objectives), no recovery on step failure, circular dependencies between steps. |
| 10. Hidden dependencies | The prompt often contains examples of valid plans (few-shot). Humans implicitly validate plan feasibility. Tested domains are overrepresented in data. |
| 11. Robustness test | Minimal prompt: fails. Without human intervention: fails. With noise/ambiguity: fails. Long duration: fails. Multi-instances same model: not applicable. |
| 12. Verdict | Structurally impossible — Autonomous planning contradicts the fundamental architecture of autoregressive LLMs. |
Reference Publications
- Kambhampati et al. (2024) — “LLMs Can’t Plan, But Can Help Planning in LLM-Modulo Frameworks” — ICML 2024. Formally demonstrates that autoregressive LLMs cannot plan by themselves.
- Valmeekam et al. (2023) — PlanBench series. Shows that planning performance collapses with simple variable renamings.
- Stechly et al. (2024) — Analysis of backtracking failure.
3.2 Pattern: Multi-Agent Systems with Debate
Complete Analysis Grid
| Field | Content |
|---|---|
| 1. Pattern name | Multi-Agent Debate / Discussion |
| 2. Type | Multi-agent |
| 3. Perceived implicit promise | Multiple LLM agents, by confronting their responses, mutually correct their errors and converge toward a more accurate answer than a single agent. |
| 4. Underlying technical hypothesis | Response diversity + an arbitration mechanism allows filtering individual errors (similar to bagging in ML or majority voting). |
| 5. Necessary conditions | (a) Independence of errors between agents, (b) Capacity for constructive criticism, (c) Ability to distinguish a valid argument from a persuasive one, (d) Absence of shared systematic bias. |
| 6. What actually works | Debate improves results only when the correct answer is already “accessible” via training data (distributed memorization). |
| 7. Structural limitations (LLM) | All agents use the same model or similar models (homogeneity). Errors are correlated, not independent. Conformity bias pushes agents to align with the first response. |
| 8. Systemic limitations | No mechanism for objective truth. The most “persuasive” agent (verbose, confident) wins, not the most “correct”. Absence of ground truth prevents convergence toward truth. |
| 9. Typical failure modes | Consensus on a false answer (echo chamber). Error amplification through mutual validation. Infinite discussion loops without convergence. “Dominant” agent imposing its answer. |
| 10. Hidden dependencies | The orchestrator defines debate rules (who speaks when, end criteria). The initial prompt frames the debate. A human often selects the final answer. |
| 11. Robustness test | Minimal prompt: fails. Without human intervention: fails. With noise/ambiguity: fails severely. Long duration: degradation. Multi-instances same model: exacerbates problems. |
| 12. Verdict | Premature / Illusion — Multi-agent debate does not provide real collective intelligence. |
Reference Publications
- Liang et al. (2023) — Multi-agent debate: improvement only when the correct solution is already memorized.
- Du et al. (2024) — Multi-agent debate.
- Gandhi et al. (2024) — “Theory of Mind Gap”: agents fail to model other agents’ beliefs.
3.3 Pattern: Self-Correction / Self-Refinement
Complete Analysis Grid
| Field | Content |
|---|---|
| 1. Pattern name | Self-Correction (Self-Refinement, Reflexion) |
| 2. Type | Single-agent |
| 3. Perceived implicit promise | An agent can detect its own errors, criticize them, and correct them iteratively until producing a valid answer. |
| 4. Underlying technical hypothesis | The model possesses a “meta-cognitive capacity” allowing it to evaluate the quality of its own outputs. |
| 5. Necessary conditions | (a) Error detection capability, (b) Cause diagnosis capability, (c) Appropriate correction generation capability, (d) Reliable stopping criterion. |
| 6. What actually works | Self-correction works if and only if an external verifier provides usable feedback (e.g., compiler error message, automated test result). |
| 7. Structural limitations (LLM) | The model uses the same weights to generate and to critique. Confirmation bias pushes to validate its own answer. No distinct representation of “production” vs “evaluation”. |
| 8. Systemic limitations | Without external signal, the model has no way to distinguish a correct answer from an incorrect but plausible one. The generated “critique” is itself subject to the same biases. |
| 9. Typical failure modes | Validation of a false answer as correct. Changing a correct answer to an incorrect one (sycophancy). Infinite “correction” loops without improvement. Critique without corrective action. |
| 10. Hidden dependencies | The prompt structure induces the “form” of self-criticism. Few-shot examples show how to critique. Humans often validate the final result. |
| 11. Robustness test | Minimal prompt: fails. Without human intervention: fails massively. With noise: degraded performance. Long duration: drift. Multi-instances same model: not applicable. |
| 12. Verdict | Structurally impossible without external verifier — Pure self-verification is an illusion. |
Reference Publications
- Huang et al. (2024) — “Large Language Models Cannot Self-Correct Reasoning Yet” — ICLR 2024. Self-correction without external feedback is illusory.
- Madaan et al. (2023) — “Self-Refine”: iterative improvement works only with external feedback.
- Shinn et al. (2023) — “Reflexion”: improvement only with a deterministic evaluator.
3.4 Pattern: Emergent Capabilities through Scaling
Complete Analysis Grid
| Field | Content |
|---|---|
| 1. Pattern name | Emergence through Scaling |
| 2. Type | Architecture (not agent-specific) |
| 3. Perceived implicit promise | By sufficiently increasing model size (parameters, data, compute), qualitatively new capabilities “emerge” discontinuously. |
| 4. Underlying technical hypothesis | Critical complexity thresholds exist beyond which the model acquires reasoning, planning, or understanding capabilities that were previously absent. |
| 5. Necessary conditions | (a) Real existence of phase transitions, (b) Observation independence from chosen metrics, (c) Robustness of emerged capabilities. |
| 6. What actually works | Scaling improves fluency, factual coverage, stylistic coherence, and reduction of trivial hallucinations. Performance on existing benchmarks increases. |
| 7. Structural limitations (LLM) | The observed “jumps” are artifacts of non-linear metrics (e.g., “pass/fail” vs continuous probability). Reasoning capabilities measured by specific benchmarks do not generalize. |
| 8. Systemic limitations | Benchmark contamination (presence in training data) creates an illusion of capability. Scaling does not modify the fundamental architecture (autoregressive, no world model). |
| 9. Typical failure modes | Regression on simple variants of “mastered” problems. Fragility to lexical perturbations. Success on benchmark, failure in real conditions. |
| 10. Hidden dependencies | The benchmark is selected to show the “jump”. Metrics are chosen post-hoc. Comparisons ignore cost (compute, data). |
| 11. Robustness test | Minimal prompt: variable. Without human intervention: partial. With perturbations: frequent failure. Long duration: stable on factual recall. Multi-instances same model: not applicable. |
| 12. Verdict | Overinterpretation / Metric artifact — “Emergent” capabilities are statistical illusions. |
Reference Publications
- Schaeffer et al. (2023) — “Are Emergent Abilities of Large Language Models a Mirage?” — NeurIPS 2023. Shows that apparent capability jumps are artifacts of non-linear metrics.
3.5 Pattern: Autonomous Multi-Agent Coordination
Complete Analysis Grid
| Field | Content |
|---|
| 1. Pattern name | Autonomous Multi-Agent Coordination |
| 2. Type | Multi-agent |
| 3. Perceived implicit promise | Multiple agents can coordinate autonomously, distribute tasks, and merge their results without rigid external orchestration. |
| 4. Underlying technical hypothesis | Agents develop implicit communication protocols and coordination mechanisms through natural language message exchange. |
| 5. Necessary conditions | (a) Mutual understanding of roles, (b) Unambiguous communication protocol, (c) Conflict detection and resolution, (d) Shared state synchronization. |
| 6. What actually works | Coordination works when the orchestrator (external code) explicitly defines flows, roles, and transition criteria. Success depends on the script, not the agents. |
| 7. Structural limitations (LLM) | Absence of reliable Theory of Mind. No explicit representation of other agents’ states. Natural language communication inherently ambiguous. |
| 8. Systemic limitations | 80% of inter-agent exchanges are redundant (Zhang et al., 2024). Information passing between agents degrades the signal (Information Bottleneck). No mechanism for “shared truth”. |
| 9. Typical failure modes | Deadlock (each agent waits for the other). Work duplication. Unresolved resource conflicts. Loss of critical information during transfers. Infinite loops. |
| 10. Hidden dependencies | The Python/JavaScript orchestrator defines the real flow. Prompts rigidly specify roles. A human supervises blockages. |
| 11. Robustness test | Minimal prompt: chaos. Without human intervention: blocking or loop. With noise/ambiguity: collapse. Long duration: semantic drift. Multi-instances same model: amplified biases. |
| 12. Verdict | Premature / Structurally limited — Real coordination is in the orchestrator, not in the agents. |
Reference Publications
- Cemri et al. (2025) — “Why Do Multi-Agent LLM Systems Fail?” — MAST taxonomy: 14 failure modes, 41-87% failure rate.
- Zhang et al. (2025) — Attributes roughly 90% of observed success to the orchestrator rather than the agents.
- Li et al. (2024) — Responsibility dilution in large multi-agent systems.
4. CROSS-CUTTING SYNTHESIS OF LIMITATIONS
4.1 Structural Limitations (Inherent to LLMs)
These limitations derive directly from the architecture of autoregressive language models and cannot be resolved by scaling or prompt engineering.
4.1.1 Absence of World Model
| Aspect | Observation |
|---|---|
| Nature | LLMs do not maintain an explicit representation of world state. Each token is predicted conditionally on the previous context, without an underlying causal model. |
| Consequence | Inability to predict action effects, simulate future states, or reason counterfactually. |
| Implication for agents | The observed “planning” is textual completion of plan-like patterns, not generation of valid plans. |
| Publications | Lopez-Paz et al. (2024), Kambhampati et al. (2024) |
4.1.2 Reasoning as Pattern-Matching
| Aspect | Observation |
|---|---|
| Nature | What appears as “reasoning” is probabilistic interpolation between patterns seen during training. |
| Consequence | Failure on out-of-distribution problems, simple lexical variations, new compositions of known concepts. |
| Implication for agents | “From scratch” reasoning is absent. The model recognizes typical solutions but does not derive them. |
| Publications | Mittal et al. (2024), Dziri et al. (2023) |
4.1.3 Constant Time per Token
| Aspect | Observation |
|---|---|
| Nature | An LLM takes essentially constant time to generate each token, regardless of the logical complexity required. |
| Consequence | Impossibility of solving problems whose complexity varies (e.g., combinatorial search, logical verification). |
| Implication for agents | NP-complete or semi-decidable problems cannot be solved by token generation. |
| Publications | Kambhampati et al. (2024), Yedidia et al. (2024) |
4.1.4 Reversal Curse
| Aspect | Observation |
|---|---|
| Nature | If the model learned “A is the father of B”, it does not automatically deduce “B is the son of A”. |
| Consequence | Logical relations are not represented bidirectionally. |
| Implication for agents | Symmetric or inverse reasoning requires explicit presence in the data. |
| Publications | Hui et al. (2024), Levy et al. (2024) |
4.2 Systemic Limitations (Inherent to Agentic Architectures)
4.2.1 Error Cascade
| Aspect | Observation |
|---|
| Nature | In a multi-agent or multi-step system, a minor error from one component propagates and amplifies. |
| Frequency | Critical — identified as major cause of failure in 80%+ of multi-agent systems. |
| Implication | The reliability of a chain of N steps is approximately (step_reliability)^N. With 90% per step, 10 steps give ~35% reliability. |
| Publications | Lin et al. (2024), Cemri et al. (2025) |
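As a sanity check, the figure above can be reproduced with a few lines of Python. This is a minimal sketch assuming independent, identically reliable steps; the 90% / 10-step case gives 0.9^10 ≈ 0.35.

```python
# Illustrative calculation of end-to-end reliability for a chain of N steps,
# assuming independent failures and identical per-step reliability.
def chain_reliability(per_step: float, n_steps: int) -> float:
    return per_step ** n_steps

for r in (0.99, 0.95, 0.90):
    print(f"per-step={r:.2f} -> 10 steps: {chain_reliability(r, 10):.2%}")
# per-step=0.90 -> 10 steps: ~34.87%
```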
4.2.2 Responsibility Dilution
| Aspect | Observation |
|---|---|
| Nature | In large multi-agent systems, no agent is “responsible” for the final result, creating waiting loops. |
| Consequence | Blockages, non-decisions, infinite responsibility passing. |
| Publications | Li et al. (2024) |
4.2.3 Error Homogeneity
| Aspect | Observation |
|---|---|
| Nature | If all agents use the same base model (or similar models), their errors are correlated. |
| Consequence | Majority voting or cross-verification does not correct shared systematic biases. |
| Publications | Schwartz et al. (2024), Pärnamaa et al. (2024) |
4.2.4 Exponential Cost
| Aspect | Observation |
|---|---|
| Nature | Multi-agent architectures consume 5x to 500x more tokens for marginal gains (<5%). |
| Consequence | Economic non-viability for most use cases. |
| Publications | Bansal et al. (2024), Zhou et al. (2024) |
4.3 Summary Table: The Glass Ceiling of Scaling
| Capability | Scaling Impact | Identified Ceiling |
|---|---|---|
| Factual knowledge | Significant improvement | Limited by data exhaustion |
| Linguistic fluency | Improvement | Nearly saturated |
| Stylistic coherence | Improvement | Nearly saturated |
| Logical reasoning | Marginal improvement | ~85-90% on controlled benchmarks |
| Autonomous planning | No structural improvement | Architectural ceiling |
| Causality | No improvement | Absent from architecture |
| Robustness to perturbations | No improvement | Intrinsic fragility |
| Self-verification | No improvement | Impossible by design |
5. LIST OF RECURRENT ILLUSIONS
This section lists patterns that are regularly presented as acquired capabilities but which, upon analysis, are illusions or overinterpretations.
5.1 Illusion: The Agent “Understands” the Task
| Aspect | Reality |
|---|---|
| Appearance | The agent produces a coherent and relevant response. |
| Actual mechanism | Pattern-matching on similar tasks seen during training. |
| Falsification test | Slightly modify the formulation or entity names -> collapse. |
| Reference | Valmeekam et al. (2024) — PlanBench: variable renaming. |
5.2 Illusion: Multi-Agent is More Intelligent than Single-Agent
| Aspect | Reality |
|---|---|
| Appearance | The multi-agent system solves complex problems. |
| Actual mechanism | The orchestrator (Python/JS code) defines the real logic. Agents are text generators in a predefined workflow. |
| Falsification test | Replace LLM calls with templates -> similar results on structured tasks. |
| Reference | Zhang et al. (2025) — 90% of success in the orchestrator. |
5.3 Illusion: The Agent Learns from Its Mistakes
| Aspect | Reality |
|---|---|
| Appearance | After several attempts, the agent produces a correct answer. |
| Actual mechanism | External feedback (compiler error, test result) guides correction. Without feedback, no learning. |
| Falsification test | Remove external feedback -> no convergence. |
| Reference | Huang et al. (2024), Shinn et al. (2023). |
5.4 Illusion: Debate Improves Accuracy
| Aspect | Reality |
|---|---|
| Appearance | After discussion between agents, the final answer is better. |
| Actual mechanism | If the correct answer is in training data, debate can “surface” it. Otherwise, consensus on an error. |
| Falsification test | Test on truly new problems -> no improvement. |
| Reference | Liang et al. (2023), Du et al. (2024). |
5.5 Illusion: Capabilities Emerge with Scaling
| Aspect | Reality |
|---|---|
| Appearance | From a certain size, the model suddenly “acquires” a capability. |
| Actual mechanism | Artifact of metric choice (binary vs continuous). Continuous curves show gradual improvement, no jump. |
| Falsification test | Use continuous metrics -> “jump” disappears. |
| Reference | Schaeffer et al. (2023). |
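The mechanism behind this falsification test can be illustrated with a small, purely synthetic simulation (all numbers and the logistic curve are illustrative assumptions, not measurements): if per-token accuracy improves smoothly with scale, an all-or-nothing exact-match metric over a 30-token answer still shows an apparent “jump”.

```python
import math

# Hypothetical smooth per-token accuracy as a function of (log) model scale.
def per_token_accuracy(log_scale: float) -> float:
    return 1 / (1 + math.exp(-(log_scale - 10)))  # smooth logistic curve, illustrative only

SEQ_LEN = 30  # exact match requires all 30 tokens to be correct

for log_scale in range(6, 15):
    p = per_token_accuracy(log_scale)
    exact_match = p ** SEQ_LEN  # nonlinear metric: stays near 0, then rises sharply
    print(f"scale={log_scale:2d}  per-token={p:.3f}  exact-match={exact_match:.3f}")
```

The continuous per-token curve rises gradually; only the binary exact-match metric produces the appearance of a discontinuous “emergence”.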
5.6 Illusion: The Agent is Autonomous
| Aspect | Reality |
|---|---|
| Appearance | The agent accomplishes a task “end-to-end”. |
| Actual mechanism | The prompt engineer optimized the instructions. Failure cases are filtered in demos. A human validates behind the scenes. |
| Falsification test | Deploy without supervision -> 41-87% failure rate (Cemri et al., 2025). |
| Reference | Horton (2023), Luo et al. (2024). |
5.7 Illusion: RAG “Understands” Documents
| Aspect | Reality |
|---|---|
| Appearance | The agent responds correctly by citing sources. |
| Actual mechanism | Vector similarity + conditioned generation. No logical understanding of the document. |
| Falsification test | Insert contradictory passages -> the agent cites both without reconciling them. |
| Reference | Pradeep et al. (2024), Liu et al. (2024). |
5.8 Illusion: The Agent Plans
| Aspect | Reality |
|---|---|
| Appearance | The agent produces a sequence of steps that looks like a plan. |
| Actual mechanism | Text completion in plan format. No validity verification, no simulation. |
| Falsification test | Request a plan for an invented domain -> “coherent” but unfeasible plan. |
| Reference | Kambhampati et al. (2024). |
6. REALISTIC DESIGN PRINCIPLES
6.1 Principle 1: Mandatory Closed Loop
An agent can only improve its performance if an external and deterministic verifier provides usable feedback.
Implementation:
- Always couple the agent with a compiler, interpreter, automated test, or simulator.
- Feedback must be binary (success/failure) or contain structured error information.
- Never rely on the agent’s self-evaluation.
Viable examples:
- Code + Automated unit tests
- SQL + Schema validation
- Robot + Physical simulator
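A minimal sketch of such a closed loop, assuming a hypothetical llm_generate call (to be replaced by any provider SDK) and using pytest as the external deterministic verifier:

```python
import pathlib
import subprocess
import tempfile

# Hypothetical model call: replace with your provider's SDK. The name and
# signature are assumptions made for this sketch, not any specific API.
def llm_generate(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def run_tests(code: str, tests: str) -> tuple[bool, str]:
    """Deterministic verifier: write the candidate code and its tests, then run pytest."""
    workdir = pathlib.Path(tempfile.mkdtemp())
    (workdir / "solution.py").write_text(code)
    (workdir / "test_solution.py").write_text(tests)
    result = subprocess.run(
        ["pytest", "-q", "test_solution.py"],
        cwd=workdir, capture_output=True, text=True, timeout=120,
    )
    return result.returncode == 0, result.stdout + result.stderr

def closed_loop(task: str, tests: str, max_iters: int = 3) -> str | None:
    feedback = ""
    for _ in range(max_iters):
        code = llm_generate(f"{task}\n\nPrevious verifier output:\n{feedback}")
        ok, feedback = run_tests(code, tests)
        if ok:        # the success signal comes from pytest, never from the model
            return code
    return None       # after max_iters, escalate to a human rather than trusting self-evaluation
```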
6.2 Principle 2: Limited and Explicit Scope
Each agent must operate in a narrow, well-defined domain where its patterns are overrepresented in training data.
Implementation:
- Explicitly define the boundaries of the action domain.
- Refuse out-of-domain tasks rather than attempting generalization.
- Use multiple specialized agents rather than one “generalist” agent.
Anti-pattern to avoid:
- “Universal agent” that does everything
- Vague prompt like “Solve this problem”
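A minimal sketch of explicit scope gating; the intent labels and the classify_intent helper are hypothetical placeholders:

```python
# Scope gating: the agent is only invoked for tasks matching a whitelist of
# supported intents; everything else is refused rather than attempted.
SUPPORTED_INTENTS = {"sql_generation", "code_translation", "schema_extraction"}

def classify_intent(task: str) -> str:
    # In practice this could be a rule-based router or a constrained classifier.
    task = task.lower()
    if "sql" in task:
        return "sql_generation"
    if "translate" in task and "code" in task:
        return "code_translation"
    return "unknown"

def handle(task: str) -> str:
    intent = classify_intent(task)
    if intent not in SUPPORTED_INTENTS:
        return "REFUSED: task is outside this agent's declared scope."
    return f"DISPATCH: {intent}"  # call the specialised agent for this intent
```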
6.3 Principle 3: Explicit Orchestration
In a multi-agent system, all coordination logic must be in the orchestrator (code), not in prompts.
Implementation:
- The workflow is defined in code (Python, etc.), not in natural language.
- Transitions between states are deterministic.
- Agents are “functions” called by the orchestrator, not autonomous entities.
Corollary:
- “Multi-agent” is often a disguised workflow. Assume this reality rather than masking it.
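A minimal sketch of this principle, with placeholder agent functions standing in for single constrained LLM calls; the state fields and agent names are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class TicketState:
    text: str
    category: str | None = None
    draft: str | None = None

def classify_agent(state: TicketState) -> TicketState:
    state.category = "billing"   # placeholder for one constrained LLM call
    return state

def draft_agent(state: TicketState) -> TicketState:
    state.draft = f"[{state.category}] draft reply to: {state.text[:40]}"  # placeholder LLM call
    return state

def orchestrate(ticket: str) -> TicketState:
    state = TicketState(text=ticket)
    state = classify_agent(state)   # step 1: the order is fixed in code, not negotiated by agents
    if state.category is None:
        raise ValueError("classification failed -> escalate to a human")
    return draft_agent(state)       # step 2: deterministic transition
```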
6.4 Principle 4: Human Supervision Beyond 85%
To achieve reliability above 85-90%, invest in human supervision, not model augmentation.
Implementation:
- Define checkpoints where a human validates critical decisions.
- Provide an “escalation” mode to a human operator.
- Measure true cost (human + compute) and not just API cost.
Economic reality:
- The cost of human supervision for the last 10% is often lower than the cost of a complex architecture that fails unpredictably.
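A minimal sketch of such a checkpoint, assuming a hypothetical request_human_review routing function and a confidence score produced upstream:

```python
CONFIDENCE_THRESHOLD = 0.85   # beyond this band, human review is cheaper than more architecture

def request_human_review(item: dict) -> dict:
    # Hypothetical routing to a review queue (ticketing system, labeling UI, ...).
    raise NotImplementedError("route to a human review queue")

def finalize(item: dict) -> dict:
    verified = item.get("verifier_passed", False)      # signal from an external verifier
    confident = item.get("confidence", 0.0) >= CONFIDENCE_THRESHOLD
    if verified and confident:
        return item                                    # automated path for the "easy" 85-90%
    return request_human_review(item)                  # escalate the remaining cases
```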
6.5 Principle 5: Documentation of Hidden Dependencies
Any system presented as “autonomous” must make explicit its implicit human dependencies.
Mandatory checklist:
- Which prompts were hand-optimized, and by whom?
- Which few-shot examples are embedded in the prompts?
- Where does a human validate or filter outputs (including in demos)?
- What coordination logic lives in the orchestrator code rather than in the agents?
- How were the target domain and the evaluation data selected?
6.6 Principle 6: Realistic Metrics
Measure performance in real conditions, not on contaminated benchmarks.
Implementation:
- Use truly novel test data (post-training cutoff).
- Include costs (tokens, latency, human supervision) in metrics.
- Measure robustness to perturbations, not just nominal performance.
Metrics to avoid:
- Accuracy on public benchmark (contamination)
- “Success rate” without success definition
- Cherry-picked comparisons
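A minimal sketch of a perturbation-robustness harness; agent_solve, checker, and the synonym substitutions are illustrative assumptions:

```python
import random

# Evaluate the same agent on lightly perturbed variants of each task,
# not only on the nominal phrasing.
def perturb(task: str, rng: random.Random) -> str:
    synonyms = {"compute": "calculate", "list": "enumerate", "find": "determine"}
    words = [synonyms.get(w, w) if rng.random() < 0.3 else w for w in task.split()]
    return " ".join(words)

def robustness(agent_solve, tasks, checker, n_variants: int = 5, seed: int = 0) -> float:
    rng = random.Random(seed)
    scores = []
    for task, expected in tasks:
        variants = [task] + [perturb(task, rng) for _ in range(n_variants)]
        scores.append(sum(checker(agent_solve(v), expected) for v in variants) / len(variants))
    return sum(scores) / len(scores)   # average accuracy across nominal and perturbed forms
```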
6.7 Summary Table: What Works vs. What Doesn’t Work
| Works | Doesn’t Work |
|---|---|
| Code generation with automated tests | Code generation without validation |
| Translation between formal languages | Logical reasoning “from scratch” |
| Completion in a narrow domain | Universal agent |
| RAG with verifiable sources | RAG without relevance verification |
| Explicitly coded orchestration | Emergent coordination |
| Self-debugging with compiler | Self-correction without feedback |
| Structured extraction to defined schema | “Deep” document understanding |
7. REFERENCE BIBLIOGRAPHIC CORPUS
7.1 Key Publications (Top 20)
| # | Reference | Category | Main Contribution |
|---|---|---|---|
| 1 | Kambhampati et al. (2024) - “LLMs Can’t Plan” | C | Formal proof of LLMs’ inability to plan autonomously. LLM-Modulo framework. |
| 2 | Cemri et al. (2025) - “Why Do Multi-Agent LLM Systems Fail?” | D | MAST taxonomy: 14 failure modes, 1600+ annotated traces, 41-87% failure rate. |
| 3 | Valmeekam et al. (2023-2025) - PlanBench Series | D | Benchmark showing performance collapse with lexical perturbations. |
| 4 | Schaeffer et al. (2023) - “Emergence or Metrics?” | C | Demonstration that “emergent” capabilities are metric artifacts. |
| 5 | Huang et al. (2024) - “Self-Correction Fallacy” | C | Proof that self-correction without external feedback is illusory. |
| 6 | Dziri et al. (2023) - “Faith and Fate” | C | Multi-step reasoning degrades because transformers rely on probabilistic pattern matching rather than systematic computation. |
| 7 | Madaan et al. (2023) - “Self-Refine” | B | Success conditions for self-improvement: external feedback required. |
| 8 | Shinn et al. (2023) - “Reflexion” | B | Improvement only with deterministic evaluator. |
| 9 | Liang et al. (2023) - Multi-agent Debate | D | Debate only improves if the solution is memorized. |
| 10 | Park et al. (2023) - “Generative Agents” | A/B | Viable memory/planning architecture for narrative simulation, not problem solving. |
| 11 | Yao et al. (2023) - “ReAct” | A | Effective reasoning-action coupling but sensitive to feedback noise. |
| 12 | Schick et al. (2023) - “Toolformer” | A | Demonstration that tool use is possible but remains text completion. |
| 13 | Wu et al. (2023) - “AutoGen” | B | Coordination for scripted tasks, failure on semantic unexpected events. |
| 14 | Zhou et al. (2024) - “Code-as-Policy” | B | Superiority of deterministic approaches over pure LLM reasoning. |
| 15 | Stechly et al. (2024) - “Backtracking Failure” | D | Impossibility of systematic backtracking. |
| 16 | Gandhi et al. (2024) - “Theory of Mind Gap” | C | Agents fail to model other agents’ beliefs. |
| 17 | Liu et al. (2023) - “Lost in the Middle” | C | Access to relevant information degrades when it sits in the middle of a long context. |
| 18 | Toyer et al. (2024) - “Tensor Trust” | D | Ease of bypassing defenses through semantic jailbreak. |
| 19 | Greshake et al. (2024) - “Indirect Injection” | C | Vulnerability to hidden instructions in third-party content. |
| 20 | Kaplan et al. (2024) - “Revised Scaling Laws” | B | The marginal cost of reasoning improvement becomes prohibitive. |
7.2 Detailed Source Classification
Category A: Demonstrated and Robust Capability
| Reference | What is demonstrated | Validity conditions |
|---|---|---|
| Yao et al. (2023) - ReAct | Thought-action coupling on web tasks | Usable feedback, limited domain |
| Schick et al. (2023) - Toolformer | Autonomous learning of API usage | Well-documented APIs, simple tasks |
| Rozière et al. (2023) - CodeLlama | High-quality code completion | Popular languages, local context |
| Zheng et al. (2024) - Unit tests | Test generation with Pytest loop | Deterministic framework feedback |
Category B: Conditional / Fragile Capability
| Reference | What works partially | Fragility point |
|---|---|---|
| Madaan et al. (2023) - Self-Refine | Iterative improvement with feedback | Without external feedback: failure |
| Shinn et al. (2023) - Reflexion | Learning through reflection | Requires deterministic evaluator |
| Park et al. (2023) - Generative Agents | Coherent social simulation | Fails on problem solving |
| Wu et al. (2023) - AutoGen | Scripted multi-agent coordination | Fails on unexpected events |
| Zhou et al. (2024) - Code-as-Policy | Plan execution in code | Limited to codifiable domains |
Category C: Identified Structural Limitation
Category D: Documented Failure
Category E: Premature Promise / Overinterpretation
7.3 Publications <-> Patterns Mapping
| Pattern | Reference publications |
|---|---|
| Autonomous planning | Kambhampati (2024), Valmeekam (2023-2025), Stechly (2024) |
| Self-correction | Huang (2024), Madaan (2023), Shinn (2023), Liu (2024) |
| Multi-agent debate | Liang (2023), Du (2024), Gandhi (2024) |
| Multi-agent coordination | Cemri (2025), Zhang (2025), Li (2024), Nguyen (2024) |
| Emergent capabilities | Schaeffer (2023), Zhu (2025), Gudibande (2024) |
| Tool-use | Yao (2023), Schick (2023), Patil (2023), Qin (2024) |
| Long-term memory | Liu (2023) |
| Agent security | Greshake (2024), Toyer (2024) |
7.4 Identified Evidence Gaps
The following domains lack robust empirical evidence despite frequent claims:
| Domain | Common claim | Evidence status |
|---|---|---|
| Causal reasoning | “The model understands causal relations” | No positive evidence |
| Authentic creativity | “The model generates truly new ideas” | Not falsifiable with current metrics |
| Deep understanding | “The model understands text meaning” | Operationally indistinguishable from pattern-matching |
| Continuous learning | “The agent improves with experience” | Accumulation, not generalization |
| Self-awareness | “The model knows what it doesn’t know” | Imperfect calibration, not metacognition |
8. APPENDICES
8.1 Pattern Evaluation Checklist
For any presented agentic pattern, apply this checklist:
- [ ] 1. Does the pattern remain valid if the prompt is reduced to essentials?
- [ ] 2. Does it work without implicit human intervention?
- [ ] 3. Does it resist noise or ambiguity in input?
- [ ] 4. Does it hold over time (beyond a short session)?
- [ ] 5. Does it remain valid when multiple instances of the same model interact?
- [ ] 6. Are results reproducible with other models of the same class?
- [ ] 7. Is the benchmark used free from contamination?
- [ ] 8. Do metrics capture real task success?
- [ ] 9. Is total cost (tokens, latency, supervision) viable?
- [ ] 10. Are human dependencies explicitly documented?
VERDICT:
- 10/10 -> Demonstrated robust capability (rare)
- 7-9/10 -> Conditional capability (document conditions)
- 4-6/10 -> Fragile pattern (don't promise in production)
- 0-3/10 -> Illusion or premature promise
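The verdict bands above can be expressed directly as a small scoring helper (a sketch mirroring the thresholds in this checklist):

```python
# Map the number of checklist items that pass to the verdict bands above.
def verdict(checks: list[bool]) -> str:
    assert len(checks) == 10, "the checklist has exactly 10 items"
    score = sum(checks)
    if score == 10:
        return "Demonstrated robust capability (rare)"
    if score >= 7:
        return "Conditional capability (document conditions)"
    if score >= 4:
        return "Fragile pattern (don't promise in production)"
    return "Illusion or premature promise"
```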
8.2 Technical Glossary
| Term | Operational definition |
|---|---|
| Agent | Software system combining an LLM with a perception-reasoning-action loop |
| Orchestrator | Code (non-LLM) that defines workflow and coordinates agents |
| Scaling | Increase in parameters, data, or compute |
| Emergence | Appearance of qualitatively new capabilities (subject to controversy) |
| Pattern-matching | Identification of similarities with training data |
| World model | Explicit internal representation of world state and dynamics |
| Feedback loop | Cycle where action output is used to modify the next action |
| Ground truth | Correct reference value for evaluating a prediction |
| Contamination | Presence of test data in training data |
| Sycophancy | Tendency to modify one’s answer to satisfy the interlocutor |
8.3 Complete Bibliographic References
Academic Publications
- Kambhampati, S., Valmeekam, K., Guan, L., et al. (2024). “LLMs Can’t Plan, But Can Help Planning in LLM-Modulo Frameworks.” Proceedings of ICML 2024, 235.
- Cemri, M., Pan, M. Z., Yang, S., et al. (2025). “Why Do Multi-Agent LLM Systems Fail?” arXiv.13657.
- Valmeekam, K., Marquez, M., Sreedharan, S., & Kambhampati, S. (2023). “On the Planning Abilities of Large Language Models—A Critical Investigation.” NeurIPS 2023.
- Schaeffer, R., Miranda, B., & Koyejo, S. (2023). “Are Emergent Abilities of Large Language Models a Mirage?” NeurIPS 2023.
- Huang, J., Shao, Z., et al. (2024). “Large Language Models Cannot Self-Correct Reasoning Yet.” ICLR 2024.
- Dziri, N., Lu, X., Sclar, M., et al. (2023). “Faith and Fate: Limits of Transformers on Compositionality.” NeurIPS 2023.
- Madaan, A., Tandon, N., et al. (2023). “Self-Refine: Iterative Refinement with Self-Feedback.” NeurIPS 2023.
- Shinn, N., Cassano, F., et al. (2023). “Reflexion: Language Agents with Verbal Reinforcement Learning.” NeurIPS 2023.
- Yao, S., Zhao, J., et al. (2023). “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR 2023.
- Park, J. S., O’Brien, J., et al. (2023). “Generative Agents: Interactive Simulacra of Human Behavior.” UIST 2023.
- Liang, T., He, Z., et al. (2023). “Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate.” arXiv.19118.
- Gandhi, K., et al. (2024). “Understanding Social Reasoning in Language Models with Language Models.” NeurIPS 2024.
- Liu, N. F., Lin, K., et al. (2023). “Lost in the Middle: How Language Models Use Long Contexts.” TACL 2024.
- Greshake, K., et al. (2024). “Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection.” AISec 2023.
- Stechly, K., Marquez, M., & Kambhampati, S. (2024). “GPT-4 Doesn’t Know It’s Wrong: An Analysis of Iterative Prompting for Reasoning Problems.” NeurIPS FM4DM Workshop.
9. CONCLUSION
This meta-analysis establishes a critical framework for evaluating the real capabilities of LLM-based AI agents. The main findings are:
- Limitations are structural, not circumstantial: The autoregressive architecture of LLMs imposes performance ceilings that scaling alone cannot exceed.
- Multi-agent is not a solution to single-agent limitations: Multi-agent systems inherit their components’ limitations and add their own failure modes (coordination, error cascade).
- Autonomy is a carefully maintained illusion: “Autonomous” systems depend on optimized prompts, coded orchestrators, and implicit human supervision.
- Real successes are in deterministic feedback domains: Code with tests, SQL with validation, robotics with simulator — closed loops work.
- Reliability beyond 85-90% requires humans: For critical cases, human supervision remains more effective and economical than architectural augmentation.
This analysis does not aim to discourage AI agent development, but to establish realistic expectations and robust design principles. AI agents are powerful tools when used within their validity domains, with clear awareness of their limitations.
End of document