What works, what doesn’t, and why — based on analysis of publications (2023-2025).
Related articles: Detailed Meta-Analysis of AI Agents | Building Blocks for AI Agents | 11 Multi-Agent Orchestration Patterns

The Fundamental Principle
THE GOLDEN RULE
An AI agent succeeds when it generates content that will be validated by an external deterministic system. It fails when it must validate its own work.
This rule explains 90% of documented successes and failures. It breaks down into 3 corollaries:
- Closed loop = Likely success: The agent generates, an external system validates (compiler, test, simulator)
- Open loop = Likely failure: The agent generates and self-evaluates without external feedback
- The locus of decision determines success: The more logic is in orchestration code, the more reliable the system
The Paradox
Why Code is Easier Than a Marketing Plan
This is deeply counterintuitive:
| | Code | Marketing Plan |
|---|---|---|
| Human perception | “Difficult, technical” | “Easy, it’s just text” |
| Reality for an LLM | ✅ Easy to validate | ❌ Impossible to validate automatically |
Code has an automatic verifier
Feedback is binary, immediate, precise, and automatic.
A marketing plan has none
Feedback is subjective, delayed (6 months), ambiguous, and requires a human.
The general rule
The more a task seems “creative” and “human”, the harder it is for an autonomous agent.
The more a task seems “technical” and “rigid”, the easier it is for an autonomous agent.
| Domain | Automatic verifier? | Agent difficulty |
|---|---|---|
| Code | ✅ Compiler + Tests | Easy |
| SQL | ✅ Execution + Schema | Easy |
| Formal math | ✅ Solver (Lean, Coq) | Easy |
| Extraction → JSON | ✅ JSON Schema | Easy |
| Translation EN→FR | ⚠️ Partial (grammar) | Medium |
| Marketing plan | ❌ None | Hard |
| Business strategy | ❌ None | Hard |
| Creative writing | ❌ None | Hard |
It’s not a question of intellectual complexity. It’s a question of automatic verifiability.
What Actually Works
The following patterns have solid empirical evidence of success in production.
1. Code Generation with Automatic Validation
| Aspect | Detail |
|---|---|
| Success rate | 85-95% on medium complexity tasks |
| Why it works | The compiler/tester provides deterministic feedback. The agent iterates until success. |
| Conditions | Existing unit tests, localized context (1-3 files), clear specification |
| References | Olausson 2024, Zheng 2024, Jimenez 2024 (SWE-bench) |
Implementation: a Generate → Run tests → Analyze errors → Fix → Repeat loop, capped at 5 iterations. The agent doesn’t need to “think”; it reacts to error messages.
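A minimal sketch of this loop, assuming a hypothetical `generate_patch` function that wraps the LLM call; everything else is deterministic plumbing around the test runner:

```python
import subprocess

MAX_ITERATIONS = 5  # hard cap: no infinite reflection loops

def generate_patch(error_output: str) -> None:
    """Hypothetical LLM call: rewrites the target files based on test errors."""
    raise NotImplementedError

def fix_until_green(test_cmd: tuple[str, ...] = ("pytest", "-x", "-q")) -> bool:
    """Generate → Run tests → Analyze errors → Fix → Repeat."""
    for _ in range(MAX_ITERATIONS):
        result = subprocess.run(test_cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return True  # the deterministic verifier says: done
        # Feed raw error messages back; the agent reacts, it does not "think"
        generate_patch(result.stdout + result.stderr)
    return False  # still red after 5 attempts: fail loud, escalate to a human
```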
2. Natural Language → Structured Format Translation
| Aspect | Detail |
|---|---|
| Success rate | 90-98% for SQL, Terraform, CSS, business DSL |
| Why it works | The target format constrains the space of possible responses. The rigid structure rejects noise. |
| Conditions | Defined target schema/grammar, examples in prompt, syntactic validation |
| References | Dong 2024, Databricks 2024, Wang 2024 (Code-as-Policy) |
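A sketch of what syntactic validation can look like: the generated SQL is parsed against the real schema before touching any data. Here Python’s built-in sqlite3 plays the verifier; the table and query are illustrative:

```python
import sqlite3

def validate_sql(query: str, schema_ddl: str) -> str | None:
    """Returns None if the query parses against the schema, else the error."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)    # load the real schema, no data
        conn.execute(f"EXPLAIN {query}")  # parse and plan without executing
        return None
    except sqlite3.Error as exc:
        return str(exc)  # binary, immediate, precise feedback for the agent
    finally:
        conn.close()

error = validate_sql(
    "SELECT name FROM users WHERE age > 30",
    "CREATE TABLE users (id INTEGER, name TEXT, age INTEGER);",
)
assert error is None
```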
3. Information Extraction to Defined Schema
| Aspect | Detail |
|---|---|
| Success rate | > 95% for PDF/text extraction → JSON/SQL |
| Why it works | Task of “targeted reading” (metrics) rather than creative synthesis. The schema forces noise rejection. |
| Conditions | Explicit output schema, defined required fields, completeness validation |
| References | Wang 2024 (ETL), He 2025 (meta-analysis), McKinsey AI 2025 |
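A sketch of the completeness validation with the `jsonschema` library; the invoice schema is purely illustrative:

```python
from jsonschema import Draft202012Validator  # pip install jsonschema

# Illustrative schema: required fields force the agent to reject noise
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_id": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string", "enum": ["EUR", "USD"]},
    },
    "required": ["invoice_id", "total", "currency"],
    "additionalProperties": False,
}

def validate_extraction(data: dict) -> list[str]:
    """Returns precise error messages to feed back to the agent."""
    validator = Draft202012Validator(INVOICE_SCHEMA)
    return [err.message for err in validator.iter_errors(data)]

errors = validate_extraction({"invoice_id": "F-2025-001", "total": 1200.0})
# → ["'currency' is a required property"]
```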
4. RAG with Verifiable Sources
| Aspect | Detail |
|---|---|
| Success rate | 85-95% with high-quality pre-filtered sources |
| Why it works | Grounding on indexed sources eliminates factual hallucination. Success comes from upstream filtering. |
| Conditions | Verified sources, mandatory citations, knowledge graph for links |
| References | Elicit/Consensus 2024, GraphRAG 2024, Dettmers 2024 |
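A sketch of enforcing mandatory citations: every reference in the answer must point to a chunk that was actually retrieved. The `[doc-N]` citation format is an assumption:

```python
import re

def check_citations(answer: str, retrieved_ids: set[str]) -> list[str]:
    """Returns problems: no citations at all, or references to unretrieved sources."""
    cited = set(re.findall(r"\[(doc-\d+)\]", answer))
    if not cited:
        return ["answer contains no citation"]  # grounding is mandatory
    return sorted(cited - retrieved_ids)  # hallucinated references, if any

problems = check_citations(
    "Revenue grew 12% [doc-3], driven by APAC [doc-9].",
    retrieved_ids={"doc-1", "doc-3"},
)
# → ["doc-9"]: the agent cited a source it never retrieved
```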
5. Orchestration in Code (not Prompts)
| Aspect | Detail |
|---|---|
| Key finding | ~90% of a multi-agent system’s success depends on the Python/YAML orchestrator |
| Why it works | Coordination logic is deterministic. Agents execute atomic tasks. |
| Conditions | Hard-coded workflow, agents specialized on narrow tasks, state managed by orchestrator |
| References | Zhang 2025, Zhou 2024 (Code-as-Policy), Wu 2023 (AutoGen) |
Implementation: Manager/Worker pattern. The manager (Python/YAML code) decides who does what. Workers (LLMs) execute atomic tasks. Agents never negotiate with each other.
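A minimal sketch of the pattern: the workflow, the routing and the state live in plain Python, and the workers are hypothetical single-task LLM wrappers (bodies omitted):

```python
from typing import Callable

# Hypothetical workers: each wraps one LLM call for one atomic task
def summarize(state: dict) -> dict: ...
def extract_entities(state: dict) -> dict: ...
def draft_report(state: dict) -> dict: ...

# The manager is code: a hard-coded sequence, not a negotiation
PIPELINE: list[Callable[[dict], dict]] = [summarize, extract_entities, draft_report]

def run(document: str) -> dict:
    state = {"input": document}  # state is owned by the orchestrator
    for worker in PIPELINE:
        state = worker(state)    # workers never talk to each other
    return state
```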
6. Neuro-Symbolic Hybridization
| Aspect | Detail |
|---|---|
| Evidence | Historic successes: AlphaGeometry (IMO), FunSearch (Cap Set), GNoME (crystals) |
| Why it works | The LLM generates candidates; a formal system (SAT/Prolog/DFT) validates them. Correctness is proven by the verifier, not guessed. |
| Conditions | Formalizable domain, external verifier available, feedback loop |
| References | Trinh 2024 (AlphaGeometry), DeepMind FunSearch/GNoME 2023, Topin 2024 |
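A toy sketch of this division of labor, with SymPy standing in for the formal verifier; the candidate list stands in for LLM proposals:

```python
import sympy as sp  # pip install sympy

x = sp.symbols("x")
target = x**2 - 5*x + 6

# Candidate factorizations an LLM might propose
candidates = ["(x - 2)*(x - 3)", "(x - 1)*(x - 6)", "(x + 2)*(x + 3)"]

for cand in candidates:
    # The verdict comes from symbolic algebra: proven, not guessed
    if sp.simplify(sp.sympify(cand) - target) == 0:
        print(f"verified: {cand}")  # → verified: (x - 2)*(x - 3)
        break
```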
Case Study: Get-Shit-Done
Why orchestration frameworks work
Frameworks like BMAD, Get-Shit-Done (GSD), or GitHub Spec Kit show impressive results in software engineering. Let’s analyze why.
Get-Shit-Done Architecture
The GSD workflow in detail
```bash
# 1. QUESTIONS PHASE — The agent asks questions until it understands
/gsd:new-project
# → Generates: PROJECT.md, REQUIREMENTS.md

# 2. RESEARCH PHASE — Parallel agents explore the domain
# → Generates: .planning/research/

# 3. PLANNING PHASE — Roadmap creation
# → Generates: ROADMAP.md, STATE.md
# → HUMAN VALIDATION: "Approve the roadmap"

# 4. CONTEXT PHASE — Capture preferences before implementation
/gsd:context
# → Generates: CONTEXT.md
# "Visual features → Layout, density, interactions, empty states"
# "APIs/CLIs → Response format, flags, error handling"

# 5. BUILD PHASE — Execution with atomic commits
/gsd:build
# → Each task = 1 commit
# abc123f docs(08-02): complete user registration plan
# def456g feat(08-02): add email confirmation flow
# hij789k feat(08-02): implement password hashing
```
Why GSD works: Mapping with principles
| What GSD does | Principle applied |
|---|---|
| Workflows defined in .md files and Node.js code | ✅ Deterministic orchestration — Logic is in code, not prompts |
| Each agent has a unique role (Questions, Research, Planning, Build) | ✅ Strict specialization — One agent = one task |
| “You approve the roadmap” before build | ✅ Human-in-the-loop — Human validation at each phase |
| PROJECT.md, REQUIREMENTS.md, ROADMAP.md | ✅ Structured output — Documents with defined format |
| Compiler, tests, linter, git | ✅ Closed loop — Deterministic feedback |
| Atomic commits per task | ✅ Fail fast — Traceability, rollback possible |
| “Your main context window stays at 30-40%” | ✅ Minimal context — Subagents with fresh contexts |
What GSD does NOT do
❌ The agent does NOT decide when to move to the next phase
→ The orchestrator (code) decides
❌ Agents do NOT negotiate with each other
→ They follow a coded sequential workflow
❌ The agent does NOT self-correct without feedback
→ The compiler/tests provide feedback
❌ The agent does NOT "plan" autonomously
→ It generates candidates that humans validate
The lesson
GSD doesn’t prove that “agents work now”.
GSD proves that a properly structured system (coded orchestration + specialization + deterministic feedback + human-in-the-loop) works in domains with automatic verifiers.
Software engineering is the sweet spot for AI agents because all success conditions are naturally present:
| Success condition | Present in dev? |
|---|---|
| Automatic verifier | ✅ Compiler, linter, tests |
| Structured output | ✅ Code = formal format |
| Memorized patterns | ✅ Billions of lines in training data |
| Deterministic feedback | ✅ “Error on line 42” is unambiguous |
| Localizable context | ✅ Files, functions, classes |
What Doesn’t Work
The following patterns seem promising but fail structurally.
1. Self-Correction Without External Feedback
| Aspect | Detail |
|---|---|
| Failure rate | Agent validates its own errors or creates new ones in 60-80% of cases |
| Why it fails | Same weights for generating and critiquing = same biases. Confirmation bias. Sycophancy. |
| Alternative | External deterministic feedback: compiler, tests, simulator, formal verifier |
| References | Huang 2024, Madaan 2023, Valmeekam 2024, Liu 2024 |
⚠️ TRAP: “The agent will re-read and correct its errors” is an illusion. Without an external signal, the agent cannot distinguish an error from a correct response.
2. Autonomous Multi-Step Planning
| Aspect | Detail |
|---|---|
| Failure rate | Up to 90% collapse on planning benchmarks as soon as object names change |
| Why it fails | LLMs generate token by token without a world model. No backtracking. |
| Alternative | Symbolic planner (PDDL) or plan → executable code with assertions |
| References | Kambhampati 2024, Valmeekam 2023-2025, Stechly 2024 |
3. Multi-Agent Debate to Improve Accuracy
| Aspect | Detail |
|---|---|
| Failure rate | Improvement only if the solution is already memorized; degradation otherwise |
| Why it fails | Model homogeneity = same biases. Conformity bias. Echo chambers. |
| Alternative | Author/Critic architecture with external verifier |
| References | Liang 2023, Du 2024, Schwartz 2024 |
4. Betting on Scaling to Solve Limitations
| Aspect | Detail |
|---|---|
| Reality | “Emergent capabilities” are artifacts of non-linear metrics |
| Why it fails | Scaling improves factual knowledge, not reasoning. Sharply diminishing returns. |
| Alternative | Invest in architecture (feedback loops, specialization) |
| References | Schaeffer 2023, Kaplan 2024, Jain 2024 |
5. Universal / Generalist Agent
| Aspect | Detail |
|---|---|
| Failure rate | Beaten by deterministic scripts on 95% of automation tasks |
| Why it fails | Impossible without domain specialization. Tool position bias. |
| Alternative | Strict specialization: 3-5 tools max per agent, narrow domain |
| References | Zhang 2025, Song 2024, Yadav 2024 |
6. “Emergent” Multi-Agent Coordination
| Aspect | Detail |
|---|---|
| Failure rate | 80% of agent exchanges are redundant. Chaos without a directing script. |
| Why it fails | No Theory of Mind. Ambiguous communication. State synchronization impossible. |
| Alternative | Explicit hierarchical orchestration. Manager (code) + Workers (LLM). |
| References | Nguyen 2024, Zhang 2024, Li 2024 |
Decision Matrix
Checklist: Will Your Agent Work?
| Question | Yes → | No → |
|---|---|---|
| Is there an external deterministic verifier? | ✅ Viable | ⚠️ Risky |
| Is output constrained by a schema/format? | Favorable | Caution |
| Does context fit in < 10 steps / 3 files? | Feasible | Fragile |
| Is orchestration coded (not in prompts)? | Robust | Unstable |
| Does each agent have ≤ 5 specialized tools? | Optimal | Overloaded |
| Is human supervision planned for > 15% of tasks? | Realistic | Over-promised |
| Is the pattern over-represented in training data? | Performant | Hallucinations |
Score:
- 7/7 = Excellent
- 5-6 = Viable with precautions
- 3-4 = Prototype only
- < 3 = Rethink architecture
7 Design Principles
1. Mandatory Closed Loop
Any agent that generates content must have an external verifier. If you can’t define an automatic test, reduce scope until you can.
- Code → Compiler + Tests
- SQL → Sandbox execution + schema validation
- Structured text → JSON Schema validation
- Decisions → Simulator or coded business rules
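In code, this amounts to a dispatch table: every output type must map to a verifier, and the absence of one is a design error, not an edge case. A sketch with placeholder verifiers:

```python
from typing import Callable

# Placeholder verifiers: each returns None on success, or an error message
def run_tests(code: str) -> str | None: ...
def explain_sql(query: str) -> str | None: ...
def validate_json(payload: str) -> str | None: ...

VERIFIERS: dict[str, Callable[[str], str | None]] = {
    "code": run_tests,
    "sql": explain_sql,
    "json": validate_json,
}

def verify(output_type: str, content: str) -> str | None:
    if output_type not in VERIFIERS:
        # No automatic test definable → reduce scope, don't ship the agent
        raise ValueError(f"no verifier for {output_type!r}: out of scope")
    return VERIFIERS[output_type](content)
```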
2. Strict Specialization
An effective agent does one thing well. Versatility is the enemy of reliability.
- Maximum 3-5 tools per agent
- Restricted semantic domain (limited ontology)
- Outputs in a single, constrained format
- Ephemeral micro-specialization: agents created for 30 seconds then destroyed
3. Deterministic Orchestration
Coordination logic must be in code, not prompts.
- Manager/Worker pattern: code decides, LLMs execute
- Centralized state managed by orchestrator
- Never negotiate between agents
- Explicit timeouts and fallbacks
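A sketch of the explicit timeout and fallback using only the standard library; `call_worker` is a hypothetical blocking LLM call:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def call_worker(task: str) -> str:
    """Hypothetical blocking LLM call."""
    raise NotImplementedError

def run_with_fallback(task: str, timeout_s: float = 30.0) -> str:
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(call_worker, task)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        return "ESCALATE_TO_HUMAN"  # deterministic fallback, never hang
    finally:
        # Don't block shutdown on a stuck worker (Python 3.9+)
        pool.shutdown(wait=False, cancel_futures=True)
```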
4. Minimal Context
The shorter the context, the more reliable the agent.
- < 10 distinct steps per task
- 1-3 files maximum in context
- Purge memory between tasks
- Avoid conversation history accumulation
5. Integrated Human Supervision
Plan for > 15% human supervision. Invest in supervision rather than model scaling.
- Human checkpoints at each irreversible step
- Human validation for out-of-distribution cases
- Confidence metrics exposed to user
- Automatic escalation if confidence < threshold
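A sketch of the escalation rule; how the confidence score is produced (log-probabilities, self-consistency votes, a judge model) is a separate design choice, and the threshold is illustrative:

```python
CONFIDENCE_THRESHOLD = 0.85  # illustrative; calibrate on production data

def route(answer: str, confidence: float) -> dict:
    """Expose the confidence to the user and escalate below the threshold."""
    return {
        "answer": answer,
        "confidence": confidence,
        "needs_human_review": confidence < CONFIDENCE_THRESHOLD,
    }

assert route("42", confidence=0.60)["needs_human_review"] is True
```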
6. Fail Fast, Fail Loud
The agent must fail quickly and explicitly rather than silently produce wrong results.
- Assertions at each step
- Strict timeout (no infinite reflection loops)
- Detailed logs for debugging
- Never “silently retry”
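A sketch of the principle applied to a single step: the assertion turns a silently wrong result into a loud, logged failure at the step where it happened:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")

def step_extract_total(payload: dict) -> float:
    total = payload.get("total")
    # Assert at the step itself: crash here, not three steps later
    assert isinstance(total, (int, float)) and total >= 0, f"bad total: {total!r}"
    log.info("step=extract_total value=%s", total)  # detailed log for debugging
    return float(total)
```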
7. Test in Real Conditions
Benchmarks lie. Only real deployment validates an agent.
- Production metrics, not contaminated benchmarks
- Tests with variations (temperature, seed, reformulations)
- Monitoring for drift over time
- A/B testing vs deterministic scripts
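A sketch of variation testing, with `call_llm` as a hypothetical wrapper: the same request is replayed under different sampling settings, and any divergence flags a fragile pattern:

```python
from typing import Callable

def stability_check(
    prompt: str,
    call_llm: Callable[..., str],  # hypothetical: call_llm(prompt, temperature=...)
    temperatures: tuple[float, ...] = (0.0, 0.3, 0.7),
    runs: int = 3,
) -> bool:
    """True only if every run yields the same output; divergence = fragility."""
    outputs = {
        call_llm(prompt, temperature=t)
        for t in temperatures
        for _ in range(runs)
    }
    return len(outputs) == 1
```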
Summary by Use Case
✅ VIABLE — Automate
| Use Case | Recommended Pattern |
|---|---|
| Code generation with tests | Compile/Test loop (BMAD, GSD) |
| NL → SQL/DSL/Terraform | Output constraint + validation |
| PDF extraction → JSON | Strict schema + validation |
| RAG on verified corpus | Pre-filtered sources + citations |
| UI automation (forms) | DOM Tree + robust selectors |
| Data wrangling | Script generation + execution |
| Complete dev workflow | Coded orchestration + human-in-loop |
⚠️ CONDITIONAL — With precautions
| Use Case | Recommended Pattern |
|---|---|
| Presentation generation | Template + structured filling |
| Long document analysis | Chunking + supervised aggregation |
| Complex bug resolution | Localized context + human-in-loop |
| Translation | Grammar validation + human review |
❌ NOT VIABLE — Rethink architecture
| Use Case | Alternative |
|---|---|
| Long-term autonomous planning | Use symbolic planner |
| Multi-step reasoning “from scratch” | Break down into verifiable steps |
| Self-correction without feedback | Add external verifier |
| “Generalist” universal agent | Specialize by domain |
| Emergent multi-agent coordination | Explicit coded orchestration |
| Autonomous marketing plan | Agent generates variants, human chooses |
| Autonomous business strategy | Assistant with human validation |
Conclusion
Key Takeaways
AI agents are not “autonomous intelligences”. They are probabilistic pattern-matching systems that work remarkably well WHEN coupled with deterministic verifiers and constrained to a specialized domain.
The correct mental model
The LLM is a generator of plausible candidates; the verifier is the judge; the orchestrator (code) makes the decisions.
Final word
Build systems where the LLM generates and a verifier validates.
Any other architecture is, in 2025, an unkept promise.
Further Reading
- Meta-Analysis: Capabilities, Limitations and Patterns of AI Agents — Systematic analysis of publications with pattern verdict tables
- Building an Agent: The Art of Assembling the Right Building Blocks — Practical guide to languages, orchestration frameworks, models and infrastructure
- The 11 Multi-Agent Orchestration Patterns — Pipeline, Supervisor, Council, Swarm: which pattern for which use case
- Agent Skills: The Onboarding Manual That Turns AI Into an Expert — How to structure instructions for specialized agents