
AI Agent Design Guide: What Works, What Fails


What works, what doesn’t, and why — based on analysis of publications (2023-2025).

Related articles: Detailed Meta-Analysis of AI Agents | Building Blocks for AI Agents | 11 Multi-Agent Orchestration Patterns


The Fundamental Principle

THE GOLDEN RULE

An AI agent succeeds when it generates content that will be validated by an external deterministic system. It fails when it must validate its own work.

This rule explains 90% of the documented successes and failures.


The Paradox

Why Code is Easier Than a Marketing Plan

This is deeply counterintuitive:

| | Code | Marketing plan |
| --- | --- | --- |
| Human perception | "Difficult, technical" | "Easy, it's just text" |
| Reality for an LLM | ✅ Easy to validate | ❌ Impossible to validate automatically |

Code has an automatic verifier

Agent generates code → Compiler: "Error line 42" → Agent fixes → Tests: "1 failed" → Agent fixes → ✓ All tests pass → OBJECTIVE SUCCESS

Feedback is: binary, immediate, precise, automatic.

Marketing plan has none

Agent generates marketing plan → How to validate? ??? → The agent re-reads itself... "Yes, this seems fine" → ✗ No objective validation → Confirmation bias

Feedback is: subjective, delayed (6 months), ambiguous, and requires a human.

The general rule

The more a task seems “creative” and “human”, the harder it is for an autonomous agent.

The more a task seems “technical” and “rigid”, the easier it is for an autonomous agent.

| Domain | Automatic verifier? | Agent difficulty |
| --- | --- | --- |
| Code | ✅ Compiler + tests | Easy |
| SQL | ✅ Execution + schema | Easy |
| Formal math | ✅ Solver (Lean, Coq) | Easy |
| Extraction → JSON | ✅ JSON Schema | Easy |
| Translation EN→FR | ⚠️ Partial (grammar) | Medium |
| Marketing plan | ❌ None | Hard |
| Business strategy | ❌ None | Hard |
| Creative writing | ❌ None | Hard |

It’s not a question of intellectual complexity. It’s a question of automatic verifiability.


What Actually Works

The following patterns have solid empirical evidence of success in production.

1. Code Generation with Automatic Validation

| Aspect | Detail |
| --- | --- |
| Success rate | 85-95% on medium-complexity tasks |
| Why it works | The compiler/tester provides deterministic feedback. The agent iterates until success. |
| Conditions | Existing unit tests, localized context (1-3 files), clear specification |
| References | Olausson 2024, Zheng 2024, Jimenez 2024 (SWE-bench) |

Implementation: Generate → Run tests → Analyze errors → Fix → Repeat loop. Maximum 5 iterations. The agent doesn't need to "think"; it reacts to error messages.
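A minimal sketch of this loop in Python, assuming pytest as the test runner; `generate_candidate` is a hypothetical placeholder for whatever model client you use:

```python
import subprocess
from pathlib import Path

MAX_ITERATIONS = 5  # hard cap; beyond this, escalate to a human

def generate_candidate(task: str, feedback: str) -> str:
    """Hypothetical LLM call: returns candidate source code for the task."""
    raise NotImplementedError("plug in your model client here")

def run_tests() -> tuple[bool, str]:
    """Deterministic verifier: pytest exit code 0 means all tests pass."""
    result = subprocess.run(["pytest", "-x", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def repair_loop(task: str, target: Path) -> bool:
    feedback = ""
    for _ in range(MAX_ITERATIONS):
        target.write_text(generate_candidate(task, feedback))
        passed, feedback = run_tests()
        if passed:
            return True   # objective success: the verifier says so, not the agent
    return False          # fail fast: hand the task back to a human
```

The agent never judges its own output; only the exit code of the test run does.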

2. Natural Language → Structured Format Translation

| Aspect | Detail |
| --- | --- |
| Success rate | 90-98% for SQL, Terraform, CSS, business DSLs |
| Why it works | The target format constrains the space of possible responses. The rigid structure rejects noise. |
| Conditions | Defined target schema/grammar, examples in prompt, syntactic validation |
| References | Dong 2024, Databricks 2024, Wang 2024 (Code-as-Policy) |
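One cheap way to get that syntactic validation, sketched here against SQLite (an assumption; any engine with an EXPLAIN or dry-run facility works the same way):

```python
import sqlite3

def validate_sql(sql: str, schema_ddl: str) -> str | None:
    """Returns None if the query parses against the schema, else the error
    message to feed back to the agent."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)   # build the target schema in memory
        conn.execute("EXPLAIN " + sql)   # parses and plans without running the query
        return None
    except sqlite3.Error as exc:
        return str(exc)                  # precise feedback: "no such column: ..."
    finally:
        conn.close()
```

A non-None result goes straight back into the prompt, exactly like the compiler errors in pattern 1.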

3. Information Extraction to Defined Schema

| Aspect | Detail |
| --- | --- |
| Success rate | > 95% for PDF/text extraction → JSON/SQL |
| Why it works | A task of "targeted reading" (metrics) rather than creative synthesis. The schema forces noise rejection. |
| Conditions | Explicit output schema, defined required fields, completeness validation |
| References | Wang 2024 (ETL), He 2025 (meta-analysis), McKinsey AI 2025 |
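A sketch of the validation step using the jsonschema library; the invoice schema is invented for illustration:

```python
import json
from jsonschema import Draft202012Validator

# Hypothetical extraction schema: required fields force completeness,
# and additionalProperties: false rejects anything the model invents.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_id": {"type": "string"},
        "total": {"type": "number", "minimum": 0},
        "currency": {"enum": ["EUR", "USD", "GBP"]},
    },
    "required": ["invoice_id", "total", "currency"],
    "additionalProperties": False,
}

def validate_extraction(raw_output: str) -> list[str]:
    """Empty list = valid; otherwise the violations to feed back to the agent."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    return [e.message for e in Draft202012Validator(INVOICE_SCHEMA).iter_errors(data)]
```

The `additionalProperties: false` line is what makes the schema reject noise: fields the model hallucinates fail validation instead of leaking downstream.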

4. RAG with Verifiable Sources

| Aspect | Detail |
| --- | --- |
| Success rate | 85-95% with high-quality pre-filtered sources |
| Why it works | Grounding on indexed sources eliminates factual hallucination. Success comes from upstream filtering. |
| Conditions | Verified sources, mandatory citations, knowledge graph for links |
| References | Elicit/Consensus 2024, GraphRAG 2024, Dettmers 2024 |
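The "mandatory citations" condition can be partially enforced in code. The sketch below assumes a bracketed-number citation convention ([1], [2], ...) and only checks citation hygiene, not semantic faithfulness; the latter still needs human review:

```python
import re

CITATION = re.compile(r"\[(\d+)\]")  # assumed convention: claims cite sources as [n]

def check_citations(answer: str, retrieved_ids: set[int]) -> list[str]:
    """Deterministic grounding check. Empty list = pass; otherwise the
    violations to feed back to the agent (or to block the answer)."""
    cited = {int(n) for n in CITATION.findall(answer)}
    problems = []
    if not cited:
        problems.append("answer cites no sources")
    problems.extend(
        f"citation [{n}] matches no retrieved source"
        for n in sorted(cited - retrieved_ids)
    )
    return problems
```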

5. Orchestration in Code (not Prompts)

| Aspect | Detail |
| --- | --- |
| Success rate | 90% of a multi-agent system's success depends on the Python/YAML orchestrator |
| Why it works | Coordination logic is deterministic. Agents execute atomic tasks. |
| Conditions | Hard-coded workflow, agents specialized on narrow tasks, state managed by the orchestrator |
| References | Zhang 2025, Zhou 2024 (Code-as-Policy), Wu 2023 (AutoGen) |

Implementation: Manager/Worker pattern. The manager (Python/YAML code) decides who does what. Workers (LLMs) execute atomic tasks. Agents never negotiate with each other.
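A minimal sketch of the pattern; the worker stubs are hypothetical single-purpose LLM calls:

```python
# Workers: each would be one LLM call with one narrow prompt.
def summarize(ticket: str) -> str: ...     # worker: one atomic task
def draft_reply(summary: str) -> str: ...  # worker: one atomic task

def passes_policy(reply: str) -> bool:
    """Deterministic check (rules, regexes, length limits), not an LLM."""
    return bool(reply)

def handle_ticket(ticket: str) -> str:
    """Manager: hard-coded workflow. It owns the sequence and the state;
    workers never talk to each other."""
    summary = summarize(ticket)
    reply = draft_reply(summary)
    if not passes_policy(reply):
        raise ValueError("policy check failed: escalate to a human")  # fail loud
    return reply
```

The sequence, the state, and the escalation rule all live in the manager code; no prompt gets to decide them.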

6. Neuro-Symbolic Hybridization

| Aspect | Detail |
| --- | --- |
| Success rate | Historic successes: AlphaGeometry (IMO), FunSearch (Cap Set), GNoME (crystals) |
| Why it works | LLM to generate candidates, formal system (SAT/Prolog/DFT) to validate. Success is guaranteed by mathematical laws. |
| Conditions | Formalizable domain, external verifier available, feedback loop |
| References | Assael 2024 (AlphaGeometry), DeepMind FunSearch/GNoME 2023, Topin 2024 |
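A sketch of the generate-and-verify split, loosely in the spirit of FunSearch; both functions are hypothetical placeholders for a real model client and a real formal evaluator:

```python
def propose_candidates(best_so_far: str, n: int = 8) -> list[str]:
    """Hypothetical LLM call: n candidate programs/constructions."""
    raise NotImplementedError

def evaluate(candidate: str) -> float | None:
    """Deterministic verifier: a score, or None if the candidate is invalid.
    For AlphaGeometry this role is a symbolic engine; for GNoME, DFT."""
    raise NotImplementedError

def search(seed: str, rounds: int = 100) -> str:
    best, best_score = seed, evaluate(seed) or 0.0
    for _ in range(rounds):
        for cand in propose_candidates(best):
            score = evaluate(cand)
            if score is not None and score > best_score:  # only the verifier promotes
                best, best_score = cand, score
    return best
```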

Case Study: Get-Shit-Done

Why orchestration frameworks work

Frameworks like BMAD, Get-Shit-Done (GSD), or GitHub Spec Kit show impressive results in software engineering. Let’s analyze why.

Get-Shit-Done Architecture

ORCHESTRATOR (Node.js code)
- Questions Agent → PROJECT.md, REQUIREMENTS.md
- Research Agents → .planning/research/
- Planning Agent → ROADMAP.md, STATE.md
- Build Agents → atomic commits

HUMAN VALIDATION: approval at each phase
DETERMINISTIC VERIFIERS: compiler, tests, linter, git (binary and immediate feedback)

The GSD workflow in detail

```bash
# 1. QUESTIONS PHASE — The agent asks questions until it understands
/gsd:new-project
# → Generates: PROJECT.md, REQUIREMENTS.md

# 2. RESEARCH PHASE — Parallel agents explore the domain
# → Generates: .planning/research/

# 3. PLANNING PHASE — Roadmap creation
# → Generates: ROADMAP.md, STATE.md
# → HUMAN VALIDATION: "Approve the roadmap"

# 4. CONTEXT PHASE — Capture preferences before implementation
/gsd:context
# → Generates: CONTEXT.md
# "Visual features → Layout, density, interactions, empty states"
# "APIs/CLIs → Response format, flags, error handling"

# 5. BUILD PHASE — Execution with atomic commits
/gsd:build
# → Each task = 1 commit
# abc123f docs(08-02): complete user registration plan
# def456g feat(08-02): add email confirmation flow
# hij789k feat(08-02): implement password hashing
```

Why GSD works: Mapping with principles

| What GSD does | Principle applied |
| --- | --- |
| Workflows defined in .md files and Node.js code | Deterministic orchestration: logic is in code, not prompts |
| Each agent has a unique role (Questions, Research, Planning, Build) | Strict specialization: one agent = one task |
| "You approve the roadmap" before build | Human-in-the-loop: human validation at each phase |
| PROJECT.md, REQUIREMENTS.md, ROADMAP.md | Structured output: documents with a defined format |
| Compiler, tests, linter, git | Closed loop: deterministic feedback |
| Atomic commits per task | Fail fast: traceability, rollback possible |
| "Your main context window stays at 30-40%" | Minimal context: subagents with fresh contexts |

What GSD does NOT do

❌ The agent does NOT decide when to move to the next phase
   → The orchestrator (code) decides

❌ Agents do NOT negotiate with each other
   → They follow a coded sequential workflow

❌ The agent does NOT self-correct without feedback
   → The compiler/tests provide feedback

❌ The agent does NOT "plan" autonomously
   → It generates candidates that humans validate

The lesson

GSD doesn’t prove that “agents work now”.

GSD proves that a properly structured system (coded orchestration + specialization + deterministic feedback + human-in-the-loop) works in domains that have automatic verifiers.

Software engineering is the sweet spot for AI agents because all success conditions are naturally present:

| Success condition | Present in dev? |
| --- | --- |
| Automatic verifier | ✅ Compiler, linter, tests |
| Structured output | ✅ Code = formal format |
| Memorized patterns | ✅ Billions of lines in training data |
| Deterministic feedback | ✅ "Error on line 42" is unambiguous |
| Localizable context | ✅ Files, functions, classes |

What Doesn’t Work

The following patterns seem promising but fail structurally.

1. Self-Correction Without External Feedback

| Aspect | Detail |
| --- | --- |
| Failure rate | Agent validates its own errors or creates new ones in 60-80% of cases |
| Why it fails | Same weights for generating and critiquing = same biases. Confirmation bias. Sycophancy. |
| Alternative | External deterministic feedback: compiler, tests, simulator, formal verifier |
| References | Huang 2024, Madaan 2023, Valmeekam 2024, Liu 2024 |

⚠️ TRAP: “The agent will re-read and correct its errors” is an illusion. Without external signal, the agent cannot distinguish an error from a correct response.

2. Autonomous Multi-Step Planning

| Aspect | Detail |
| --- | --- |
| Failure rate | 90% collapse on benchmarks as soon as object names change |
| Why it fails | LLMs generate token by token without a world model. No backtracking. |
| Alternative | Symbolic planner (PDDL) or plan → executable code with assertions |
| References | Kambhampati 2024, Valmeekam 2023-2025, Stechly 2024 |

3. Multi-Agent Debate to Improve Accuracy

| Aspect | Detail |
| --- | --- |
| Failure rate | Improvement only if the solution is already memorized. Degradation otherwise. |
| Why it fails | Model homogeneity = same biases. Conformity bias. Echo chambers. |
| Alternative | Author/Critic architecture with an external verifier |
| References | Liang 2023, Du 2024, Schwartz 2024 |

4. Betting on Scaling to Solve Limitations

| Aspect | Detail |
| --- | --- |
| Reality | "Emergent capabilities" are artifacts of non-linear metrics |
| Why it fails | Scaling improves factual knowledge, not reasoning. Returns diminish even as costs grow exponentially. |
| Alternative | Invest in architecture (feedback loops, specialization) |
| References | Schaeffer 2023, Kaplan 2024, Jain 2024 |

5. Universal / Generalist Agent

| Aspect | Detail |
| --- | --- |
| Failure rate | Beaten by deterministic scripts on 95% of automation tasks |
| Why it fails | Impossible without domain specialization. Tool position bias. |
| Alternative | Strict specialization: 3-5 tools max per agent, narrow domain |
| References | Zhang 2025, Song 2024, Yadav 2024 |

6. “Emergent” Multi-Agent Coordination

| Aspect | Detail |
| --- | --- |
| Failure rate | 80% of agent exchanges are redundant. Chaos without a directing script. |
| Why it fails | No Theory of Mind. Ambiguous communication. State synchronization impossible. |
| Alternative | Explicit hierarchical orchestration. Manager (code) + Workers (LLMs). |
| References | Nguyen 2024, Zhang 2024, Li 2024 |

Decision Matrix

Checklist: Will Your Agent Work?

| Question | Yes → | No → |
| --- | --- | --- |
| Is there an external deterministic verifier? | ✅ Viable | ⚠️ Risky |
| Is the output constrained by a schema/format? | Favorable | Caution |
| Does the context fit in < 10 steps / 3 files? | Feasible | Fragile |
| Is orchestration coded (not in prompts)? | Robust | Unstable |
| Does each agent have ≤ 5 specialized tools? | Optimal | Overloaded |
| Is human supervision planned for > 15%? | Realistic | Over-promised |
| Is the pattern over-represented in training data? | Performant | Hallucinations |


Quick decision tree

- Does your task have an automatic verifier?
  - YES (code, SQL, JSON, math) → Automate aggressively: Generate → Verify → Fix. Ex: BMAD, GSD.
  - NO (strategy, creative, consulting) → Can you create one?
    - YES (metrics, A/B, rules) → Create the verifier, then automate.
    - NO (subjective judgment) → Agent ASSISTS, human DECIDES. Delayed feedback (weeks/months) makes iteration impossible; intensive human supervision is required.

7 Design Principles

1. Mandatory Closed Loop

Any agent that generates content must have an external verifier. If you can’t define an automatic test, reduce scope until you can.
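The contract can be written down once and reused. A sketch, where `Verifier` can be backed by a compiler, a schema, or a simulator; `generate` stands in for the agent call:

```python
from collections.abc import Callable
from typing import Protocol

class Verifier(Protocol):
    """One method: None means pass; a string is a precise, actionable error."""
    def verify(self, output: str) -> str | None: ...

def closed_loop(generate: Callable[[str], str], verifier: Verifier,
                max_iterations: int = 5) -> str:
    feedback = ""
    for _ in range(max_iterations):
        output = generate(feedback)
        error = verifier.verify(output)
        if error is None:
            return output            # the verifier, not the agent, declares success
        feedback = error             # precise signal for the next attempt
    raise RuntimeError("no valid output within budget")  # fail fast, fail loud
```

The compile/test loop of pattern 1 is just one instance of this contract.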

2. Strict Specialization

An effective agent does one thing well. Versatility is the enemy of reliability.

3. Deterministic Orchestration

Coordination logic must be in code, not prompts.

4. Minimal Context

The shorter the context, the more reliable the agent.

5. Integrated Human Supervision

Plan for > 15% human supervision. Invest in supervision rather than model scaling.

6. Fail Fast, Fail Loud

The agent must fail quickly and explicitly rather than silently produce wrong results.
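In code, that means an exception carrying the exact violations, never a silently degraded best-effort return. A minimal sketch (the exception type is illustrative):

```python
class AgentOutputError(Exception):
    """Carries the exact violations so failures are loud and diagnosable."""
    def __init__(self, violations: list[str]):
        super().__init__("; ".join(violations))
        self.violations = violations

def finalize(output: str, violations: list[str]) -> str:
    if violations:
        raise AgentOutputError(violations)  # never ship a degraded result silently
    return output
```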

7. Test in Real Conditions

Benchmarks lie. Only real deployment validates an agent.


Summary by Use Case

✅ VIABLE — Automate

| Use case | Recommended pattern |
| --- | --- |
| Code generation with tests | Compile/test loop (BMAD, GSD) |
| NL → SQL/DSL/Terraform | Output constraint + validation |
| PDF extraction → JSON | Strict schema + validation |
| RAG on a verified corpus | Pre-filtered sources + citations |
| UI automation (forms) | DOM tree + robust selectors |
| Data wrangling | Script generation + execution |
| Complete dev workflow | Coded orchestration + human-in-the-loop |

⚠️ CONDITIONAL — With precautions

| Use case | Recommended pattern |
| --- | --- |
| Presentation generation | Template + structured filling |
| Long-document analysis | Chunking + supervised aggregation |
| Complex bug resolution | Localized context + human-in-the-loop |
| Translation | Grammar validation + human review |

❌ NOT VIABLE — Rethink architecture

| Use case | Alternative |
| --- | --- |
| Long-term autonomous planning | Use a symbolic planner |
| Multi-step reasoning "from scratch" | Break down into verifiable steps |
| Self-correction without feedback | Add an external verifier |
| "Generalist" universal agent | Specialize by domain |
| Emergent multi-agent coordination | Explicit coded orchestration |
| Autonomous marketing plan | Agent generates variants, human chooses |
| Autonomous business strategy | Assistant with human validation |

Conclusion

Key Takeaways

AI agents are not “autonomous intelligences”. They are probabilistic pattern-matching systems that work remarkably well WHEN coupled with deterministic verifiers and constrained to a specialized domain.

The correct mental model

AGENT: candidate generator (what LLMs are very good at)
+ VERIFIER: valid-solution selector (indispensable)
+ ORCHESTRATOR: coded coordination logic (non-negotiable)
= RELIABLE SYSTEM

Final word

Build systems where the LLM generates and a verifier validates.

Any other architecture is, in 2025, an unkept promise.


