
Context engineering: why 600 skills make your agents less effective

Published at 11:15 PM

Executive Summary

Observation: Skill repositories and AGENTS.md files create the impression that adding more context is enough to make an agent better. Recent measurements tell a more nuanced story: context helps when it is targeted, but it can also make results more expensive and worse.

Thesis: The problem is not the isolated quality of skills. The problem is selection. A good skill is useful only if it provides information that is non-obvious, stable, and specific to the project. The precise need of the current task belongs in the prompt.

Key point: Context engineering is as much about removing as adding. The value comes from what you choose not to load.

Implication: Copying a repository with hundreds of skills, or automatically generating an AGENTS.md, often means delegating the one part where human judgment makes a measurable difference: deciding which context deserves to enter the window.

Glossary
Context engineering
: the discipline of designing, selecting, organizing, and testing the context provided to an AI agent.
Agent skill
: a package of instructions, examples, scripts, or resources that an agent can load to better execute a family of tasks.
AGENTS.md
: a context file placed in a repository to tell a coding agent about local conventions, commands, and constraints.
Context window
: the number of tokens the model can take into account in a single prompt or session.
Useless token
: information injected into context that adds no local constraint or knowledge the model could not infer. It is not neutral: it consumes attention budget, time, and money.
Lost in the Middle
: a documented effect where models use information placed in the middle of a long context less reliably than information placed at the beginning or the end.

A GitHub repository. 600 Markdown files. Each one written by someone competent.

Result: the agent is slower, more expensive, and sometimes gives worse answers than it would with five carefully chosen files.

If that does not annoy you, you probably have not yet understood what it implies about your own role.


The library without a librarian

600 books written by experts, stacked in a room without an index or shelves. You ask a brilliant assistant to find an answer. It searches, finds passages that look like the answer but are not, loses time, and ends up producing a worse result than if it had been given five books selected by hand.

That is exactly what happens with agent skill repositories in 2026.

SkillsBench shows that skills improve average performance, but not uniformly. Across 84 evaluated tasks, 49 improve, 25 stay unchanged, and 16 regress. In other words: in nearly 20% of tasks, adding skills lowers performance.

The Evaluating AGENTS.md study goes further: developer-provided files improve success by only about 4% on average, while LLM-generated files reduce it by about 3%. In both cases, context files increase exploration, testing, and reasoning, which pushes costs up by more than 20%.

You can therefore pay more for worse results.

The only really interesting exception is files written with surgical intent. SkillsBench observes that compact, targeted skills provide much more value than exhaustive documentation. The problem is never only the quality of the books. It is the absence of someone deciding which ones to open.


The attic and the circuit breaker

“Yes, but our models have 128K tokens of context. 200K. One million. We have room.”

That is like saying your attic is 200 square meters. If everything is loose and unorganized, good luck finding the circuit breaker when the power goes out.

Models work the same way. Their real capacity is well below the number on the spec sheet. Information in the middle of the context is partially ignored. Dependencies between distant pieces of information collapse.

Every useless token is active noise. It reduces the probability that the right token will be used.


The illusion of delegation

Faced with these findings, the reflex is to ask AI to solve the problem:

“Generate optimized skills for me.”
“Write a minimal AGENTS.md.”
“Analyze my repo and produce the ideal context.”

SkillsBench has already answered: self-generated skills provide no average gain. Zero. AI can produce well-structured text. It does not always know what should not exist.

The same applies when you copy a repository of 600 community skills. Each skill, taken in isolation, may be excellent. But the value was never in the skill itself. It is in the decision to know whether that rule deserves to exist as stable project context.

It is in everything you chose not to put into the context.

Copying skills means consuming someone else’s judgment without exercising your own. Having AI generate the context means delegating the one part of the process where human intelligence makes a measurable difference.


Your intelligence moved. Did you get the notice?

For decades, a developer’s value lived in the ability to produce: write code, design architectures, solve bugs. In 2026, frontier models do all of that. Not perfectly, but well enough that raw production is no longer the bottleneck.

The bottleneck is context.

Not more text. Better decisions.

That is context engineering. It is subtractive work as much as additive work. And it is work that only a human who understands the project, the domain, and the intent can do correctly.

This skill is not innate. It is built through iteration: write a skill, test the agent with and without it, measure the difference, prune.

Context engineering is not a talent. It is a practice.

The 600-skill repository is the perfect symptom of the old reflex: accumulate, document, cover every case. The developer instinct that says “at worst, it cannot hurt.”

Except the data says otherwise. It can hurt. Measurably. More than 20% extra cost in the AGENTS.md study. Nearly 20% of tasks degraded in SkillsBench. An average LLM-generated AGENTS.md that lowers performance.

Context engineering is the inversion of that instinct.

It is not:

what can I add?

It is:

what can I remove without losing signal?


In practice: the useless-token test

The theory is clear. The question that matters is practical: how do you avoid becoming the person who stacks 600 skills while thinking they are helping?

Before adding a line to your agent context, whether it is a skill, an AGENTS.md, or reference documentation, ask three questions.

1. Does the model already know this?

Standard Python conventions, REST patterns, TypeScript syntax: frontier models already know them. Writing them into your AGENTS.md is reciting the model’s own training back to it.

2. Is it specific to this project?

“Use explicit variable names” is not specific.

“Patient IDs use the NBI-XXXX-YY format and never UUIDs” is specific.

3. Is it a stable rule?

The precise scope of a task belongs in the prompt: “refactor this React component”, “prepare the Kubernetes deployment”, “fix this test”.

The skill should carry what remains true from one task to the next: local conventions, business invariants, repository commands, architectural decisions, non-obvious formats.

An instruction that is true only for today’s request has no place in a skill.

If the answer to any of these three questions is no, the token does not belong there.
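The three questions reduce to a single keep/drop rule. A minimal sketch, in TypeScript: the predicate below only makes the decision rule explicit; filling in the three fields is the human judgment this article is about, and the `CandidateLine` shape is an illustrative assumption, not an API.

```typescript
// A candidate line of agent context, annotated with the three questions.
interface CandidateLine {
  text: string;
  modelAlreadyKnowsIt: boolean;   // Q1: does the model already know this?
  specificToThisProject: boolean; // Q2: is it specific to this project?
  stableAcrossTasks: boolean;     // Q3: is it a stable rule?
}

// A line earns its tokens only if all three answers go the right way.
function belongsInSkill(line: CandidateLine): boolean {
  return (
    !line.modelAlreadyKnowsIt &&
    line.specificToThisProject &&
    line.stableAcrossTasks
  );
}
```

Under this rule, “Patient IDs use the NBI-XXXX-YY format and never UUIDs” passes; “Use explicit variable names” fails on both Q1 and Q2.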


An example beats a slogan

You are working on a Bluetooth-connected medical device: firmware, mobile app, stimulation protocol.

Your lead developer writes a skill to guide the agent’s code.

Before: a “senior” but generic skill

  • Bluetooth services must follow a clear and consistent naming convention throughout the project.
  • Commands sent to the device must return an explicit result type rather than ambiguous values or scattered exceptions.
  • Errors reported by the firmware must be normalized in a dedicated layer before being exposed to the rest of the application.
  • The device lifecycle must be represented by an explicit state machine, with authorized and testable transitions.

It sounds serious. It is well written.

The problem: each sentence says what to do without saying how this project does it.

A frontier model, given TypeScript and medical BLE code, would already apply these principles by default. Consistent naming, explicit result types, normalized errors, state machines: these are the generic good answers for the domain.

You just spent tokens restating what the model already knew. Everything is true. Nothing is useful.

After: targeted, situated skill

  • BLE services expose only functions prefixed with gvs_.
  • All stimulation commands return StimResult<T>.
  • Firmware errors are normalized as EREFError; they must never be thrown directly.
  • The authorized state machine is strictly: idle -> armed -> active -> cooldown -> idle.

Same dimensions: naming, return types, errors, state machine.

But every line contains a design decision the model could not guess.

That is pure signal: everything here is non-obvious.
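To see why these lines are pure signal, here is what the conventions look like once enforced in code. The names (`gvs_`, `StimResult<T>`, `EREFError`, the four states) come from the skill itself; the exact type shapes are illustrative assumptions, not the project’s real definitions.

```typescript
// Firmware errors are normalized, never thrown directly.
class EREFError {
  constructor(
    public readonly code: number,
    public readonly message: string,
  ) {}
}

// All stimulation commands return StimResult<T>: success or a normalized error.
type StimResult<T> =
  | { ok: true; value: T }
  | { ok: false; error: EREFError };

type DeviceState = "idle" | "armed" | "active" | "cooldown";

// The authorized state machine, strictly: idle -> armed -> active -> cooldown -> idle.
const NEXT_STATE: Record<DeviceState, DeviceState> = {
  idle: "armed",
  armed: "active",
  active: "cooldown",
  cooldown: "idle",
};

// BLE-facing functions are prefixed with gvs_.
function gvs_transition(from: DeviceState, to: DeviceState): StimResult<DeviceState> {
  if (NEXT_STATE[from] !== to) {
    return {
      ok: false,
      error: new EREFError(1, `illegal transition ${from} -> ${to}`),
    };
  }
  return { ok: true, value: to };
}
```

None of these decisions could be guessed from general TypeScript or BLE knowledge, which is exactly what makes them worth their tokens.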


The loading mechanism does not need to be magical

Often, it is enough to store skills where they apply:

stimbox-v4/
  packages/
    firmware/
      .claude/skills/
        eref-safety/SKILL.md
    webapp/
      .claude/skills/
        api-contracts/SKILL.md

The agent working in firmware/ loads skills from firmware/.claude/skills/.

The agent working in webapp/ loads its own.

No routing file, no config, no pseudo-intelligent selection logic. Just the project structure.

It is deterministic, not intelligent. That is exactly why it works.
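The whole resolution logic fits in a few lines. A sketch, assuming POSIX-style paths and the layout above: the skill directories an agent considers are simply the `.claude/skills` folders on the path from the working directory up to the repository root, innermost first. The function is illustrative, not any tool’s actual implementation.

```typescript
// Given a working directory, list candidate .claude/skills directories,
// from the most specific (closest to the working directory) outward.
function candidateSkillDirs(workingDir: string): string[] {
  const parts = workingDir.split("/").filter(Boolean);
  const dirs: string[] = [];
  for (let i = parts.length; i > 0; i--) {
    dirs.push([...parts.slice(0, i), ".claude", "skills"].join("/"));
  }
  return dirs;
}

// candidateSkillDirs("stimbox-v4/packages/firmware")
// -> ["stimbox-v4/packages/firmware/.claude/skills",
//     "stimbox-v4/packages/.claude/skills",
//     "stimbox-v4/.claude/skills"]
```

No model call, no scoring, no routing table: the answer is fully determined by where the agent is standing.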


Maintenance is non-negotiable

A skill that is not tested regularly against a real prompt can drift toward the negative effect measured by SkillsBench.

Agent context is a living product, not a deliverable you push and forget.

If you have read this far thinking “fine, I will write five good skills and move on”: bad news. Context engineering is not a deliverable. It is an ongoing discipline.

Golden standards, not opinions

You need a set of reference prompts, concrete cases from your project, with expected results.

Without that, you are optimizing blind.

When you modify a skill, replay it against those cases. The result improves, degrades, or stays unchanged. That is the only valid judge.

The ultimate test: diff with and without

Run the agent with your skills.

Run it without them.

If the diff is negligible, your skills add nothing. If the diff is negative, they do harm. The only case that justifies the token cost is a clearly positive diff.

Ten minutes. Not doing it means flying blind.
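The bookkeeping for that diff is trivial. A sketch, assuming you have already run each golden prompt twice and recorded pass/fail; the `GoldenResult` shape and the harness that actually invokes the agent are left to you.

```typescript
// One golden prompt, run with and without skills loaded.
interface GoldenResult {
  prompt: string;
  passedWithSkills: boolean;
  passedWithoutSkills: boolean;
}

// Positive: skills help. Zero: they add nothing. Negative: they do harm.
function skillDiff(results: GoldenResult[]): number {
  const withSkills = results.filter((r) => r.passedWithSkills).length;
  const without = results.filter((r) => r.passedWithoutSkills).length;
  return withSkills - without;
}
```

Only a clearly positive `skillDiff` justifies the token cost; anything else is an argument for pruning.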

Stop on errors

When the agent makes a mistake, the temptation is to add one more line to the skill.

That is the accumulation reflex.

The right reflex: understand why the error happened. Sometimes the fix is to remove context, not add more.

Changing models changes the results

A skill optimized for Claude does not produce the same results on GPT, nor on the next version of the same model.

What was non-obvious yesterday may become noise tomorrow.

This is full-time work. Someone must own agent context the way someone owns a codebase: test it, version it, trim it.

If nobody does, you will be back to the 600-skill repository within three months.


Final word

Prompt engineering promised: find the right wording, and the model does the rest.

That promise is dead.

Agent performance is decided in the architecture of context: what should be stabilized in a skill, what should remain in the prompt, and what should disappear.

And above all, what not to load.

The 600-skill repository is not necessarily a bad repository. It is a monument to an obsolete reflex. Each skill may be a good book. But the value was never in the books.

It is in the librarian who knows which one to hand you, and especially which ones to leave on the shelf.

If you are copying a skill repository or asking AI to generate your AGENTS.md, ask yourself one question:

are you building context, or consuming someone else’s?

Because the answer determines whether you are the context engineer, or the extra token.
