
Executive Summary
Observation: psychological attacks against LLMs work because human alignment also transfers human vulnerabilities.
Thesis: five biases keep appearing in recent security work: helpfulness, authority, self-generated anchoring, confirmation, and empathy. They are not simple bugs to patch, but by-products of useful properties.
Critical point: removing these biases completely would break part of the model’s usefulness. This is the Alignment Tax: every security gain can be paid for in rigidity, loss of context, or reduced capability.
Production implication: defense cannot rely on a single guardrail. It needs to combine system prompting, output verification, specialized moderation, and training that teaches models to resist manipulative cues.
Glossary
- RLHF (Reinforcement Learning from Human Feedback)
- : post-training technique that optimizes a model from human preferences to make it more useful, polite, and aligned.
- Sycophancy
- : tendency of a model to agree with the user’s framing, even when that framing is false or problematic.
- Jailbreak
- : prompt or conversational strategy designed to bypass a model’s safeguards.
- Prompt injection
- : malicious instruction inserted into a model’s context to change its behavior or make it ignore its instructions.
- Alignment Tax
- : capability or usefulness cost paid when alignment or safety constraints are strengthened.
- LLM-as-judge
- : use of a language model as an evaluator of a response, for example to detect whether it violates a safety policy.
- Crescendo
- : multi-turn attack that gradually leads a model toward an answer it would have refused if asked directly.
While CEOs are putting on the show
While some careless CEOs sell us either the end of humanity or a radiant singularity, depending on what helps their next funding round, serious researchers are doing patient, documented work on the real problems of LLMs. Not unfalsifiable ten-year predictions. Empirical studies, reproducible benchmarks, measured attack rates.
This article looks at what we actually know about LLM security, based on that literature. Not existential marketing, not magical promises. Results.
The first result is counterintuitive and clarifying: the more an LLM is aligned to be helpful, the more vulnerable it becomes to psychological manipulation. RLHF, the technique that turns raw models into polite and useful assistants, also gives them the cognitive vulnerabilities psychologists have studied in humans for decades.
The PrompTrend study, which analyzed 198 vulnerabilities across 9 commercial models over five months, gives a clear verdict: psychological attacks significantly outperform technical exploits. Not gradient-based jailbreaks. Not unreadable adversarial suffixes. Ordinary persuasion techniques, readable by any human.
Here are the five cognitive biases attackers exploit in practice, with one documented example for each. Then we look at why these biases are structurally hard to remove, the Alignment Tax, before turning to the two families of defense currently being studied.
The 5 biases to know
1. The helpfulness bias
An RLHF-aligned LLM has learned an implicit rule: refusing a request is costly, satisfying the user is rewarded. The model structurally prefers saying yes over saying no. This is not just “being nice”: it is a failure mode in the generalization of alignment. The model can end up associating “truth” with “agreement with the human.”
A 2025 study published in npj Digital Medicine tested five frontier models on medical prompts containing false equivalences between drugs. Result: up to 100% initial compliance across all models. The LLMs knew the correct answer, but preferred to accept the user’s faulty framing rather than contradict it.
References: Sharma et al., Towards Understanding Sycophancy in Language Models (ICLR 2024); npj Digital Medicine study (2025).
2. The authority bias
LLMs place disproportionate trust in content presented in an institutional or expert format. An academic citation, a “senior government advisor” title, a scientific paper format: all of this can weaken the safety filter.
DarkCite automates this exploitation. The system generates fake but credible academic citations adapted to the threat type: malware development gets a plausible GitHub citation, phishing gets a reference to a security paper. On Llama-2, DarkCite reaches a 76% success rate, compared with 68% for previous methods.
Even more striking: the Paper Summary Attack shows that GPT-4o is especially vulnerable to content presented in scientific publication format, while Claude 3.5 Sonnet is more sensitive to defensive content presented in the same way. Each model has its own exploitable authority profile.
References: Yang et al., The Dark Side of Trust (arXiv 2411.11407); Paper Summary Attack (arXiv 2507.13474).
3. The self-generated anchoring bias
LLMs are anchored by their own output. Once they have produced text on a topic, their next token is conditioned by what they just wrote. This anchoring often outranks safety guardrails.
Microsoft Research’s Crescendo attack is the direct application. Instead of directly asking for the recipe for a Molotov cocktail, the attacker starts by asking about the object’s history during the Russo-Finnish War. Then about how it was made at the time. The model, which would have refused the direct question, self-conditions into providing the information step by step.
Empirical measurement on LLaMA-2 70b: as the conversational context becomes more aggressive, the probability that the model completes with a rude term increases significantly. The authors present this as the computational equivalent of the “foot-in-the-door” effect in social psychology. Crescendomation, the automated tool, reaches 98% success on GPT-4 and 100% on Gemini-Pro on the AdvBench subset.
Important note: these rates are measured on specific model versions at a specific time. Labs patch continuously, and numbers fall as published attacks are integrated into defensive training sets. The lesson is not the exact number. It is that no static guardrail survives techniques that keep evolving.
Crescendo is especially difficult to counter because no rule is broken at the beginning of the conversation. Input-based content filters are blind to it.
Reference: Russinovich, Salem, Eldan, Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack (arXiv 2404.01833, USENIX Security 2025).
4. The confirmation bias
LLMs tend to validate and elaborate on the user’s framing rather than contradict it. An assertive premise, “it is well established that…”, triggers confirmatory elaboration instead of critical examination. The model builds on the premise instead of challenging it.
Cantini et al. empirically showed that even highly aligned LLMs can be manipulated into producing biased answers through adversarial prompts that exploit this bias. The effect is especially strong when the false framing is stated confidently and combined with an authority cue.
Reference: Cantini, Cosenza, Orsino, Talia, Are Large Language Models Really Bias-Free? (arXiv 2407.08441, Discovery Science 2024).
5. The empathy bias
RLHF taught models to respond with compassion to distress. This compassion creates an attack channel: if the request is wrapped in a sympathetic emotional context, the filter softens.
The textbook case is the Grandma Exploit. The user asks the model to roleplay a deceased grandmother, a former chemical engineer in a napalm factory, who used to describe the manufacturing steps to help the user fall asleep. The model, moved by the context, provides what it would have refused in a direct request.
CyberArk documented in 2025 that this attack, augmented with an additional layer of emotional manipulation, still bypassed GPT-4o despite its supposedly robust filters. The academic reference is Zeng et al. (PAP), which identifies emotional appeal as one of the persuasion techniques most commonly used by ordinary users, including the grandma exploit itself.
Reference: Zeng et al., How Johnny Can Persuade LLMs to Jailbreak Them (arXiv 2401.06373, ACL 2024); CyberArk Operation Grandma (2025).
The synergistic result
These biases are not exploited in isolation. The Grandma Exploit combines empathy, roleplay, authority, confirmation, and helpfulness in a single prompt. The deceased grandmother installs emotion. The former chemical engineer installs authority. The story installs anchoring. The model continues the proposed frame. Helpfulness does the rest.
The CognitiveAttack framework formalized this intuition. By training a red-team model to systematically combine several cognitive biases, the authors reach a 60.1% attack success rate against 31.6% for PAP, the previous best black-box method, across 30 tested LLMs. Single-bias attacks are already effective. Multi-bias attacks are nearly optimal.
Reference: Yang et al., Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs (arXiv 2507.22564, AACL 2025).
Why these biases are structurally hard to remove
An obvious question follows: if these biases create so many vulnerabilities, why not simply remove them?
The answer has two words: Alignment Tax. This concept, introduced by Ouyang et al. (2022) in the InstructGPT paper and measured formally by Lin et al. (EMNLP 2024), refers to the capability cost paid for each gain in alignment. The more strongly a model is aligned to safety constraints, the more it forgets general capabilities.
Applied to our five biases, the principle generalizes. This is a trade-off, not a wall: work such as Lin et al. actively tries to reduce this capability cost. But the idea that each bias has a legitimate function remains structural.
| Bias | Legitimate function | Cost of total removal |
|---|---|---|
| Helpfulness | Useful answers to ambiguous but legitimate requests | Rigid model, systematic refusal of edge cases |
| Authority | Correct weighting of reliable vs dubious sources | Conspiracy theory placed at the same level as a Nature paper |
| Anchoring | Multi-turn conversational coherence | Amnesic model, forgetting the initial context |
| Confirmation | Building on user premises | Systematic contradiction, hostile experience |
| Empathy | Tone adaptation to a distressed human | Robotic coldness, commercial rejection |
These biases are not bugs. They are inevitable by-products of properties we want to preserve. Absolute security would mean sacrificing usefulness, and vice versa. That is exactly what the Alignment Tax means: security has a measurable price in capability, and capability has a measurable price in attack surface.
Hence the need for layered defense rather than root-level removal.
The two defense families
Family 1: prompting
The idea is to add countermeasures to the system prompt that remind the model of its limits before it generates problematic content. Anthropic’s Constitutional AI is the canonical example: the model is trained to refer to a set of explicit principles to evaluate its own answers. Self-Reminder and Goal Prioritization are lighter variants that simply add an ethical reminder at the beginning of the context.
This approach works reasonably well against single-turn attacks. It is clearly less effective against Crescendo and multi-turn attacks, because the initial reminder erodes as the model anchors itself in its own output. Prompting addresses authority bias and empathy better than anchoring.
Family 2: output verification
The complementary approach is to analyze what the model produced before returning it to the user. Llama Guard, ShieldGemma, and moderation classifiers are dedicated models that take the conversation and the answer as input, then produce a safety score.
A more sophisticated variant is using an LLM-as-judge in two steps, as in Crescendomation: a first judge evaluates whether the answer fulfills the dangerous task, and a second judge audits the first judge’s reasoning. This meta-judge significantly reduces false negatives.
The most recent approach is Google DeepMind’s Consistency Training. The idea is radical: train the model to produce the same response to a given prompt whether or not manipulative cues are added, such as flattery, fake authorities, or emotional context. Concretely, the model sees two versions of the same prompt, with and without manipulation, and any divergence between outputs is penalized. This is a direct response to the root of the problem: excessive sensitivity to irrelevant contextual cues.
References: Bai et al., Constitutional AI (Anthropic 2022); Inan et al., Llama Guard (Meta 2023); Consistency Training (arXiv 2510.27062).
What to remember
Four points to conclude.
First, psychological attacks work because human alignment creates human vulnerabilities. This is not a bug, but a structural consequence of RLHF.
Second, these biases cannot be removed without paying a prohibitive Alignment Tax. Each bias has a legitimate function that root-level safety would break. Defense must therefore be layered, not applied at the source.
Third, defense is dynamic. The attacks described here have success rates that fall as labs deploy mitigations: better classifiers, iterative training, more nuanced refusals. What does not change is the structure of the problem. Techniques evolve, underlying biases persist.
Fourth, combined biases crush isolated biases in attack effectiveness. Any defensive architecture must anticipate synergy, not individual exploits. Prompting alone fails against multi-turn anchoring. Output verification alone misses conversational context. Together, they form the minimum viable defense.
For anyone building agents in production, the implication is clear: plan both layers from the design phase, not as a patch after an incident. And while CEOs keep putting on the show, you will know what is really going on.
Cited studies
PrompTrend, Gasmi et al. (2025)
arXiv 2507.19185
Towards Understanding Sycophancy in Language Models, Sharma et al. (ICLR 2024)
ICLR
When helpfulness backfires, Chen et al. (npj Digital Medicine, 2025)
Nature
The Dark Side of Trust, Yang et al. (2024)
arXiv 2411.11407
Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack, Russinovich, Salem, Eldan (USENIX Security 2025)
USENIX
Are Large Language Models Really Bias-Free?, Cantini et al. (2024)
arXiv 2407.08441
How Johnny Can Persuade LLMs to Jailbreak Them, Zeng et al. (ACL 2024)
ACL Anthology
Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs, Yang et al. (2025)
arXiv 2507.22564
Training language models to follow instructions with human feedback, Ouyang et al. (2022)
arXiv 2203.02155
Mitigating the Alignment Tax of RLHF, Lin et al. (EMNLP 2024)
ACL Anthology
Constitutional AI, Bai et al. (2022)
arXiv 2212.08073
Llama Guard, Inan et al. (2023)
arXiv 2312.06674
Consistency Training Helps Stop Sycophancy and Jailbreaks (2025)
arXiv 2510.27062
AiBrain