
LLMs Put to the Test - Knowledge, Unintentional Cheating, and Abstraction


LLMs Facing the Tests

Executive Summary: AI Benchmarks vs Real Intelligence

Main Finding: Results across different benchmarks reveal two types of capability in models: one where they excel, another where they falter.

We have built extraordinary memory amplifiers already indispensable for real work. Aggregate performance indices accurately reflect this practical utility. For abstract thinking in uncharted territory, AI remains far behind humans. The future requires combining memory with robust generalization—a leap that hasn’t yet occurred.

Glossary: Understanding Technical Terms
Benchmark (reference test)
Imagine a standardized exam like the SAT or TOEFL, but for AI. It’s a set of identical questions or tasks for all models, allowing objective comparison. Examples: MMLU tests general knowledge across 57 subjects, AIME poses competition-level math problems, LiveCodeBench requires writing code that actually works.
Contamination (of training data)
This is when a student has already seen the exam questions before taking it. For AI, this happens because they’re trained on gigantic amounts of web text—and benchmark test questions are often found there, discussed in forums, articles, and code notebooks. Result: the model can “recognize” a question rather than truly solve it. This isn’t intentional cheating, but an inevitable consequence of their learning method.
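To see what hunting for contamination actually looks like, here is a minimal sketch in Python of an n-gram overlap check, the family of techniques evaluators use to flag items a model may already have seen. The corpus, the question, and the 8-token window are invented for illustration; real pipelines work at web scale with proper tokenization and indexing.

```python
# Minimal sketch of an n-gram contamination check.
# Everything here (corpus, question, 8-token window) is an invented toy example.

def ngrams(text: str, n: int = 8) -> set:
    """All n-token windows of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(benchmark_item: str, training_docs: list, n: int = 8) -> bool:
    """Flag an item if any n-token window also appears verbatim in the training data."""
    item_grams = ngrams(benchmark_item, n)
    return any(item_grams & ngrams(doc, n) for doc in training_docs)

# A forum post quoting the test question verbatim triggers the flag.
corpus = [
    "someone posted this exact exam question on a forum: "
    "what is the capital of the smallest EU member state by area"
]
question = "what is the capital of the smallest EU member state by area"
print(is_contaminated(question, corpus))  # True
```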
Fluid intelligence
A term from cognitive psychology. It’s your ability to solve a completely new problem you’ve never encountered, without relying on your knowledge. For example: understanding the rules of a new game by watching three rounds, or solving a novel logic puzzle. It’s the opposite of “crystallized” intelligence (your accumulated knowledge). Children excel at fluid intelligence—that’s why they learn so quickly—but this capacity tends to decline with age, while crystallized intelligence continues to grow. Paradoxically, our current AIs resemble expert adults more: lots of crystallized knowledge, little fluid intelligence.
Heuristic (rule of thumb)
A mental shortcut that often works but not always. Example in a multiple-choice test: “the longest answer is often correct” or “if two answers contradict each other, the true one is probably one of them.” AIs develop sophisticated heuristics from being exposed to millions of examples.
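As a toy example, here is that "longest answer" shortcut written as code; the question and options are invented, and the point is only that a heuristic with zero understanding can still beat random guessing on some test sets.

```python
# Toy illustration of the "pick the longest answer" shortcut.
# The question and options are invented; no understanding is involved.

def longest_answer_heuristic(options: dict) -> str:
    """Return the key of the longest option, a classic test-taking shortcut."""
    return max(options, key=lambda k: len(options[k]))

options = {
    "A": "Paris",
    "B": "Paris, the capital of France, located on the Seine in the north of the country",
    "C": "Lyon",
    "D": "Marseille",
}
print(longest_answer_heuristic(options))  # "B" -- which happens to be right
```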
Out-of-distribution reasoning
“Distribution” here refers to the set of examples seen during training. “Out-of-distribution” therefore means: completely different from everything the model has encountered. It’s like asking someone who only learned addition and subtraction to perform integration: they have no foothold. ARC-AGI 2 precisely tests this capability, and current AIs largely fail at it.
Reasoning model (extended reasoning)
A new generation of AI (like OpenAI’s o3 or certain Grok 4 configurations) that doesn’t respond immediately. Instead, the model “thinks” for several seconds or minutes, breaking down the problem step by step, as we would on scratch paper. This improves results on complex problems but remains resource-intensive.
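Schematically, the control flow looks like the sketch below; `call_model` is a hypothetical stand-in for any LLM API, and the whole thing is a simplified assumption rather than any vendor's actual design.

```python
# Schematic sketch of an extended-reasoning loop: the model keeps appending
# intermediate steps to a scratchpad until it commits to a final answer or
# runs out of budget. `call_model` is a hypothetical stand-in for an LLM API;
# nothing here reproduces any vendor's actual implementation.

def call_model(scratchpad: str) -> str:
    # Placeholder: a real system would send the scratchpad to a reasoning model.
    return "FINAL: 42" if "step 3" in scratchpad else "an intermediate deduction"

def solve_with_extended_reasoning(problem: str, max_steps: int = 10) -> str:
    scratchpad = f"Problem: {problem}\n"
    for step in range(1, max_steps + 1):
        scratchpad += f"step {step}: "
        thought = call_model(scratchpad)
        scratchpad += thought + "\n"
        if thought.startswith("FINAL:"):          # the model commits to an answer
            return thought.removeprefix("FINAL:").strip()
    return "no answer within budget"

print(solve_with_extended_reasoning("toy problem"))  # "42", after three steps
```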

We live in a strange era. Artificial intelligences pass medical exams, solve competition-worthy math problems, generate functional code in seconds. Yet, faced with an abstract puzzle a child would solve intuitively, they collapse. This contradiction says something profound about what we’re truly measuring when evaluating machines and what we call “intelligence.”

For the general public and professionals alike, platforms like Artificial Analysis have become the reference. Their Intelligence Index (AI²) aggregates performance across test batteries (knowledge, math, code, instruction following, long context) and ranks Claude, GPT, Grok, Gemini, or Llama in neat leaderboards. Useful for choosing a model, yes. But these scores mainly capture the ability to mobilize vast memory, not necessarily the faculty to think in the unknown.
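For intuition, an aggregate index of this kind boils down to a weighted average over per-benchmark scores, as in the sketch below. The benchmark names, scores, and equal weights are invented; Artificial Analysis's real methodology is not reproduced here.

```python
# Sketch of an aggregate index as a weighted average of benchmark scores.
# The benchmark names, scores, and equal weights are invented for illustration;
# this is not Artificial Analysis's actual methodology.

def aggregate_index(scores: dict, weights: dict) -> float:
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

scores = {"knowledge": 88.0, "math": 92.0, "code": 75.0, "long_context": 70.0}
weights = {"knowledge": 1.0, "math": 1.0, "code": 1.0, "long_context": 1.0}
print(round(aggregate_index(scores, weights), 1))  # 81.2 on this toy data
```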

The Illusion of Grand Exams

Since GPT-3, “benchmark culture” has taken hold. MMLU tests knowledge across 57 disciplines; GPQA Diamond pushes scientific expertise to PhD level; AIME and GSM8K stress-test mathematical reasoning chains; LiveCodeBench and SciCode judge code through unit tests. Result: spectacular numbers. Claude 4.5 approaches top scores on GPQA Diamond; Grok 4 shows visible progress on reasoning benchmarks; “reasoning” models like o3 demonstrate that extended reflection helps.

But what do these victories measure? Essentially, two things: the breadth of knowledge a model has absorbed, and its skill at applying it.

A very good score therefore indicates that a model knows how to apply what it has absorbed, not that it knows how to invent a new rule.

Contamination: Original Sin… and Hidden Virtue

Contamination (test items already seen during training) is often accused of skewing evaluations. This is true for many public tests, discussed in thousands of notebooks and papers. But this “contamination” isn’t a bug: it’s the primary learning mode of LLMs. They become useful because they’ve read everything.

And this is precisely what makes them brilliant at programming. A developer doesn’t invent a parser every morning: they reuse patterns, APIs, snippets. LLMs do the same, in turbo mode. Pattern recognition, adapting existing solutions, respecting conventions: that’s why they debug, refactor, write tests, and save real time. Same for scientific literature: “having read everything” allows connecting ideas that an isolated human would take weeks to bring together.

The problem isn’t contamination; it’s the confusion between this skill (super useful) and abstract reasoning. The latter consists of discovering a rule from a few examples and generalizing beyond anything seen during training. There, memory is no longer enough.

ARC-AGI 2: The Test You Can’t Cram For

This is the entire spirit of ARC, conceived by François Chollet: small visual puzzles where you must induce a hidden rule from a few examples and apply it to a new case. The first version (public) ended up “known” to models. Then came ARC-AGI 2: novel puzzles, procedurally generated, closed evaluation. Impossible to cram.
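To make "inducing a hidden rule" concrete, here is a toy sketch of the exercise: from a few input/output grid pairs, find a transformation consistent with all of them, then apply it to a fresh grid. The candidate transformations and grids are invented, and real ARC tasks are vastly richer than this tiny search space.

```python
# Toy sketch of ARC-style rule induction: pick the transformation consistent
# with every example pair, then apply it to a new grid. The candidate set and
# grids are invented; real ARC tasks cannot be solved by so small a search.

CANDIDATES = {
    "identity":        lambda g: g,
    "flip_horizontal": lambda g: [row[::-1] for row in g],
    "flip_vertical":   lambda g: g[::-1],
    "transpose":       lambda g: [list(col) for col in zip(*g)],
}

def induce_rule(examples):
    """Return the name of the first candidate matching every (input, output) pair."""
    for name, fn in CANDIDATES.items():
        if all(fn(inp) == out for inp, out in examples):
            return name
    return None

examples = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[4, 5], [6, 7]], [[5, 4], [7, 6]]),
]
rule = induce_rule(examples)             # "flip_horizontal"
print(rule, CANDIDATES[rule]([[8, 9]]))  # flip_horizontal [[9, 8]]
```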

The verdict is clear. Humans: ~85–90%. GPT-4-like: ~0–3%. Even high-end “reasoning” models (o3, Grok 4 with heavy configurations) remain weak in absolute terms. In short, as soon as reasoning out of distribution is required, machines stumble.

A Glimmer of Progress

It would be an exaggeration to say AIs aren’t advancing at all in reasoning. Models like OpenAI’s o3 or Grok 4 “Heavy” now achieve between 10 and 16% on ARC-AGI 2, where GPT-4-like models remained stuck at 0–3%. This is still far from the 85–90% human performance, but it’s a notable improvement. It doesn’t reflect emergent fluid intelligence, but it shows that by extending reasoning loops or orchestrating multi-step agents, we can simulate abstraction slightly better.

This doesn’t change the conclusion: AIs remain primarily memory amplifiers, but the first stones of more robust reasoning are appearing.

The Useful Paradox

There are thus two intelligences to distinguish. Practical intelligence, which quickly and effectively mobilizes a large stock of knowledge and recipes: this is where LLMs excel, and what indices like Artificial Analysis capture. And abstract intelligence, which invents rules in the unknown: this is what ARC-AGI 2 measures, where models remain, for now, very far behind us.

The future? Probably a reconciliation: keeping memory (indispensable in real use), adding robust generalization. The first building blocks exist (extended reasoning, agents, neuro-symbolic hybrids), but the qualitative leap hasn’t yet occurred.

Conclusion

We have built extraordinary memory amplifiers. They are already indispensable—especially for code, synthesis, documentary research—and aggregate rankings like the Intelligence Index accurately reflect this utility. But for abstract thinking in virgin territory, the mirror of ARC-AGI 2 reflects a more humble image. This is less a disappointment than a compass: knowing where AIs are strong, where they are not, and what to build next.

Appendix – Benchmarks and Scores 2024–2025

Note that most of these figures were reported by the publishing companies themselves, so they should be read with caution.

MMLU / MMLU-Pro

Multiple-choice questions covering 57 academic disciplines (sciences, law, history, etc.). Measures encyclopedic breadth and application of learned concepts.
| Model / Humans | Score |
| --- | --- |
| Humans (PhD, experts) | ~89.8% |
| GPT-4.1 | 90.2% |
| GPT-4o | 88.7% |
| GPT-4o mini | 82.0% |
| Claude 3.5 Sonnet | ~88% |
| Grok-1.5 | 81.3% |
| o3 / o4 (OpenAI) | ~85% |

AIME / GSM8K

Competition-level math problems (AIME = advanced high school, GSM8K = middle school level). Measure multi-step reasoning capability.

| Model / Humans | Score |
| --- | --- |
| Humans (competitive students) | highly variable |
| GPT-4o mini (MGSM) | 87.0% |
| Claude Sonnet 4.5 | ~100% (AIME 2025, with Python) |
| Grok 4 | 90–95% |
| Grok 4 Heavy (AIME 2025) | 100% |
| o3 (OpenAI) | 88.9% |
| o4-mini (OpenAI) | 92.7% (without tools) / 99.5% (with Python) |

GPQA Diamond

PhD-level scientific questions, designed to be difficult to Google.

| Model / Humans | Score |
| --- | --- |
| Humans (PhD experts) | 69.7% |
| GPT-4o | 53.6% |
| GPT-4.1 nano | 50.3% |
| Claude Sonnet 4.5 | 83.4% |
| Grok 4 | 87.5% |
| Grok 4 Heavy | 88.9% |
| o3 (OpenAI) | 87.7% |

LiveCodeBench / SWE-bench Verified

Code generation benchmarks automatically validated by unit tests.

| Model / Humans | Score |
| --- | --- |
| Humans (experienced developers) | >90% |
| GPT-4.1 | 54.6% |
| o3 (OpenAI) | 69.1% |
| Claude Sonnet 4.5 | 77.2% (82% with parallelization) |
| Grok 4 | 75% |
| Grok 4 (with Python) | 79.3% |

ARC-AGI 1

Visual abstraction puzzles (public set). Exposed to contamination, so performance should be interpreted with caution.

| Model / Humans | Score |
| --- | --- |
| Humans (average) | 64.2% |
| Humans (best solvers) | ~85–100% |
| GPT-4 | 7% |
| GPT-4o | ~4–5% (pure) / ~50% (with Python program generation) |
| Claude 4.5 | ~20–40% (est.) |
| Grok 4 | 66.7% |
| o3 (OpenAI) | 75.7% (normal budget) / 87.5% (high compute) |

ARC-AGI 2

Novel visual puzzles, procedurally generated and closed. Designed to resist contamination.

| Model / Humans | Score |
| --- | --- |
| Humans (live study) | ~60% (average), 85–90% (expert solvers) |
| GPT-4-like | 0–3% |
| Claude Opus 4 | 8.6% |
| Grok 4 | 15.9% |
| Grok 4 Heavy | 16.2% |
| o3 (OpenAI) | 6–7% |
| GPT-5 (Medium, preview) | ~7.5% |

ARC-AGI 3

Under construction; details at https://arcprize.org/arc-agi/3

Primary Sources

OpenAI - GPT and o3 Models

Anthropic - Claude 4 and 4.5

xAI - Grok 4

ARC Prize - ARC-AGI Benchmarks

Artificial Analysis - Intelligence Index

SWE-bench

GPQA Diamond


