Here are the 5 most relevant AI papers from arXiv for week 38 of 2025, each with key insights and an assessment of potential impact.
Publications at a Glance
Rationality Check! Benchmarking the Rationality of Large Language Models
Difficulty-Aware Agent Orchestration in LLM-Powered Workflows Jinwei Su, Yinghui Xia, Qizhen Lan, Xinyuan Song, Yang Jingsong, Lewei He, Tianyu Shi | 9/14/2025
Tractable Asymmetric Verification for Large Language Models via Deterministic Replicability Zan-Kai Chong, Hiroyuki Ohsaki, Bryan Ng | 9/14/2025
H²R: Hierarchical Hindsight Reflection for Multi-Task LLM Agents Shicheng Ye, Chao Yu, Kaiqiang Ke, Chengdong Xu, Yinqi Wei | 9/16/2025
Sparse Neurons Carry Strong Signals of Question Ambiguity in LLMs Zhuoxuan Zhang, Jinhao Duan, Edward Kim, Kaidi Xu | 9/17/2025
Rationality Check! Benchmarking the Rationality of Large Language Models
Key Insights
The Rationality Check benchmark evaluates 12 dimensions of rationality (e.g., temporal consistency, preference transitivity, contextual invariance) across 8 major LLMs. The study finds that even the strongest models (GPT-4, Claude-3) reach only 67% of human-level rationality, with systematic failures in preference consistency (45% accuracy) and resistance to framing biases (38% accuracy). The framework measures rationality objectively through economic choice and probabilistic judgment tasks.
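To make the preference-transitivity dimension concrete, here is a minimal sketch of how intransitive choice triples could be counted from a model's pairwise judgments. The data format and function names are illustrative assumptions, not taken from the benchmark itself.

```python
from itertools import combinations

def pref(prefs, a, b):
    """Preferred option between a and b, whichever key order was stored."""
    return prefs.get((a, b)) or prefs.get((b, a))

def transitivity_violations(prefs):
    """Count intransitive triples (e.g. a > b and b > c, yet c > a)."""
    items = sorted({x for pair in prefs for x in pair})
    violations = 0
    for a, b, c in combinations(items, 3):
        ab, bc, ca = pref(prefs, a, b), pref(prefs, b, c), pref(prefs, c, a)
        if None in (ab, bc, ca):
            continue  # this triple was not fully compared
        # In a cycle, each option wins exactly one of its two duels,
        # so the set of winners covers all three options.
        if {ab, bc, ca} == {a, b, c}:
            violations += 1
    return violations

# The model prefers tea > coffee and coffee > juice, but juice > tea:
judgments = {("tea", "coffee"): "tea",
             ("coffee", "juice"): "coffee",
             ("juice", "tea"): "juice"}
print(transitivity_violations(judgments))  # 1 intransitive triple
```

A perfectly rational chooser would score zero violations over any set of pairwise judgments; the 45% preference-consistency figure above suggests current models fall well short of that.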
Potential Impact
This benchmark could become a critical standard for LLM evaluation in financial, medical, and legal applications where rationality is essential. The results indicate that current LLMs are not ready for autonomous critical decisions and still require human supervision. The methodology could also influence AI regulation by establishing minimum rationality thresholds for deployment in sensitive domains.
Difficulty-Aware Agent Orchestration in LLM-Powered Workflows
Key Insights
DAAO (Difficulty-Aware Agent Orchestration) uses a variational autoencoder (VAE) to estimate task complexity in real time, combined with a router that allocates resources according to the detected difficulty. The architecture comprises 3 levels: simple tasks (1 agent, 1x cost), medium (2-3 agents, 2.5x cost), and complex (5+ agents, 4x cost). The system reduces costs by 40% while improving accuracy by 23% through optimal allocation of computational resources.
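As a rough illustration of the tiered routing logic, the sketch below maps a scalar difficulty estimate to one of the three levels. The thresholds and the `vae.reconstruct` interface are assumptions made for illustration; DAAO's actual estimator and router are learned components and more involved.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    num_agents: int
    cost_multiplier: float

# Thresholds are illustrative guesses, not values from the paper.
TIERS = [(0.33, Tier("simple", 1, 1.0)),
         (0.66, Tier("medium", 3, 2.5)),
         (1.01, Tier("complex", 5, 4.0))]

def route(difficulty: float) -> Tier:
    """Map a difficulty score in [0, 1] to an agent tier."""
    for threshold, tier in TIERS:
        if difficulty < threshold:
            return tier
    return TIERS[-1][1]

def estimate_difficulty(embedding: np.ndarray, vae) -> float:
    """Proxy difficulty: normalized VAE reconstruction error.
    `vae.reconstruct` is a hypothetical interface, not DAAO's API."""
    err = float(((embedding - vae.reconstruct(embedding)) ** 2).mean())
    return min(err, 1.0)

for d in (0.1, 0.5, 0.9):
    print(d, route(d).name)  # simple, medium, complex
```

The cost savings come from the fact that most real queries land in the cheap tiers, so expensive multi-agent orchestration is reserved for the queries that actually need it.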
Potential Impact
DAAO changes the economics of LLM applications by enabling billing based on actual task complexity rather than fixed rates. Companies can cut their inference costs by 40-60% while guaranteeing performance adapted to each query. This approach could become a standard for LLM-as-a-Service platforms, enabling intelligent scaling and a better user experience.
Tractable Asymmetric Verification for Large Language Models via Deterministic Replicability
Key Insights
The asymmetric verification framework uses cryptographic signatures and deterministic replicability proofs to validate LLM outputs with 1000x lower computational cost than full execution. The approach generates "verification fingerprints" that enable modification detection with 99.7% accuracy. The system implements a distributed consensus protocol that validates results in O(log n) instead of O(n) for multi-agent systems.
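The core asymmetry can be sketched as follows: a prover commits to deterministic outputs via hashes, and a verifier spot-checks a small random sample by re-running generation with greedy decoding and a fixed seed. The fingerprint construction and the interfaces below are simplified assumptions, not the paper's exact protocol.

```python
import hashlib
import random

def fingerprint(prompt: str, output: str, seed: int) -> str:
    """Deterministic fingerprint of a (prompt, output) pair.
    Plain SHA-256 here; the paper's construction may differ."""
    return hashlib.sha256(f"{seed}|{prompt}|{output}".encode()).hexdigest()

def spot_check(records, replicate, sample_size=4, seed=0):
    """Verify a batch by re-running only a small random sample.

    `records` holds (prompt, claimed_output, claimed_fp) tuples, and
    `replicate(prompt, seed)` re-runs generation with greedy decoding
    and a fixed seed so outputs are bit-for-bit reproducible. Checking
    a sample costs far less than recomputing everything, which is the
    verification asymmetry the paper targets.
    """
    rng = random.Random(seed)
    for prompt, claimed_output, claimed_fp in rng.sample(
            records, min(sample_size, len(records))):
        if fingerprint(prompt, claimed_output, seed) != claimed_fp:
            return False  # fingerprint does not match the claimed output
        if replicate(prompt, seed) != claimed_output:
            return False  # output is not replicable under the fixed seed
    return True

# Toy demo with a stand-in "model" that is trivially deterministic.
fake_model = lambda prompt, seed: prompt.upper()
recs = [(p, fake_model(p, 0), fingerprint(p, fake_model(p, 0), 0))
        for p in ("alpha", "beta", "gamma")]
print(spot_check(recs, fake_model, sample_size=2))  # True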
Potential Impact
This technology could transform the security of LLM systems in production, which is particularly crucial for financial and medical applications where response integrity is critical. Companies can audit their LLM systems in real time with minimal overhead, reducing the risk of manipulated or corrupted responses. Approaches of this kind could become mandatory for the certification of critical AI systems.
H²R: Hierarchical Hindsight Reflection for Multi-Task LLM Agents
Key Insights
H²R (Hierarchical Hindsight Reflection) implements a 3-level memory architecture: episodic memory (raw experiences), semantic memory (extracted patterns), and metacognitive memory (solution strategies). The algorithm uses a "reflection distillation" mechanism that compresses experiences into reusable rules, reducing memory size by 85% while preserving 92% of useful information. The system improves performance by 34% on new tasks through hierarchical knowledge transfer.
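A minimal sketch of the three-level memory is shown below, assuming a generic `llm(prompt) -> text` callable; the class and method names are illustrative rather than the authors' API.

```python
from dataclasses import dataclass, field

@dataclass
class HierarchicalMemory:
    """Three-level memory in the spirit of H²R's hierarchy.
    Class and method names are illustrative, not the authors' API."""
    episodic: list = field(default_factory=list)       # raw experiences
    semantic: list = field(default_factory=list)       # extracted patterns
    metacognitive: list = field(default_factory=list)  # solution strategies

    def record(self, trajectory: str) -> None:
        self.episodic.append(trajectory)

    def distill(self, llm, batch: int = 8) -> None:
        """Compress a batch of raw episodes into one reusable rule
        ("reflection distillation"), then drop the raw episodes."""
        if len(self.episodic) < batch:
            return
        episodes, self.episodic = self.episodic[:batch], self.episodic[batch:]
        self.semantic.append(
            llm("Summarize these episodes into one reusable rule:\n"
                + "\n".join(episodes)))

    def reflect(self, llm) -> None:
        """Lift recent semantic patterns into a high-level strategy."""
        if self.semantic:
            self.metacognitive.append(
                llm("From these patterns, state a general strategy:\n"
                    + "\n".join(self.semantic[-8:])))

# `llm` is any prompt -> text callable; a stub stands in here.
mem = HierarchicalMemory()
for i in range(8):
    mem.record(f"episode {i}: tool A failed, tool B succeeded")
mem.distill(lambda p: "Rule: try tool B before tool A.")
print(mem.semantic)
```

The reported 85% memory reduction corresponds to this distill-then-discard pattern: only the compressed rules and strategies are retained, not the raw trajectories.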
Potential Impact
H²R transforms LLM agents into truly adaptive systems capable of learning from their mistakes and efficiently transferring knowledge between domains. This technology is crucial for intelligent personal assistants and recommendation systems that must adapt to user preferences. Companies can deploy more autonomous agents that continuously improve without human intervention, reducing maintenance costs by 50%.
Sparse Neurons Carry Strong Signals of Question Ambiguity in LLMs
Key Insights
The study identifies 0.3% of LLM neurons as "Ambiguity Encoding Neurons" (AENs) that activate specifically for ambiguous questions with 94% accuracy. These neurons concentrate in layers 8-12 and show 8x stronger activation for questions with multiple interpretations. The analysis reveals that AENs encode 3 types of ambiguity: semantic (67%), contextual (23%), and pragmatic (10%), enabling fine-grained uncertainty detection.
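One way such neurons could be read out in practice is to inspect hidden states at a candidate layer and average the activations of the identified neuron indices at the last token. The model, layer, and neuron indices below are placeholders for illustration; the paper identifies its own sparse set per model, and its detector may be more elaborate than this linear read-out.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder choices, not the paper's: a small public model, one
# mid-depth layer, and a handful of arbitrary neuron indices.
MODEL = "gpt2"
LAYER, NEURONS = 8, [17, 302, 511]

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

@torch.no_grad()
def ambiguity_score(question: str) -> float:
    """Mean activation of candidate ambiguity neurons at the last token."""
    ids = tok(question, return_tensors="pt")
    hidden = model(**ids).hidden_states[LAYER]  # (1, seq_len, d_model)
    return hidden[0, -1, NEURONS].mean().item()

print(ambiguity_score("Where is the bank?"))  # lexically ambiguous
print(ambiguity_score("What is 2 plus 2?"))   # unambiguous
```

In a deployed system, a threshold on such a score could gate a clarification request before the model commits to an answer.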
Potential Impact
This discovery enables the creation of "ambiguity-aware" LLM systems that automatically detect ambiguous questions and request clarification, markedly improving the user experience. Medical and legal applications can benefit from this capability to avoid critical misunderstandings. The technique could reduce communication errors by 60% in chatbots and virtual assistants.