HiddenBench Exposes the Blind Spots in Multi-Agent Language Models
HiddenBench reveals that multi-agent systems struggle with information asymmetry, achieving just 30.1% accuracy, far below single-agent performance.
Multi-agent systems built on large language models (LLMs) are often expected to improve decision-making by pooling distributed information. However, a recent benchmark called HiddenBench casts doubt on this assumption. Grounded in the Hidden Profile paradigm, HiddenBench tests collective reasoning when each agent holds only part of the relevant information, and it exposes a significant gap between multi-agent and single-agent performance.
HiddenBench Unveils the Numbers
The findings are stark. Multi-agent LLMs achieve a mere 30.1% accuracy when the information is distributed across agents. Single agents given the complete information achieve 80.7%, a gap of roughly 50 percentage points. The gap is attributed to a systematic failure mode: multi-agent systems struggle to recognize or act on latent information asymmetry. They fail to consider what others might know but haven't yet shared, so they converge prematurely on shared evidence while critical distributed facts remain unexplored.
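To make the failure mode concrete, here is a minimal toy sketch of a hidden-profile setup. It is not taken from HiddenBench itself: the options, the facts, the weights, and the function names are all hypothetical, chosen only to show how every agent's partial view can favor one answer while the pooled information favors another.

```python
from collections import Counter

# Hypothetical toy task (not from HiddenBench): each fact is (option, weight).
# Shared facts are visible to every agent and mostly favor option "A".
SHARED_FACTS = [("A", 1), ("A", 1), ("A", 1), ("B", 1)]

# Each agent also holds one private fact that favors option "B".
UNIQUE_FACTS = {
    "agent_1": [("B", 1)],
    "agent_2": [("B", 1)],
    "agent_3": [("B", 1)],
}


def best_option(facts):
    """Return the option with the highest total weight (ties broken by name)."""
    scores = Counter()
    for option, weight in facts:
        scores[option] += weight
    return min(scores, key=lambda option: (-scores[option], option))


def individual_choices():
    """What each agent would pick from the shared facts plus its own private fact."""
    return {
        agent: best_option(SHARED_FACTS + own_facts)
        for agent, own_facts in UNIQUE_FACTS.items()
    }


def pooled_choice():
    """What the group would pick if every private fact were actually shared."""
    pooled = list(SHARED_FACTS)
    for own_facts in UNIQUE_FACTS.values():
        pooled.extend(own_facts)
    return best_option(pooled)


if __name__ == "__main__":
    print("Individual views:", individual_choices())
    print("Pooled view:", pooled_choice())
```

In this toy setup every agent sees a 3-to-2 lead for "A", so a group that only discusses shared evidence agrees on "A", while pooling all three private facts gives "B" a 4-to-3 lead. That is the kind of premature convergence the benchmark is described as measuring.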
Scaling Challenges
Interestingly, these failures persist across various prompting strategies, communication depths, and group sizes, worsening as groups scale. While some models like Gemini-2.5-Flash/Pro outperform others, neither the scale of the model nor individual reasoning accuracy reliably predicts collective performance. This raises a key question: Are bigger models truly better in collaborative setups?
Actionable Insights
Despite the bleak outlook, the study offers a silver lining. A lightweight structured communication protocol has been shown to substantially improve collective reasoning across model families. This finding not only highlights a key limitation in current multi-agent LLMs but also provides a theory-grounded framework for diagnosing collective reasoning failures.
The AI-AI Venn diagram is getting thicker. As we build more complex agentic systems, understanding the limitations of multi-agent LLMs becomes imperative. If agents have wallets, who holds the keys to effective communication? These models' inability to handle distributed information reflects a broader challenge in AI development. It's not just about making models bigger and faster; it's about ensuring they can collaborate and reason effectively under real-world constraints.
Key Terms Explained
Benchmark: A standardized test used to measure and compare AI model performance.
Gemini: Google's flagship multimodal AI model family, developed by Google DeepMind.
Prompt: The text input you give to an AI model to direct its behavior.
Reasoning: The ability of AI models to draw conclusions, solve problems logically, and work through multi-step challenges.