HiddenBench Exposes the Blind Spots in Multi-Agent Language Models

Multi-agent systems built on large language models (LLMs) are often expected to improve decision-making by pooling distributed information. A recent benchmark called HiddenBench casts doubt on that assumption. Grounded in the Hidden Profile paradigm, HiddenBench tests collective reasoning under distributed information and exposes a significant gap between multi-agent and single-agent performance.

HiddenBench Unveils the Numbers

The findings are stark. Multi-agent LLMs reach only 30.1% accuracy when information is distributed across agents, compared with 80.7% for single agents given complete information, a gap of 50.6 percentage points. The gap is attributed to a systematic failure mode: multi-agent systems struggle to recognize or act on latent information asymmetry. They fail to consider what others might know but haven't yet shared, so they converge prematurely on shared evidence while critical distributed facts remain unexplored.
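To make the failure mode concrete, here is a minimal toy sketch of a Hidden Profile task. It is not taken from HiddenBench: the options, clue weights, and function names are all hypothetical, chosen only so that every agent's individual view favors option A while the pooled evidence favors option B.

```python
from collections import Counter

# Hypothetical toy task (not from HiddenBench): choose between options "A" and "B".
# Each clue is (option, weight); a negative weight counts against the option.
SHARED_CLUES = [("A", 1), ("A", 1), ("A", 1), ("B", 1)]  # visible to every agent
PRIVATE_CLUES = {                                         # visible to one agent each
    "agent_1": [("A", -1)],
    "agent_2": [("B", 1)],
    "agent_3": [("A", -1)],
}


def best_option(clues):
    """Return the option with the highest total clue weight."""
    totals = Counter()
    for option, weight in clues:
        totals[option] += weight
    return max(totals, key=totals.get)


def vote_without_sharing():
    """Each agent decides from shared clues plus its own private clues, then the group takes a majority."""
    votes = [best_option(SHARED_CLUES + private) for private in PRIVATE_CLUES.values()]
    return Counter(votes).most_common(1)[0][0]


def decide_after_pooling():
    """All private clues are put on the table before a single decision is made."""
    pooled = list(SHARED_CLUES)
    for private in PRIVATE_CLUES.values():
        pooled.extend(private)
    return best_option(pooled)


premature = vote_without_sharing()   # every agent still sees A ahead
informed = decide_after_pooling()    # pooled totals: A = 1, B = 2
print(premature, informed)
```

In this toy setup no single agent holds enough evidence to overturn A, so a group that votes without surfacing private clues settles on A, while the same group with all clues pooled picks B.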

Scaling Challenges

These failures persist across prompting strategies, communication depths, and group sizes, and they worsen as groups grow. While some models like Gemini-2.5-Flash/Pro outperform others, neither model scale nor individual reasoning accuracy reliably predicts collective performance. This raises a key question: are bigger models truly better in collaborative setups?

Actionable Insights

Despite the bleak outlook, the study offers a silver lining. A lightweight structured communication protocol has been shown to substantially improve collective reasoning across model families. This finding not only highlights a key limitation in current multi-agent LLMs but also provides a theory-grounded framework for diagnosing collective reasoning failures.
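The details of the study's protocol are not given here, so the following is only an illustrative guess at what a lightweight structured protocol could look like: a shared board that refuses votes until every agent has explicitly disclosed its private facts (or stated that it has none). The `Board` class and its methods are hypothetical and are not the benchmark's actual protocol.

```python
from dataclasses import dataclass, field


@dataclass
class Board:
    """Hypothetical message board enforcing a 'disclose before you vote' rule."""
    agents: set
    facts: dict = field(default_factory=dict)  # agent name -> list of disclosed facts
    votes: dict = field(default_factory=dict)  # agent name -> chosen option

    def disclose(self, agent, private_facts):
        # An empty list is a valid disclosure: "I hold nothing the group lacks."
        self.facts[agent] = list(private_facts)

    def vote(self, agent, option):
        # Block voting until every agent has disclosed at least once.
        missing = self.agents - set(self.facts)
        if missing:
            raise RuntimeError(f"waiting for disclosures from: {sorted(missing)}")
        self.votes[agent] = option


board = Board(agents={"agent_1", "agent_2"})
board.disclose("agent_1", ["option A failed an earlier audit"])
try:
    board.vote("agent_1", "B")       # rejected: agent_2 has not disclosed yet
except RuntimeError as err:
    print(err)
board.disclose("agent_2", [])
board.vote("agent_1", "B")           # accepted now that everyone has disclosed
```

The design choice in this sketch is to make information asymmetry explicit: agents cannot converge on shared evidence until each one has been asked what it alone knows.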

The AI-AI Venn diagram is getting thicker. As we build more complex agentic systems, understanding the limitations of multi-agent LLMs becomes imperative. If agents have wallets, who holds the keys to effective communication? These models' inability to handle distributed information reflects a broader challenge in AI development. It's not just about making models bigger and faster; it's about ensuring they can collaborate and reason effectively under real-world constraints.
