Omnimodal Language Models: Perception Versus Action

Omnimodal large language models (LLMs) are touted as the next big leap in AI. These models can process video, audio, and text simultaneously, promising a seamless integration of perception and action. However, recent research points to a significant flaw in that integration: a perception-action gap that shows up when a model confronts a discrepancy between what it perceives and what the prompt asks it to believe.

Introducing the IMAVB Benchmark

The study introduces IMAVB, a 500-clip benchmark derived from long-form movies. It follows a 2x2 design, crossing two target modalities (vision and audio) with two premise conditions (standard and misleading). The benchmark matters because it lets researchers isolate and measure conflict detection, a fundamental aspect of true comprehension, separately from general multimodal understanding.
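To make the 2x2 design concrete, here is a minimal Python sketch that enumerates the four cells and tallies items per cell. The `Item` record, its field names, and the clip identifiers are hypothetical; the source does not describe the benchmark's file format or how the 500 clips are distributed across cells.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import product

# The two factors named in the article.
MODALITIES = ("vision", "audio")
PREMISES = ("standard", "misleading")


@dataclass(frozen=True)
class Item:
    """Hypothetical benchmark record; field names are assumptions."""
    clip_id: str
    modality: str
    premise: str


def cells():
    """Return the four (modality, premise) cells of the 2x2 design."""
    return list(product(MODALITIES, PREMISES))


def count_by_cell(items):
    """Count items per cell, reporting 0 for cells with no items."""
    counts = Counter((it.modality, it.premise) for it in items)
    return {cell: counts.get(cell, 0) for cell in cells()}


# Toy data for illustration only, not taken from the benchmark.
toy_items = [
    Item("clip-a", "vision", "standard"),
    Item("clip-a", "vision", "misleading"),
    Item("clip-b", "audio", "misleading"),
]
cell_counts = count_by_cell(toy_items)
```

Comparing a model's behavior on the standard and misleading cells within one modality is what allows conflict detection to be scored separately from general comprehension.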

Using this benchmark, the researchers evaluated eight open-source omnimodal LLMs and Gemini 3.1 Pro. The findings were striking. Although the models' hidden states reliably encoded mismatches between perception and premise, their outputs often failed to reject the false claims. This discrepancy highlights an important weakness in the translation from perception to action.
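One way to picture this finding is to count the misleading-premise items where a probe on the hidden states flags a mismatch but the generated answer still accepts the premise. The sketch below is an illustration under assumptions: the input format and the function name are hypothetical, and the study's actual measurement may differ.

```python
def detected_but_not_rejected(items):
    """Fraction of misleading-premise items where the internal signal
    and the output disagree.

    items: list of (probe_flags_mismatch, output_rejects) boolean pairs,
           one pair per misleading-premise item (hypothetical format).
    """
    if not items:
        return 0.0
    missed = sum(1 for flagged, rejected in items if flagged and not rejected)
    return missed / len(items)


# Toy data for illustration only.
toy_pairs = [
    (True, False),   # mismatch encoded, answer accepts the false premise
    (True, True),    # mismatch encoded, answer rejects it
    (False, False),  # mismatch not encoded, answer accepts
    (True, False),   # mismatch encoded, answer accepts
]
gap_rate = detected_but_not_rejected(toy_pairs)
```

A high value on real data would indicate that the information needed to reject the premise is present internally but does not reach the output.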

The Representation-Action Gap

The Representation-Action Gap, as the study terms it, manifests in two failure modes. The first is under-rejection, where models answer misleading questions as if the false premise were true. The second is over-rejection, where models reject too readily, including standard questions whose premises are correct, which lowers comprehension accuracy. Notably, the data shows an asymmetry between modalities: audio grounding consistently underperforms vision.
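The two failure modes can be expressed as a pair of rates computed per premise condition. The following Python sketch assumes each evaluated item has been reduced to a premise label and a flag saying whether the model rejected the premise; these definitions are a plausible reading of the article, not the study's official metrics.

```python
def rejection_failure_rates(records):
    """Compute under- and over-rejection rates.

    records: iterable of (premise, rejected) pairs, where premise is
             "standard" or "misleading" and rejected is a bool
             (hypothetical format).

    under_rejection: share of misleading items the model did NOT reject.
    over_rejection:  share of standard items the model DID reject.
    """
    totals = {"standard": 0, "misleading": 0}
    rejected_counts = {"standard": 0, "misleading": 0}
    for premise, rejected in records:
        totals[premise] += 1
        if rejected:
            rejected_counts[premise] += 1

    def rate(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "under_rejection": rate(
            totals["misleading"] - rejected_counts["misleading"],
            totals["misleading"],
        ),
        "over_rejection": rate(rejected_counts["standard"], totals["standard"]),
    }


# Toy data for illustration only.
toy_records = [
    ("misleading", False),
    ("misleading", True),
    ("misleading", False),
    ("misleading", True),
    ("standard", True),
    ("standard", False),
    ("standard", False),
    ("standard", False),
]
rates = rejection_failure_rates(toy_records)
```

Running the same computation separately on vision-targeted and audio-targeted items is one way to surface a modality asymmetry like the one the article describes.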

Why does this gap matter? Models that can't appropriately discern and act on conflicting information pose significant risks. In real-world applications, from automated customer support to advanced robotics, such failures could lead to costly errors or even safety hazards.

Addressing the Bottleneck

The research doesn't stop at identifying the problem. An initial diagnostic intervention, probe-guided logit adjustment (PGLA), shows promise. By re-injecting the encoded mismatch signal into decoding, PGLA consistently improves the models' rejection behavior. This suggests the bottleneck lies less in perception than in translating what the model perceives into what it outputs.
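The article does not give PGLA's exact formulation, so the sketch below shows only the general idea under stated assumptions: a linear probe reads a mismatch probability from a hidden state, and that probability is added, scaled by a strength `alpha`, to the logits of tokens that begin a rejection. The probe form, the additive rule, `alpha`, and the notion of `reject_ids` are all hypothetical choices made for illustration.

```python
import math


def probe_probability(hidden, weights, bias):
    """Linear probe with a sigmoid: estimated probability that the
    hidden state encodes a perception-premise mismatch (assumed form)."""
    z = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def adjust_logits(logits, reject_ids, p_mismatch, alpha=4.0):
    """Return a copy of logits with rejection-token logits boosted
    in proportion to the probe's mismatch probability (assumed rule)."""
    adjusted = list(logits)
    for token_id in reject_ids:
        adjusted[token_id] += alpha * p_mismatch
    return adjusted


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy example: token 2 stands in for a rejection token.
toy_logits = [1.0, 2.0, 0.0]
before = argmax(toy_logits)
after = argmax(adjust_logits(toy_logits, reject_ids=[2], p_mismatch=1.0))
```

In this toy case the highest-scoring token moves from index 1 to the rejection token at index 2 once the probe signal is injected, while a probe probability near zero would leave the original ranking unchanged.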

Western coverage has largely overlooked this nuanced yet critical aspect of omnimodal AI development. As these models become more integrated into everyday applications, understanding and addressing these gaps will be imperative. The benchmark results speak for themselves. It’s not just about processing inputs. It’s about making the right decisions based on them. Isn't that what true intelligence should be about?