Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions

Caixin Kang, Yifei Huang, Liangyang Ouyang, Mingfang Zhang, Yoichi Sato

TL;DR

This work formalizes the Multimodal Interactive Veracity Assessment (MIVA) task to evaluate how well Multimodal Large Language Models (MLLMs) ground veracity judgments in dynamic, multi-party social interactions. It introduces a dataset derived from the game Werewolf, with synchronized video and text and verifiable ground-truth labels, built with a semi-automated, LLM-assisted annotation pipeline. A comprehensive benchmark across multiple state-of-the-art MLLMs reveals substantial gaps in grounding language in visual social cues, in Theory of Mind reasoning, and in high-stakes decision making, even for models like GPT-4o. The findings motivate future work on context-aware alignment, architectures capable of Theory of Mind reasoning, and stronger grounding of language in non-verbal cues, with the aim of trustworthy, perceptive AI.

Abstract

As AI systems become increasingly integrated into human lives, endowing them with robust social intelligence has emerged as a critical frontier. A key aspect of this intelligence is discerning truth from deception, a ubiquitous element of human interaction that is conveyed through a complex interplay of verbal language and non-verbal visual cues. However, automatic deception detection in dynamic, multi-party conversations remains a significant challenge. The recent rise of powerful Multimodal Large Language Models (MLLMs), with their impressive abilities in visual and textual understanding, makes them natural candidates for this task. Yet their capabilities in this crucial domain remain largely unquantified. To address this gap, we introduce a new task, Multimodal Interactive Veracity Assessment (MIVA), and present a novel multimodal dataset derived from the social deduction game Werewolf. This dataset provides synchronized video and text, with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating state-of-the-art MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to ground language in visual social cues effectively and may be overly conservative in their alignment, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems.

Paper Structure

This paper contains 25 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: The MIVA Task Annotation Process. Starting with existing data and game metadata, we manually annotated the "night actions." An automated, LLM-assisted pipeline then created a new multimodal MIVA dataset from the Werewolf game.
  • Figure 2: Overview of the semi-automated annotation prompt. The LLM is tasked to act as an expert analyst, following a strict workflow to produce verifiable veracity labels and explanations. The full prompt is in Appendix A.1.
  • Figure 3: Summary of the MLLM evaluation prompt. The model receives comprehensive context and is tasked with a hierarchical analysis of both persuasive strategy and veracity, with a required structured JSON output.
  • Figure 4: Radar Chart of Models’ Accuracy in the MIVA task across persuasive strategy categories in two datasets.