MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models

Hengzhi Li; Megan Tjandrasuwita; Yi R. Fung; Armando Solar-Lezama; Paul Pu Liang

TL;DR

MimeQA introduces a nonverbal social reasoning benchmark that uses mime videos to stress-test multimodal foundation models beyond language-centric evaluation. The dataset comprises 101 mime videos with 806 QA pairs across grounding, scene-level, and global-level reasoning, and it reveals that current VideoLLMs struggle with imagined-object grounding and nuanced social cues. Finetuning on MimeQA yields gains that transfer to other social tasks, while pose-based or language-context augmentation gives mixed results, underscoring grounding as a persistent challenge. The work highlights the need for true multimodal alignment and broader nonverbal understanding in future socially intelligent AI systems.

Abstract

As AI becomes more closely integrated with people's daily activities, socially intelligent AI that can understand and interact seamlessly with humans is increasingly important. However, current works in AI social reasoning all rely on language-only or language-dominant approaches to benchmark and train models, resulting in systems that are improving in verbal communication but struggle with nonverbal social understanding. To address this limitation, we tap into a novel data source rich in nonverbal social interactions: mime videos. Mime is the art of expression through gesture and movement without spoken words, which presents unique challenges and opportunities in interpreting nonverbal social communication. We contribute a new dataset called MimeQA, obtained by sourcing ~8 hours of video clips from YouTube and developing a comprehensive video question-answering benchmark comprising 806 carefully annotated and verified question-answer pairs, designed to probe nonverbal social reasoning capabilities. Using MimeQA, we evaluate state-of-the-art video large language models (VideoLLMs) and find that they achieve low accuracy, generally ranging from 20-30%, while humans score 86%. Our analysis reveals that VideoLLMs often fail to ground imagined objects and over-rely on the text prompt while ignoring subtle nonverbal interactions. We hope to inspire future work on AI models that embody true social intelligence capable of interpreting nonverbal human interactions.
