MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models

Hengzhi Li, Megan Tjandrasuwita, Yi R. Fung, Armando Solar-Lezama, Paul Pu Liang

TL;DR

MimeQA introduces a nonverbal social reasoning benchmark using mime videos to stress-test multimodal foundation models beyond language-centric evaluation. The dataset comprises 101 mime videos with 806 QA pairs across grounding, scene-level, and global-level reasoning, revealing that current VideoLLMs struggle with imagined-object grounding and nuanced social cues. Finetuning on MimeQA yields transferable gains to other social tasks, while pose-based or language-context augmentation yields mixed results, underscoring grounding as a persistent challenge. The work highlights the need for true multimodal alignment and broader nonverbal understanding in future socially intelligent AI systems.

Abstract

As AI becomes more closely integrated with people's daily activities, socially intelligent AI that can understand and interact seamlessly with humans in their daily lives is increasingly important. However, current works in AI social reasoning all rely on language-only or language-dominant approaches to benchmarking and training models, resulting in systems that are improving in verbal communication but struggle with nonverbal social understanding. To address this limitation, we tap into a novel data source rich in nonverbal social interactions: mime videos. Mime is the art of expression through gesture and movement without spoken words, which presents unique challenges and opportunities in interpreting nonverbal social communication. We contribute a new dataset called MimeQA, obtained by sourcing ~8 hours of video clips from YouTube and developing a comprehensive video question-answering benchmark comprising 806 carefully annotated and verified question-answer pairs, designed to probe nonverbal social reasoning capabilities. Using MimeQA, we evaluate state-of-the-art video large language models (VideoLLMs) and find that they achieve low accuracy, generally ranging from 20-30%, while humans score 86%. Our analysis reveals that VideoLLMs often fail to ground imagined objects and over-rely on the text prompt while ignoring subtle nonverbal interactions. We hope to inspire future work on AI models that embody true social intelligence and can interpret nonverbal human interactions.

Paper Structure

This paper contains 38 sections, 11 figures, 8 tables.

Figures (11)

  • Figure 1: MimeQA is a new benchmark testing nonverbal social reasoning in multimodal large language models, with 101 videos of mimes (the art of expression through gesture without spoken words), and 806 question-answer pairs at three levels: 1) grounding the imagined object or activity, 2) scene-level understanding, and 3) global-level questions on holistic social comprehension. Most models achieve only 20-30% accuracy.
  • Figure 2: Examples of MimeQA question types. Left: Grounding-the-imagined questions involve recognizing the activity or pretend object that the mime is acting out. Top right: Scene-level questions include temporal reasoning about a localized sequence of events, affect recognition questions about the emotional state of the characters, and intention and behavior questions that require interpreting the goals and motivations within a scene. Bottom right: Global-level questions involve working memory questions that probe understanding of the plot beyond localized sequences, social judgment questions about how the characters' actions adhere to cultural and social norms, and theory of mind questions about the characters' beliefs, desires, and motivations.
  • Figure 3: Dataset construction pipeline: 1) Collecting videos from YouTube with various search terms that are summarized by the word cloud. 2) Annotating approximately 6 grounding and scene-level questions and 4 global-level questions per video, removing 120 videos in the process. 3) Verifying the annotated questions and answers, with 97.58% verifier agreement.
  • Figure 4: MimeQA dataset statistics. Distribution of video lengths shows the range of short to long timescales. The distribution of the number of questions per video shows that each video is densely annotated, and the distribution of the number of questions per category is balanced.
  • Figure 5: Transfer analysis between MimeQA and Social-IQ 2.0. Models fine-tuned on MimeQA consistently generalize well to Social-IQ 2.0, while training on Social-IQ 2.0 yields little to no gains on MimeQA. This highlights the distinct nonverbal social reasoning required in MimeQA, which is transferable to other tasks.
  • ...and 6 more figures