Leveraging LLMs with Iterative Loop Structure for Enhanced Social Intelligence in Video Question Answering

Erika Mori; Yue Qiu; Hirokatsu Kataoka; Yoshimitsu Aoki

TL;DR

The paper addresses the need for socially intelligent AI capable of interpreting multimodal social cues in videos. It introduces Looped Video Debating (LVD), an iterative framework that combines large language models with a multimodal VQA module: the LLM judges whether a question is answerable and, when it is not, retrieves additional information from the video, improving transparency and reliability without fine-tuning. Evaluated on Social-IQ 2.0, LVD achieves state-of-the-art accuracy, and the authors further augment the dataset with Social-IQ Sub annotations to compare human and AI reasoning on rationale and information needs. The work demonstrates that iterative information retrieval guided by LLMs can enhance video-based social understanding and highlights directions for incorporating audio and video-language models. This approach has implications for deploying more interpretable and capable socially aware AI in caregiving, education, and human–robot interaction.

Abstract

Social intelligence, the ability to interpret emotions, intentions, and behaviors, is essential for effective communication and adaptive responses. As robots and AI systems become more prevalent in caregiving, healthcare, and education, the demand for AI that can interact naturally with humans grows. However, creating AI that seamlessly integrates multiple modalities, such as vision and speech, remains a challenge. Current video-based methods for social intelligence rely on general video recognition or emotion recognition techniques and often overlook the unique elements inherent in human interactions. To address this, we propose the Looped Video Debating (LVD) framework, which integrates Large Language Models (LLMs) with visual information, such as facial expressions and body movements, to enhance the transparency and reliability of question-answering tasks involving human interaction videos. Our results on the Social-IQ 2.0 benchmark show that LVD achieves state-of-the-art performance without fine-tuning. Furthermore, supplementary human annotations on existing datasets provide insights into the model's accuracy, guiding future improvements in AI-driven social intelligence.
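
The looped structure summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: the function `looped_answer`, the `judge` and `vqa` callables, the `max_rounds` limit, and the toy stand-ins are all hypothetical names chosen for illustration. The sketch only assumes what the summary states, namely that an LLM decides whether the question is answerable and otherwise requests more visual information from a VQA module.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical judge output: (answer or None, follow-up query or None).
Judgement = Tuple[Optional[str], Optional[str]]


def looped_answer(
    question: str,
    options: List[str],
    judge: Callable[[str, List[str], List[str], bool], Judgement],
    vqa: Callable[[str], str],
    max_rounds: int = 3,  # assumed cap; the source does not state one
) -> Tuple[str, List[str]]:
    """Ask `judge` for an answer; while it abstains, gather evidence via `vqa`."""
    evidence: List[str] = []
    for _ in range(max_rounds):
        answer, follow_up = judge(question, options, evidence, False)
        if answer is not None:
            return answer, evidence
        # Not answerable yet: query the visual module and keep its reply.
        evidence.append(vqa(follow_up or question))
    # Round budget exhausted: ask the judge to commit to an answer.
    answer, _ = judge(question, options, evidence, True)
    return (answer if answer is not None else options[0]), evidence


# Toy stand-ins for the LLM and the VQA module (assumptions, no real models).
def toy_judge(question, options, evidence, force):
    if force or any("smiling" in e for e in evidence):
        return options[1], None
    return None, "What is the woman's facial expression?"


def toy_vqa(query):
    return "The woman is smiling."


answer, evidence = looped_answer(
    "How does the woman feel?", ["upset", "pleased"], toy_judge, toy_vqa
)
```

The evidence list returned alongside the answer records what was retrieved before the judge committed, which is one simple way to expose the kind of transparency the abstract refers to.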
