Leveraging LLMs with Iterative Loop Structure for Enhanced Social Intelligence in Video Question Answering

Erika Mori, Yue Qiu, Hirokatsu Kataoka, Yoshimitsu Aoki

TL;DR

The paper addresses the need for socially intelligent AI capable of interpreting multimodal social cues in videos. It introduces Looped Video Debating (LVD), a looped framework that combines large language models with a multimodal VQA module to determine answerability and retrieve additional information when needed, improving transparency and reliability without fine-tuning. Evaluated on Social-IQ 2.0, LVD achieves state-of-the-art accuracy, and the authors further augment the dataset with Social-IQ Sub annotations to compare human and AI reasoning on rationale and information needs. The work demonstrates that iterative information retrieval guided by LLMs can significantly enhance video-based social understanding and highlights directions for incorporating audio and video-language models. This approach has implications for deploying more interpretable and capable socially aware AI in caregiving, education, and human–robot interaction.

Abstract

Social intelligence, the ability to interpret emotions, intentions, and behaviors, is essential for effective communication and adaptive responses. As robots and AI systems become more prevalent in caregiving, healthcare, and education, the demand for AI that can interact naturally with humans grows. However, creating AI that seamlessly integrates multiple modalities, such as vision and speech, remains a challenge. Current video-based methods for social intelligence rely on general video recognition or emotion recognition techniques and often overlook the unique elements inherent in human interactions. To address this, we propose the Looped Video Debating (LVD) framework, which integrates Large Language Models (LLMs) with visual information, such as facial expressions and body movements, to enhance the transparency and reliability of question-answering tasks involving human interaction videos. Our results on the Social-IQ 2.0 benchmark show that LVD achieves state-of-the-art performance without fine-tuning. Furthermore, supplementary human annotations on existing datasets provide insights into the model's accuracy, guiding future improvements in AI-driven social intelligence.

Paper Structure

This paper contains 14 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Two samples from the Social-IQ 2.0 dataset. Each video, depicting human interactions, is accompanied by approximately six questions that require advanced reasoning. For each question, one correct answer (green) and three incorrect answers (red) are provided.
  • Figure 2: Proposed method (LVD). In this method, the model first determines whether the question is answerable based on 10 images (or captions, in the case of GPT-4 and Llama) and the dialogue information (blue). If the question is considered answerable, the option inferred to be correct is output (the green arrow). If the question is deemed unanswerable, a loop structure is employed to obtain additional information (red dashed arrows). This additional information is then added to the original input, and the QA process is repeated (red solid arrows).
  • Figure 3: Result example. In the "Question and options:" section, red options indicate incorrect answers, while green options indicate correct ones. In this case, the first attempt resulted in "Unanswerable", but the correct answer was later produced using additional information from the VQA model. Under "First attempt:", the response, the additional information requested by the LLM, and the corresponding video timestamps are recorded. Under "Acquired additional information:", the frame retrieved based on the predicted timestamps and the visual details obtained by the VQA model from that frame are documented.
  • Figure 4: Comparison of rationale for answers between humans and LLMs.
  • Figure 5: Comparison of additional information required by humans and LLMs.
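
The loop described in the captions of Figures 2 and 3 can be sketched roughly as follows. This is only an illustrative outline under stated assumptions, not the authors' implementation: the names `looped_qa`, `LLMReply`, `ask_llm`, and `ask_vqa` are hypothetical, the `max_loops` cap is an assumption (the captions do not state a limit), and the two stub functions stand in for real LLM and VQA calls so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class LLMReply:
    """One LLM turn: either an answer or a request for more visual detail."""
    answer: Optional[int] = None   # index of the chosen option; None means "Unanswerable"
    info_request: str = ""         # what the LLM wants to know (hypothetical field)
    timestamp: float = 0.0         # where in the video to look, in seconds


def looped_qa(
    question: str,
    options: List[str],
    context: List[str],
    ask_llm: Callable[[str, List[str], List[str]], LLMReply],
    ask_vqa: Callable[[float, str], str],
    max_loops: int = 3,            # assumed cap; not taken from the paper
) -> Tuple[Optional[int], int]:
    """Return (chosen option index or None, number of retrieval loops used)."""
    context = list(context)        # copy so the caller's list is not mutated
    for loops_used in range(max_loops + 1):
        reply = ask_llm(question, options, context)
        if reply.answer is not None:
            # Answerable: output the option inferred to be correct.
            return reply.answer, loops_used
        if loops_used < max_loops:
            # Unanswerable: query the VQA module at the predicted timestamp
            # and add what it reports to the original input.
            detail = ask_vqa(reply.timestamp, reply.info_request)
            context.append(f"[{reply.timestamp:.1f}s] {detail}")
    return None, max_loops


# Stand-in for an LLM: answers only once the context mentions a smile.
def stub_llm(question: str, options: List[str], context: List[str]) -> LLMReply:
    if any("smiling" in line for line in context):
        return LLMReply(answer=1)
    return LLMReply(info_request="What is the woman's facial expression?",
                    timestamp=12.0)


# Stand-in for a VQA model looking at the frame nearest the timestamp.
def stub_vqa(timestamp: float, request: str) -> str:
    return "The woman is smiling."


result = looped_qa(
    "How does the woman feel?",
    ["Angry", "Happy", "Bored", "Scared"],
    ["Dialogue: 'That is great news!'"],
    stub_llm,
    stub_vqa,
)
print(result)
```

With these stubs, the first attempt is unanswerable, one retrieval loop adds the visual detail, and the second attempt selects option 1. Passing the LLM and VQA calls in as plain callables keeps the control flow separate from any particular model, which matches the captions' note that the same loop is used with image inputs or with captions.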