VEGAS: Towards Visually Explainable and Grounded Artificial Social Intelligence

Hao Li, Hao Fei, Zechao Hu, Zhengwei Yang, Zheng Wang

TL;DR

VEGAS introduces a visually explainable, grounded approach to Social-IQ by replacing closed-set MCQ with open-ended reasoning. It pairs Language Guided Sampling to pull question-relevant frames, a Temporal Attention Module to restore frame order, and Generalist Instruction Fine-Tuning to produce a VEGAS-generalist with deep social reasoning. Through extensive multimodal training and evaluation, VEGAS reduces language shortcut reliance and demonstrates credible, visually grounded explanations, achieving state-of-the-art results on open-ended and MCQ tasks and excelling in emotion understanding. The work advances human-like social AI by integrating robust visual grounding, explainability, and expert-aligned reasoning capabilities.

Abstract

Social Intelligence Queries (Social-IQ) serve as the primary multimodal benchmark for evaluating a model's social intelligence level. While impressive multiple-choice question (MCQ) accuracy is achieved by current solutions, increasing evidence shows that they are largely, and in some cases entirely, dependent on the language modality, overlooking visual context. Additionally, the closed-set nature further prevents the exploration of whether and to what extent the reasoning path behind selection is correct. To address these limitations, we propose the Visually Explainable and Grounded Artificial Social Intelligence (VEGAS) model. As a generative multimodal model, VEGAS leverages open-ended answering to provide explainable responses, which enhances the clarity and evaluation of reasoning paths. To enable visually grounded answering, we propose a novel sampling strategy to provide the model with more relevant visual frames. We then enhance the model's interpretation of these frames through Generalist Instruction Fine-Tuning (GIFT), which aims to: i) learn multimodal-language transformations for fundamental emotional social traits, and ii) establish multimodal joint reasoning capabilities. Extensive experiments, comprising modality ablation, open-ended assessments, and supervised MCQ evaluations, consistently show that VEGAS effectively utilizes visual information in reasoning to produce correct and also credible answers. We expect this work to offer a new perspective on Social-IQ and advance the development of human-like social AI.
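
To make the idea of question-relevant frame sampling concrete, the sketch below scores each frame embedding against a question embedding and keeps the top-k frames in their original temporal order. It is an illustrative stand-in only: the paper's Language Guided Sampling and Temporal Attention Module are trained components whose details are not given here, and the function names, the cosine-similarity scoring, and the toy two-dimensional embeddings are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_frames(frame_embs, question_emb, k):
    """Return indices of the k frames most similar to the question, in temporal order.

    Hypothetical helper: a similarity-based proxy, not the authors' trained sampler.
    """
    scored = [(cosine(f, question_emb), i) for i, f in enumerate(frame_embs)]
    # Highest similarity first; ties are broken by the earlier frame index.
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:k]
    # Re-sort the chosen indices so the frames keep their original order in time.
    return sorted(i for _, i in top)


# Toy embeddings (assumed values, for illustration only).
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [0.9, 0.1]]
question = [1.0, 0.0]
selected = select_frames(frames, question, k=2)
print(selected)  # [0, 3]
```

Sorting the selected indices at the end is only a simple way of preserving temporal order; a learned module could model ordering in a richer way.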

Paper Structure

This paper contains 30 sections, 9 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: (a) An existing approach selects the correct option without knowing the question or even the video context, revealing incorrect rationale in the open-ended answers. (b) Our study begins with a correct reasoning path grounded in the video, ensuring reliable selection. (c) Our model enhances visual engagement, reduces the language shortcut, and achieves comparable but more reliable MCQ accuracy. * denotes the baseline of the corresponding method.
  • Figure 2: Architecture of VEGAS. The system encodes multimodal inputs with frozen encoders. These inputs are projected into LLM space using a trainable Multimodal Projector, enabling nuanced answer generation that captures social attitudes, such as emotions, in interactions.
  • Figure 3: Left: The proposed Language Guided Sampling modules. Right: Three datasets are crafted for LGS training, incorporating descriptive, causal, and nuanced language cues.
  • Figure 4: Open-ended QA examples of VEGAS (blue arrows) and VEGAS-generalist (orange arrows) using video alone and video with subtitles, respectively.
  • Figure 5: Video captioning examples from EmVidCap.