Listen Then See: Video Alignment with Speaker Attention

Aviral Agrawal; Carlos Mateo Samudio Lezcano; Iqui Balam Heredia-Marin; Prabhdeep Singh Sethi

TL;DR

This work targets social intelligence in video-question answering by bridging video, audio, and text through Speaking Turn Sampling (STS) and Vision-Language Cross Contextualization (VLCC). By aligning speaking-turn-based video frames with transcripts via an audio-enabled bridge and fusing them into language space, the approach mitigates language priors and enhances multimodal reasoning. Empirical results on Social IQ 2.0 show a new state-of-the-art accuracy of 82.06%, supported by ablations that demonstrate the value of both visual and linguistic contributions. The method advances SIQA by providing a modular, cross-modal fusion framework with practical implications for robust, context-aware AI in social scenarios, while also acknowledging limitations and potential biases that warrant careful consideration.

Abstract

Video-based Question Answering (Video QA) is a challenging task that becomes even more intricate in Socially Intelligent Question Answering (SIQA). SIQA requires context understanding, temporal reasoning, and the integration of multimodal information, and it additionally requires processing nuanced human behavior. These complexities are exacerbated by the dominance of the primary modality (text) over the others, so the task's secondary modalities need help to work in tandem with the primary modality. In this work, we introduce a cross-modal alignment and subsequent representation fusion approach that achieves state-of-the-art results (82.06% accuracy) on the Social IQ 2.0 dataset for SIQA. Our approach exhibits an improved ability to leverage the video modality by using the audio modality as a bridge to the language modality. This enhances performance by reducing the prevalent issue of language overfitting, and the resulting bypassing of the video modality, encountered by existing techniques. Our code and models are publicly available at https://github.com/sts-vlcc/sts-vlcc

Paper Structure (17 sections, 4 equations, 5 figures, 3 tables)

Introduction
Related Work
Methodology
Dataset and Analysis
SIQA Foundational model
Multi-modal alignment: Listen to the video
Modality Fusion: See the video
Results
Primary evaluation metrics
Ablation experiments
Increasing the model dependency on the visual modality
Increasing the model dependency on language modality
Discussion
Limitations
Overall improvement
...and 2 more sections

Figures (5)

Figure 1: Speaking Turn Sampling (STS) and Vision-Language Cross Contextualization (VLCC) in action. In the top dotted rectangle, the audio modality is used to obtain the speaking turn intervals, which drive our STS. These intervals are used to obtain the lower dotted rectangle, which contains the corresponding video frames and transcript excerpts. The frames and excerpts are used in tandem in the model to obtain jointly contextualized vision-language embeddings.
Figure 2: Example videos and questions in the Social-IQ 2.0 dataset (social_iq2). Each video has multiple questions, and each question has four options, of which one is correct and three are incorrect.
Figure 3: Speaking Turn Informed Video Frame Sampling Strategy: We sample frames only from the intervals in which people are speaking.
Figure 4: The proposed architecture. We run the Speaking Turn Sampling (STS) module to obtain the aligned $frame_i$ from speaking turn $k$ and the corresponding subtitle from the transcript. We pass this pair to the frozen CLIP encoder to obtain the visual and text encodings, respectively. The resulting encodings are passed through the Vision Language Cross Contextualization (VLCC) module and then through the projection layer to generate one of the inputs to the LLM. Simultaneously, we generate the text embeddings of size U for each question-answer pair, and the text embeddings of size V for the video subtitles.
Figure 5: The question asked in this video is "What is the tone of the people speaking?". This example shows that our method (in the green box) uses more relevant frames where people are speaking. In contrast, the baseline (in the red box) samples frames that do not contain relevant information for the task. In this example, our model predicts the correct answer, whereas the baseline does not.
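
The sampling idea described in Figures 1 and 3 (frames are taken only from speaking turns and paired with the matching transcript excerpt) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_speaking_turn_frames`, the `(start, end, text)` turn format, and the choice of evenly spaced frames inside each turn are all assumptions made for illustration, and the speaking-turn intervals are assumed to have already been extracted from the audio.

```python
from typing import List, Tuple

# Hypothetical turn format: (start_seconds, end_seconds, transcript_excerpt).
Turn = Tuple[float, float, str]


def sample_speaking_turn_frames(
    turns: List[Turn], fps: float, frames_per_turn: int = 1
) -> List[Tuple[int, str]]:
    """Pair frame indices inside each speaking turn with that turn's text.

    Frames are spaced evenly strictly inside the turn (an assumption; the
    paper's exact sampling rule is not given in this summary).
    """
    pairs: List[Tuple[int, str]] = []
    for start, end, text in turns:
        if end <= start:
            # Skip empty or malformed intervals.
            continue
        step = (end - start) / (frames_per_turn + 1)
        for j in range(1, frames_per_turn + 1):
            timestamp = start + j * step
            pairs.append((int(round(timestamp * fps)), text))
    return pairs


if __name__ == "__main__":
    example_turns = [(0.0, 2.0, "Hi there."), (4.0, 6.0, "How are you?")]
    for frame_index, excerpt in sample_speaking_turn_frames(example_turns, fps=10):
        print(frame_index, excerpt)
```

With one frame per turn, the sketch picks the midpoint of each speaking interval, so silent stretches between turns contribute no frames. Each resulting (frame, excerpt) pair corresponds to the kind of aligned input that Figure 4 describes as being passed to the frozen CLIP encoder.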
