The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective

Wenqi Jia, Miao Liu, Hao Jiang, Ishwarya Ananthabhotla, James M. Rehg, Vamsi Krishna Ithapu, Ruohan Gao

TL;DR

The Ego-Exocentric Conversational Graph Prediction problem is introduced, marking the first attempt to infer exocentric conversational interactions from egocentric videos.

Abstract

In recent years, the thriving development of research related to egocentric videos has provided a unique perspective for the study of conversational interactions, where both visual and audio signals play a crucial role. While most prior work focuses on learning about behaviors that directly involve the camera wearer, we introduce the Ego-Exocentric Conversational Graph Prediction problem, marking the first attempt to infer exocentric conversational interactions from egocentric videos. We propose a unified multi-modal framework -- Audio-Visual Conversational Attention (AV-CONV) -- for the joint prediction of conversation behaviors -- speaking and listening -- for both the camera wearer and all other social partners present in the egocentric video. Specifically, we adopt the self-attention mechanism to model the representations across time, across subjects, and across modalities. To validate our method, we conduct experiments on a challenging egocentric video dataset that includes multi-speaker and multi-conversation scenarios. Our results demonstrate the superior performance of our method compared to a series of baselines. We also present detailed ablation studies to assess the contribution of each component in our model. Check our project page at https://vjwq.github.io/AV-CONV/.

Paper Structure (24 sections, 1 equation, 9 figures, 5 tables)

Figures (9)

  • Figure 1: We propose (d) the Ego-Exocentric Conversational Graph Prediction problem that jointly learns (a) the egocentric behaviors---whether the camera wearer is speaking or listening to others, and (b) the exocentric behaviors---whether the other social partners in the scene are speaking or listening to one another, given only the egocentric video input (c).
  • Figure 2: An illustration of the Conversational Graph. The left, center, and right figures visualize $G_{Ego}$, $G_{Exo}$, and $G_{Conv}$, respectively. See the problem definition section for details.
  • Figure 3: An example of the edge attributes. We have binary annotations for each pair of participants in the conversation, including the camera wearer and all other partners.
  • Figure 4: Model Architecture Overview: Our model takes multiple egocentric frames and multi-channel audio signals. (a) For each frame, the faces of social partners are cropped to serve as raw visual input, while their corresponding head positions are concatenated with audio inputs to generate positional audio signals. Visual and audio signals are encoded by two separate ResNet18 backbones, and the outputs are concatenated to produce Audio-Visual features for each cropped head. (b) After obtaining temporal Audio-Visual feature tubes of video length, they are flattened into a token to be fed into the Conversational Attention Module to produce augmented Single Head Features $\mathcal{Z}^{av}$. Egocentric Classifiers take these features directly to predict Egocentric Edge Attributes, and pairs of these features are arbitrarily combined to generate pairwise audio-visual features to predict Exocentric Edge Attributes.
  • Figure 5: Visualization of the Ego-Exocentric Conversational Graph from our model prediction. We show three successful cases and one failure case in the bottom right. For the last failure example, we also overlay the ground truth of the conversational graph on the top right corner of the video frame as reference.
  • ...and 4 more figures
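
The pairing step described in the Figure 4 caption, together with the binary per-pair annotations of Figure 3, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `pairwise_features` and `edges_from_scores`, the use of ordered (directed) pairs, the 0.5 threshold, and the random stand-in for classifier scores are all assumptions made for illustration only.

```python
import itertools

import numpy as np


def pairwise_features(head_features: np.ndarray):
    """Concatenate per-head features for every ordered pair (i, j), i != j.

    head_features has shape (num_heads, dim). Ordered pairs are an
    assumption of this sketch; the caption only says pairs are combined.
    """
    n, d = head_features.shape
    pairs = list(itertools.permutations(range(n), 2))
    if not pairs:
        # A single head has no partner, so there are no exocentric pairs.
        return pairs, np.empty((0, 2 * d), dtype=head_features.dtype)
    feats = np.stack(
        [np.concatenate([head_features[i], head_features[j]]) for i, j in pairs]
    )
    return pairs, feats


def edges_from_scores(pairs, scores, threshold=0.5):
    """Map each pair to a binary edge attribute (hypothetical thresholding)."""
    return {pair: bool(score >= threshold) for pair, score in zip(pairs, scores)}


rng = np.random.default_rng(0)
z_av = rng.normal(size=(3, 4))            # 3 heads, 4-dim stand-in features
pairs, feats = pairwise_features(z_av)    # feats has shape (6, 8)
scores = rng.uniform(size=len(pairs))     # stand-in for classifier outputs
edges = edges_from_scores(pairs, scores)  # e.g. edges[(0, 1)] is True or False
```

With three heads there are six ordered pairs, so an exocentric classifier in this sketch would see six concatenated feature vectors, one per directed edge of the graph.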