Multi-Modal Gaze Following in Conversational Scenarios

Yuqi Hou, Zhongqun Zhang, Nora Horanyi, Jaewon Moon, Yihua Cheng, Hyung Jin Chang

TL;DR

The paper tackles gaze following in conversational scenarios by introducing a multi-modal framework (MMGaze) that fuses audio and visual cues to better infer gaze targets. It leverages lip-audio correlations for active speaker detection, enhances scene images with identity priors, and uses a gaze candidate estimator plus an MLP to map individuals to gaze targets. A key contribution is VideoGazeSpeech, the first gaze-following dataset with synchronized audio, enabling evaluation of audio-vision methods; experiments show that audio-vision fusion substantially improves gaze-target detection over vision-only baselines. The work advances robust gaze understanding in natural social settings, with potential benefits for social robotics and interaction systems where audio cues are informative.

Abstract

Gaze following estimates the gaze targets of people in a scene by understanding human behavior and scene information. Existing methods usually analyze only scene images. However, audio also provides crucial cues for determining human behavior, which suggests that gaze following can be further improved by considering audio cues. In this paper, we explore gaze following in conversational scenarios. We propose a novel multi-modal gaze following framework based on our observation that "audiences tend to focus on the speaker". We first leverage the correlation between audio and lip motion to classify speakers and listeners in a scene. We then use this identity information to enhance scene images and propose a gaze candidate estimation network. The network estimates gaze candidates from the enhanced scene images, and we use an MLP to match subjects with candidates as a classification task. Existing gaze following datasets focus on visual images and ignore audio. To evaluate our method, we collect a conversational dataset, VideoGazeSpeech (VGS), which is the first gaze following dataset that includes both images and audio. Our method significantly outperforms existing methods on the VGS dataset. The visualization results also demonstrate the advantage of audio cues in gaze following. We hope our work will inspire more research in multi-modal gaze following.
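
The speaker/listener classification step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the use of cosine similarity, the `threshold` value of 0.5, the function names, and the toy embeddings are all assumptions chosen for illustration. The abstract only states that the correlation between audio and lips is used to separate speakers from listeners.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_roles(audio_emb, lip_embs, threshold=0.5):
    """Label each person as 'speaker' or 'listener'.

    audio_emb: embedding of the audio window (hypothetical format).
    lip_embs:  dict mapping person id -> lip-motion embedding.
    threshold: hypothetical cut-off, not a value from the paper.
    """
    roles = {}
    for person, lip_emb in lip_embs.items():
        similarity = cosine_similarity(audio_emb, lip_emb)
        roles[person] = "speaker" if similarity >= threshold else "listener"
    return roles


# Toy embeddings: person A's lip motion correlates with the audio, person B's does not.
audio = [1.0, 0.0, 1.0]
lips = {"A": [0.9, 0.1, 1.1], "B": [-1.0, 0.5, 0.0]}
roles = classify_roles(audio, lips)
```

In this toy case person A is labelled the speaker and person B the listener. A real system would obtain both embeddings from a learned audio-visual synchronization model rather than from hand-written vectors.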

Paper Structure (21 sections, 2 equations, 6 figures, 1 table, 1 algorithm)

Figures (6)

  • Figure 1: We present a novel multimodal framework for conversational gaze following that uses audio-visual video input to detect gaze targets accurately. Our approach produces annotated bounding boxes for the speaker, listener, and gaze target. To support this approach, we introduce the VideoGazeSpeech (VGS) dataset, which includes annotated audio and video cues.
  • Figure 2: MMGaze performs gaze following for each video frame. Given one frame and the audio track, it first performs active speaker detection. MMGaze takes a 200 ms window of audio features (20 samples at 100 Hz) around the timestamp of the given frame, together with the corresponding 200 ms of visual frames (5 frames at 25 fps). It then extracts an audio representation and, for each individual, a visual representation of lip motion via SyncNet [chung2016out]. It computes the similarity between the audio representation and each individual's visual representation to determine identity information (speaker or listener). MMGaze then applies a gaze candidate estimation network, which contains a gaze target detector that estimates gaze target candidates from scene images enhanced by the identity information. A multilayer perceptron (MLP) predicts the relationship between each subject and all candidates. We select the candidate with the highest probability as the final gaze target for each subject.
  • Figure 3: Example from the VideoGazeSpeech (VGS) dataset. There are three people in the sample video. Each row of the figure shows the gaze-following annotation for one person in the video: the green box marks the gaze target, and the red box marks the head of the person whose gaze is annotated.
  • Figure 4: Comparison between the gaze candidate estimation model and the VAT model. The first row shows the output of the gaze candidate estimation model, the second row shows the output of the VAT model, and the third row shows the ground truth. In the first frame, our model accurately detects the gaze target where the VAT method fails. In the second frame, our model accurately detects the speaker as the gaze target in a conversational scenario, while the other model fails. Incorporating audio cues is crucial for gaze following, and audio-visual fusion can significantly improve accuracy, especially in real-world scenarios.
  • Figure 5: Quantitative evaluation in comparison with state-of-the-art methods on our VGS dataset in the AP (Average precision) metric $\uparrow$ (higher is better). Our method outperforms DETR and VAT.
  • ...and 1 more figures
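
The window sizes and the final selection step in the Figure 2 caption can be checked with a short sketch. It is only an illustration under stated assumptions: the helper names are hypothetical, the window is assumed to be centered on the given frame (the caption says only "near the timestamp"), and the probabilities are made-up values standing in for MLP outputs.

```python
def window_length(duration_ms, rate_hz):
    """Number of samples that cover duration_ms at a sampling rate of rate_hz."""
    return round(duration_ms * rate_hz / 1000)


def centered_window(center_index, length, total):
    """Indices of a window of `length` items around center_index, clipped to [0, total).

    Centering is an assumption made for this sketch.
    """
    if length > total:
        raise ValueError("window is longer than the sequence")
    start = center_index - length // 2
    start = max(0, min(start, total - length))
    return list(range(start, start + length))


def pick_gaze_target(probabilities):
    """Index of the candidate with the highest probability for one subject."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])


# 200 ms of audio features at 100 Hz and 200 ms of video at 25 fps.
audio_len = window_length(200, 100)
video_len = window_length(200, 25)

# Video frames around frame 50 of a 250-frame clip.
frames = centered_window(center_index=50, length=video_len, total=250)

# Hypothetical MLP probabilities for one subject over three gaze candidates.
target = pick_gaze_target([0.1, 0.7, 0.2])
```

The arithmetic reproduces the numbers in the caption: 200 ms corresponds to 20 audio feature samples at 100 Hz and 5 video frames at 25 fps, and the candidate with the highest probability (index 1 in the toy example) becomes the subject's gaze target.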