CSGaze: Context-aware Social Gaze Prediction

Surbhi Madan, Shreya Ghosh, Ramanathan Subramanian, Abhinav Dhall, Tom Gedeon

TL;DR

CSGaze addresses the challenge of predicting social gaze in multi-person conversations by fusing facial cues, scene context, and linguistic context from a multilingual language model. The method introduces a fine-grained principal-speaker attention mechanism and cross-modal fusion between face, scene, and contextual features, and is trained in two phases, the first of which pretrains on gaze datasets. It achieves state-of-the-art or competitive results on GP-Static and LAEO benchmarks and shows strong generalization to UCO-LAEO, AVA-LAEO, and VSGaze, while remaining lightweight (~54M parameters). The work also provides initial explainability via attention scores, which give some insight into the model's decisions. Overall, the approach demonstrates the value of language-guided contextualization for robust and scalable social gaze understanding in real-world scenarios.

Abstract

A person's gaze offers valuable insights into their focus of attention, level of social engagement, and confidence. In this work, we investigate how contextual cues combined with visual scene and facial information can be effectively utilized to predict and interpret social gaze patterns during conversational interactions. We introduce CSGaze, a context-aware multimodal approach that leverages facial and scene information as complementary inputs to enhance social gaze pattern prediction from multi-person images. The model also incorporates a fine-grained attention mechanism centered on the principal speaker, which helps in better modeling social gaze dynamics. Experimental results show that CSGaze performs competitively with state-of-the-art methods on GP-Static, UCO-LAEO and AVA-LAEO. Our findings highlight the role of contextual cues in improving social gaze prediction. Additionally, we provide initial explainability through generated attention scores, offering insights into the model's decision-making process. We also demonstrate the model's generalizability by testing it on open-set datasets, demonstrating its robustness across diverse scenarios.

Paper Structure

This paper contains 20 sections, 2 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Illustration of static gaze patterns in dyadic communication. The five figures depict different gaze patterns, with the girl as the principal and the boy as the associate.
  • Figure 2: Sample frames with their contextual information from three datasets. Top: GP-Static, Middle: UCO-LAEO, Bottom: AVA-LAEO.
  • Figure 3: CSGaze Framework. Phase 1: We pre-train the scene and face encoders using the GazeFollow dataset (Recasens et al., NIPS 2015), which contains images annotated with gaze locations, providing supervisory signals for gaze estimation (see the heatmap, head location, and gaze vector). Phase 2: Given an image with two interacting individuals, the Principal (green bounding box) and the Associate (pink bounding box), their cropped facial regions are processed through the pretrained face encoder to extract face-specific embeddings. To model the interaction between the two individuals, we employ an attention-based fusion mechanism, where the embeddings of both individuals are combined using learnable weighting parameters that determine the relative importance of each person’s facial features in the final representation.
  • Figure 4: Sample images from the GP-Static dataset for each corresponding class label.
  • Figure 5: Qualitative Analysis. Comparison of our proposed approach, CSGaze, with the baseline method gaze across four diverse datasets: GP-Static, AVA-LAEO, UCO-LAEO, and VideoCoAtt. In each example, green bounding boxes indicate the principal face, while pink boxes denote the associate. Predictions are color-coded: blue for correct and red for incorrect outputs.
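
The Figure 3 caption describes fusing the principal and associate face embeddings with learnable weights that set each person's relative importance. The following is a minimal sketch of one way such a fusion could work, not the authors' implementation. It assumes the weights come from a softmax over two scalar logits (the caption only says "learnable weighting parameters"), and the names `softmax`, `fuse_faces`, and `logits` are hypothetical. In a trained model the logits would be learned; here they are passed in as plain numbers.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_faces(
    principal: Sequence[float],
    associate: Sequence[float],
    logits: Tuple[float, float] = (0.0, 0.0),
) -> Tuple[List[float], Tuple[float, float]]:
    """Combine two face embeddings as a convex, weighted sum.

    `logits` stands in for the learnable parameters (an assumption):
    the softmax turns them into two weights that sum to 1.
    Returns the fused embedding and the (principal, associate) weights.
    """
    if len(principal) != len(associate):
        raise ValueError("embeddings must have the same dimension")
    w_p, w_a = softmax(logits)
    fused = [w_p * p + w_a * a for p, a in zip(principal, associate)]
    return fused, (w_p, w_a)


if __name__ == "__main__":
    # Toy 4-dimensional embeddings; real face embeddings would be much larger.
    principal_emb = [1.0, 0.0, 2.0, -1.0]
    associate_emb = [0.0, 1.0, 0.0, 1.0]
    fused, weights = fuse_faces(principal_emb, associate_emb, logits=(1.0, 0.0))
    print("weights:", weights)
    print("fused:", fused)
```

Under this assumption the two weights are directly readable as the relative importance of each face, which is in the spirit of the attention-score explainability the abstract mentions.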