GazeCLIP: Enhancing Gaze Estimation Through Text-Guided Multimodal Learning

Jun Wang, Hao Ruan, Liangjian Wen, Yong Dai, Mingjie Wang

TL;DR

This paper introduces GazeCLIP, a text-guided gaze estimation framework that leverages CLIP to inject linguistic priors into gaze regression. A predefined textual prompt supplies a coarse gaze direction via zero-shot CLIP, a cross-attention fusion module combines image and text features for fine-grained gaze prediction, and an MLP head projects the fused features to yaw and pitch. Experiments on MPIIFaceGaze, EyeDiap, and RT-Gene show state-of-the-art angular error (e.g., ~0.4° average improvement) and demonstrate the importance of language knowledge, prompt design, and fusion strategy. The work highlights the potential of visual-language collaboration to advance gaze estimation and broader multimodal learning in vision tasks.
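
The angular error quoted above is the angle between the predicted and ground-truth 3D gaze directions. The sketch below shows one common way to compute it from (yaw, pitch) pairs. The axis convention and the function names `gaze_vector` and `angular_error_deg` are assumptions made for illustration, not taken from the paper.

```python
import math

def gaze_vector(yaw, pitch):
    """Convert (yaw, pitch) in radians to a 3D unit gaze vector.

    Assumed convention (not confirmed by the paper): zero yaw and
    zero pitch map to the direction (0, 0, -1).
    """
    return (
        -math.cos(pitch) * math.sin(yaw),
        -math.sin(pitch),
        -math.cos(pitch) * math.cos(yaw),
    )

def angular_error_deg(pred, target):
    """Angle in degrees between two gaze directions given as (yaw, pitch)."""
    a = gaze_vector(*pred)
    b = gaze_vector(*target)
    cos_sim = sum(x * y for x, y in zip(a, b))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_sim = max(-1.0, min(1.0, cos_sim))
    return math.degrees(math.acos(cos_sim))
```

For example, `angular_error_deg((0.0, 0.0), (0.0, 0.1))` is about 5.73°, since the two directions differ by 0.1 rad of pitch only.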

Abstract

Visual gaze estimation, with its wide-ranging application scenarios, has garnered increasing attention within the research community. Although existing approaches infer gaze solely from image signals, recent advances in visual-language collaboration have demonstrated that the integration of linguistic information can significantly enhance performance across various visual tasks. Leveraging the remarkable transferability of large-scale Contrastive Language-Image Pre-training (CLIP) models, we address the open and urgent question of how to effectively apply linguistic cues to gaze estimation. In this work, we propose GazeCLIP, a novel gaze estimation framework that deeply explores text-face collaboration. Specifically, we introduce a meticulously designed linguistic description generator to produce text signals enriched with coarse directional cues. Furthermore, we present a CLIP-based backbone adept at characterizing text-face pairs for gaze estimation, complemented by a fine-grained multimodal fusion module that models the intricate interrelationships between heterogeneous inputs. Extensive experiments on three challenging datasets demonstrate the superiority of GazeCLIP, which achieves state-of-the-art accuracy. Our findings underscore the potential of using visual-language collaboration to advance gaze estimation and open new avenues for future research in multimodal learning for visual tasks. The implementation code and the pre-trained model will be made publicly available.
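
As a rough illustration of how a coarse directional cue could be chosen zero-shot, the sketch below scores one image embedding against one text embedding per direction and keeps the most similar prompt. The four directions come from the Figure 4 caption. The prompt template, the function names, and the toy embeddings (stand-ins for CLIP encoder outputs) are hypothetical and are not the paper's description generator.

```python
import math

# Coarse directions named in the Figure 4 caption.
DIRECTIONS = ["front", "down", "left", "right"]

def build_prompts(template="A photo of a face gazing {}."):
    """Fill a (hypothetical) prompt template with each coarse direction."""
    return [template.format(d) for d in DIRECTIONS]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pick_direction(image_emb, text_embs):
    """Return the direction whose text embedding best matches the image."""
    sims = [cosine(image_emb, t) for t in text_embs]
    best = max(range(len(sims)), key=sims.__getitem__)
    return DIRECTIONS[best], sims

# Toy embeddings standing in for CLIP text-encoder outputs (one per prompt).
toy_text_embs = [
    [1.0, 0.0, 0.0, 0.0],  # front
    [0.0, 1.0, 0.0, 0.0],  # down
    [0.0, 0.0, 1.0, 0.0],  # left
    [0.0, 0.0, 0.0, 1.0],  # right
]
direction, _ = pick_direction([0.1, 0.9, 0.0, 0.0], toy_text_embs)
```

In a real pipeline the embeddings would come from the CLIP image and text encoders, and the selected prompt would then be passed on as the text input paired with the face image.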

Paper Structure (21 sections, 16 equations, 6 figures, 4 tables, 1 algorithm)

This paper contains 21 sections, 16 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: (a) Existing single-modal approaches learn gaze-oriented representations directly from 2D face/eye images via CNN-based structures, whereas (b) the proposed GazeCLIP exploits the synergy between text and image features.
  • Figure 2: GazeCLIP takes pairs of facial images and corresponding textual descriptions as input and uses the image and text encoders of the CLIP model as its backbone for feature extraction. During training, the image encoder is fine-tuned to adapt to gaze estimation, while the parameters of the text encoder remain frozen to preserve the pre-trained linguistic knowledge. This design lets the model retain the robust semantic understanding of CLIP while optimizing its visual feature extraction for the task at hand.
  • Figure 3: The results of an ablation study evaluating different feature fusion approaches. The visual-linguistic interaction module, which leverages a cross-attention mechanism combined with residual connections, demonstrates superior capability in effectively integrating features from both visual and textual modalities, leading to enhanced performance in gaze estimation.
  • Figure 4: Images assigned to different coarse directions, including front, down, left, and right.
  • Figure 5: Visualization of inferred results. Red lines represent ground-truth annotations, while blue lines indicate model predictions.
  • ...and 1 more figure
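
The Figure 3 caption describes a visual-linguistic interaction module built from cross-attention with residual connections. A minimal sketch of that idea follows, assuming image tokens act as queries and text tokens as keys and values, with the learned projection matrices and multiple heads omitted for brevity. Which modality supplies the queries, and all names here, are assumptions rather than details confirmed by the source.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention_residual(image_tokens, text_tokens):
    """Fuse text into image tokens: output = image token + attended text.

    Simplification: queries, keys, and values are the raw features;
    a real module would apply learned linear projections first.
    """
    d = len(text_tokens[0])
    fused = []
    for q in image_tokens:
        # Scaled dot-product scores of this image token against every text token.
        scores = [dot(q, k) / math.sqrt(d) for k in text_tokens]
        weights = softmax(scores)
        # Attention-weighted average of the text tokens.
        attended = [
            sum(w * v[i] for w, v in zip(weights, text_tokens))
            for i in range(d)
        ]
        # Residual connection keeps the original visual feature.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused
```

The residual term means that even when the attended text contributes little, the visual feature passes through unchanged, which is the usual motivation for pairing attention with a skip connection.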