GazeCLIP: Enhancing Gaze Estimation Through Text-Guided Multimodal Learning
Jun Wang, Hao Ruan, Liangjian Wen, Yong Dai, Mingjie Wang
TL;DR
This paper introduces GazeCLIP, a text-guided gaze estimation framework that leverages CLIP to inject linguistic priors into gaze regression. A predefined textual prompt guides coarse direction via zero-shot CLIP, while a cross-attention fusion module combines image and text features for fine-grained Gaze prediction, with an MLP head projecting to yaw and pitch. Experiments on MPIIFaceGaze, EyeDiap, and RT-Gene show state-of-the-art angular error reductions (e.g., ~0.4° average improvement) and demonstrate the importance of language knowledge, prompt design, and fusion strategy. The work highlights the potential of visual-language collaboration to advance gaze estimation and broader multimodal learning in vision tasks.
Abstract
Visual gaze estimation, with its wide-ranging application scenarios, has garnered increasing attention within the research community. Although existing approaches infer gaze solely from image signals, recent advances in visual-language collaboration have demonstrated that the integration of linguistic information can significantly enhance performance across various visual tasks. Leveraging the remarkable transferability of large-scale Contrastive Language-Image Pre-training (CLIP) models, we address the open and urgent question of how to effectively apply linguistic cues to gaze estimation. In this work, we propose GazeCLIP, a novel gaze estimation framework that deeply explores text-face collaboration. Specifically, we introduce a meticulously designed linguistic description generator to produce text signals enriched with coarse directional cues. Furthermore, we present a CLIP-based backbone adept at characterizing text-face pairs for gaze estimation, complemented by a fine-grained multimodal fusion module that models the intricate interrelationships between heterogeneous inputs. Extensive experiments on three challenging datasets demonstrate the superiority of GazeCLIP, which achieves state-of-the-art accuracy. Our findings underscore the potential of using visual-language collaboration to advance gaze estimation and open new avenues for future research in multimodal learning for visual tasks. The implementation code and the pre-trained model will be made publicly available.
