CLIP-Gaze: Towards General Gaze Estimation via Visual-Linguistic Model

Pengwei Yin, Guanzhong Zeng, Jingjing Wang, Di Xie

TL;DR

CLIP-Gaze addresses cross-domain gaze estimation by integrating a vision-language model to disentangle gaze-relevant from gaze-irrelevant information using language-described distractors. It introduces Personalized Context Optimization to tailor text prompts per identity and a Rank Gaze-Relevant Features loss to shape the feature distribution across samples, all trained with a joint loss that aligns CLIP space, separates gaze-irrelevant factors, and preserves gaze structure. The method achieves state-of-the-art cross-domain performance on four benchmarks without target-domain data and demonstrates clear gains from prompt personalization and relational feature losses. This approach broadens gaze-estimation robustness by leveraging external linguistic knowledge, offering practical benefits for real-world HCI and driver-monitoring applications where domain shifts are common.

Abstract

Gaze estimation methods often experience significant performance degradation when evaluated across different domains, due to the domain gap between the testing and training data. Existing methods try to address this issue using various domain generalization approaches, but with little success because gaze datasets offer limited diversity in factors such as appearance, wearables, and image quality. To overcome these limitations, we propose a novel framework called CLIP-Gaze that utilizes a pre-trained vision-language model to leverage its transferable knowledge. Our framework is the first to apply a vision-and-language cross-modality approach to the gaze estimation task. Specifically, we extract the gaze-relevant feature by pushing it away from gaze-irrelevant features, which can be flexibly constructed via language descriptions. To learn more suitable prompts, we propose a personalized context optimization method for text prompt tuning. Furthermore, we utilize the relationship among gaze samples to refine the distribution of gaze-relevant features, thereby improving the generalization capability of the gaze estimation model. Extensive experiments demonstrate the excellent performance of CLIP-Gaze over existing methods on four cross-domain evaluations.
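
The idea of pushing a gaze-relevant feature away from language-derived gaze-irrelevant features can be illustrated with a small sketch. This is not the paper's loss: the function name `irrelevance_loss`, the use of mean absolute cosine similarity, and the toy 3-dimensional vectors standing in for text-encoder outputs are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def irrelevance_loss(gaze_feat, irrelevant_feats):
    """Mean absolute cosine similarity between one gaze-relevant feature
    and a set of gaze-irrelevant features (hypothetical formulation).
    Minimizing it pushes the gaze feature toward orthogonality with them."""
    sims = [abs(cosine(gaze_feat, t)) for t in irrelevant_feats]
    return sum(sims) / len(sims)

# Toy stand-ins for text features of gaze-irrelevant descriptions
# (e.g. prompts about appearance or image quality); values are made up.
irrelevant = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

entangled = [1.0, 1.0, 0.0]      # overlaps with both irrelevant directions
disentangled = [0.0, 0.0, 1.0]   # orthogonal to both

print(irrelevance_loss(entangled, irrelevant))     # about 0.707
print(irrelevance_loss(disentangled, irrelevant))  # 0.0
```

In a real training loop this term would be one component of a joint objective alongside the gaze regression loss; the weighting between them is not specified here.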

Paper Structure (20 sections, 9 equations, 4 figures, 4 tables)


Figures (4)

  • Figure 1: (1) Top subgraph: The conventional gaze generalization approach enhances the model’s robustness through adversarial training, but can mitigate only a few gaze-irrelevant factors. (2) Bottom subgraph: Our method, CLIP-Gaze, constructs text prompts from diverse language descriptions to obtain gaze-irrelevant features, and then pushes the gaze-relevant feature away from them in the feature space to handle various gaze-disturbing factors and achieve a robust model.
  • Figure 2: Overview of our CLIP-Gaze framework. We promote gaze domain generalization by introducing abundant knowledge outside the source domain to explicitly eliminate gaze-irrelevant features.
  • Figure 3: Our method, Personalized Context Optimization (PCO), has two learnable components: a context vector set and a lightweight neural network (Meta-Net) that produces a facial token for each identity, while the vision encoder, text encoder and 3DMM model are frozen during training.
  • Figure 4: Visualization of the feature distribution. Different colors denote different gaze directions, and close gaze directions share similar colors. (Best viewed in color.)
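
The two learnable components named in the Figure 3 caption (a shared context vector set and a Meta-Net that emits one facial token per identity) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the two-layer Meta-Net, the dimensions, the token ordering in `build_prompt`, and the random vectors used as per-identity inputs (which in the paper's setting would presumably come from the frozen 3DMM model) are all assumptions.

```python
import random

def linear(x, weights, bias):
    """One dense layer; `weights` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

class MetaNet:
    """Hypothetical lightweight network: identity descriptor -> facial token."""

    def __init__(self, in_dim, hidden_dim, out_dim, rng):
        def init(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]
        self.w1, self.b1 = init(hidden_dim, in_dim), [0.0] * hidden_dim
        self.w2, self.b2 = init(out_dim, hidden_dim), [0.0] * out_dim

    def __call__(self, identity_feat):
        hidden = relu(linear(identity_feat, self.w1, self.b1))
        return linear(hidden, self.w2, self.b2)

def build_prompt(context_vectors, facial_token, word_embeddings):
    """Assumed layout: shared context, then the identity-specific token,
    then the embedded description words. Returns a list of token vectors
    that a frozen text encoder would consume."""
    return context_vectors + [facial_token] + word_embeddings

rng = random.Random(0)
EMBED_DIM, NUM_CONTEXT, ID_DIM = 4, 3, 6  # illustrative sizes only

# Learnable component 1: context vectors shared across identities.
context_vectors = [[rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
                   for _ in range(NUM_CONTEXT)]
# Learnable component 2: the Meta-Net.
meta_net = MetaNet(ID_DIM, 8, EMBED_DIM, rng)

# Placeholder embeddings for the words of one language description.
word_embeddings = [[0.5] * EMBED_DIM, [-0.5] * EMBED_DIM]

# Placeholder per-identity descriptors.
identity_a = [rng.uniform(-1, 1) for _ in range(ID_DIM)]
identity_b = [rng.uniform(-1, 1) for _ in range(ID_DIM)]

prompt_a = build_prompt(context_vectors, meta_net(identity_a), word_embeddings)
prompt_b = build_prompt(context_vectors, meta_net(identity_b), word_embeddings)

print(len(prompt_a), len(prompt_a[0]))  # 6 tokens, each of dimension 4
```

Only the context vectors and the Meta-Net weights would receive gradients in this setup; the prompts for two identities share the same context tokens and differ only in the facial token.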