Learning Unsupervised Gaze Representation via Eye Mask Driven Information Bottleneck

Yangzhou Jiang, Yinxin Lin, Yaoming Wang, Teng Li, Bilian Ke, Bingbing Ni

TL;DR

This work tackles the annotation burden in gaze estimation by proposing Eye Mask Driven Information Bottleneck (EM-IB), a self-supervised framework that learns gaze-oriented representations from full-face images. EM-IB uses a dual-branch architecture (EM-AE and FF-IB) with an injection bottleneck to distill gaze information into eye reconstruction, complemented by an eye/gaze information contrastive loss to avoid overfitting non-eye regions. The approach achieves state-of-the-art performance among unsupervised gaze methods, enabling strong linear and few-shot calibration results, cross-dataset transfer, and even CNN distillation without labeled data. The framework demonstrates marked improvements on multiple benchmarks and offers practical benefits for industrial gaze estimation and cross-domain robustness, with clear ablations supporting the design choices.

Abstract

Appearance-based supervised methods with full-face image input have made tremendous advances in recent gaze estimation tasks. However, the intensive human annotation requirement inhibits current methods from achieving industrial-level accuracy and robustness. Although current unsupervised pre-training frameworks have achieved success in many image recognition tasks, the deep coupling between facial and eye features leaves such frameworks deficient in extracting useful gaze features from full-face images. To alleviate the above limitations, this work proposes a novel unsupervised/self-supervised gaze pre-training framework, which forces the full-face branch to learn a low-dimensional gaze embedding without gaze annotations, through collaborative feature contrast and squeeze modules. At the heart of this framework is an alternating eye-attended/unattended masking training scheme, which squeezes gaze-related information from the full-face branch into an eye-masked auto-encoder through an injection bottleneck design. This design encourages the model to pay more attention to gaze direction rather than facial textures only, while still adopting the eye self-reconstruction objective. At the same time, a novel eye/gaze-related information contrastive loss is designed to further boost the learned representation by forcing the model to focus on eye-centered regions. Extensive experimental results on several gaze benchmarks demonstrate that the proposed scheme achieves superior performance over the unsupervised state of the art.
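The eye-attended masking idea described above can be illustrated with a small sketch. This is not the authors' implementation: the function names `eye_patch_mask` and `apply_mask`, the 224-pixel image size, the 16-pixel patch size, and the eye bounding boxes are all hypothetical values chosen for illustration. The sketch only shows how ViT-style patches that overlap the eye regions could be selected and blanked, so that a reconstruction branch must recover them from other information.

```python
import numpy as np


def eye_patch_mask(image_size, patch_size, eye_boxes):
    """Return a boolean patch grid that is True where a patch overlaps an eye box.

    eye_boxes holds (x0, y0, x1, y1) pixel boxes with exclusive upper bounds.
    All arguments are illustrative assumptions, not values from the paper.
    """
    grid = image_size // patch_size
    mask = np.zeros((grid, grid), dtype=bool)
    for x0, y0, x1, y1 in eye_boxes:
        c0, r0 = x0 // patch_size, y0 // patch_size
        c1, r1 = (x1 - 1) // patch_size, (y1 - 1) // patch_size
        mask[r0:r1 + 1, c0:c1 + 1] = True
    return mask


def apply_mask(image, mask, patch_size):
    """Return a copy of an (H, W, C) image with the masked patches set to zero."""
    # Expand the patch-level mask to pixel resolution.
    pixel_mask = mask.repeat(patch_size, axis=0).repeat(patch_size, axis=1)
    out = image.copy()
    out[pixel_mask] = 0
    return out


# Hypothetical eye boxes on a 224x224 face crop.
boxes = [(50, 80, 100, 110), (124, 80, 174, 110)]
mask = eye_patch_mask(224, 16, boxes)
face = np.ones((224, 224, 3), dtype=np.float32)
masked_face = apply_mask(face, mask, 16)
```

In practice the eye boxes would come from a facial landmark detector. An eye-unattended pass could use the complement or a random mask instead, but how the two passes alternate is not specified in this summary.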

Paper Structure (28 sections, 4 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Motivation of the proposed work. The Masked Auto-Encoder (MAE) [He et al., 2022] learns by completing highly masked images (in which the eyes are most probably masked out), which very likely yields over-fitted eye/gaze representations. The Auto-Encoder (AE) [Vincent et al., 2008] learns from whole-image dimensionality reduction and reconstruction, so eye/gaze information is attenuated during full-face feature learning and pooling. Thus, both AE and MAE tend to learn general face information (or even non-eye information such as facial texture) instead of focusing on gaze information. In contrast, our proposed Eye Mask Driven Information Bottleneck (EM-IB) unsupervised learning scheme forces the model to concentrate on eye/gaze-related information through 1) a full-face global feature injection via a novel information bottleneck structure design and 2) a newly proposed eye/gaze information contrastive training loss.
  • Figure 2: Illustration of our proposed Eye Mask Driven Injection Bottleneck (EM-IB) unsupervised gaze learning pipeline. The upper branch is the Eye-masked Auto-Encoder (EM-AE), which extracts eye/gaze-related information from the unmasked patches to reconstruct the masked eye area. The bottom branch is the Full-face Injection Bottleneck (FF-IB), which injects a compressed full-face-level gaze-related vector into the EM-AE module. The eye/gaze information self-squeeze is achieved by this asymmetric encoder-decoder structure. During the unsupervised pre-training phase, both an MSE loss and the eye/gaze information contrastive loss are used. For linear probing in the gaze estimator training phase, a conventional 2D gaze angular loss is used. Gradient flows (within the weight-sharing ViT encoder structure) for both the reconstruction and injection branches are also indicated.
  • Figure 3: Visualization of the reconstructed eyes from masked image input. We show the ground-truth (a) and the masked faces and eye patches (b)-(d) reconstructed by MAE (b), AE (c) and our proposed EM-IB (d), respectively.
  • Figure 4: Results of fine-tuning the model on subsets of Gaze360 and XGaze. The backbone is ViT-tiny.
  • Figure 5: Visualization of gaze reconstructed eyes. From top to bottom, the rows show 1) the original ground-truth image, and the results obtained by 2) our EM-IB, 3) AE and 4) MAE-single, respectively.
  • ...and 2 more figures
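
The Figure 2 caption mentions a conventional 2D gaze angular loss for linear probing. The exact formulation is not given in this summary, so the sketch below only shows the angular error commonly reported in gaze estimation: (pitch, yaw) angles are converted to unit 3D vectors and the angle between prediction and target is measured. The function names and the sign convention of the conversion are assumptions made for illustration, and conventions differ between datasets.

```python
import numpy as np


def pitchyaw_to_vector(pitchyaw):
    """Convert (..., 2) pitch/yaw angles in radians to (..., 3) unit gaze vectors.

    The sign convention here is one common choice and is an assumption.
    """
    pitch, yaw = pitchyaw[..., 0], pitchyaw[..., 1]
    x = -np.cos(pitch) * np.sin(yaw)
    y = -np.sin(pitch)
    z = -np.cos(pitch) * np.cos(yaw)
    return np.stack([x, y, z], axis=-1)


def angular_error_deg(pred, target):
    """Angle in degrees between predicted and target gaze directions."""
    a = pitchyaw_to_vector(np.asarray(pred, dtype=np.float64))
    b = pitchyaw_to_vector(np.asarray(target, dtype=np.float64))
    # Both vectors are unit length by construction; clip guards against rounding.
    cos_sim = np.clip(np.sum(a * b, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos_sim))


# Two samples: a perfect prediction and one that is off by 90 degrees in yaw.
pred = np.array([[0.1, -0.2], [0.0, 0.0]])
target = np.array([[0.1, -0.2], [0.0, np.pi / 2]])
errors = angular_error_deg(pred, target)
```

Averaging `errors` over a test set gives a mean angular error in degrees, the form in which gaze benchmarks are usually reported.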