Learning Unsupervised Gaze Representation via Eye Mask Driven Information Bottleneck
Yangzhou Jiang, Yinxin Lin, Yaoming Wang, Teng Li, Bilian Ke, Bingbing Ni
TL;DR
This work tackles the annotation burden in gaze estimation by proposing Eye Mask Driven Information Bottleneck (EM-IB), a self-supervised framework that learns gaze-oriented representations from full-face images. EM-IB uses a dual-branch architecture (EM-AE and FF-IB) with an injection bottleneck that distills gaze information into eye reconstruction, complemented by an eye/gaze information contrastive loss that keeps the model from overfitting to non-eye regions. The approach achieves state-of-the-art performance among unsupervised gaze methods, enabling strong linear and few-shot calibration results, cross-dataset transfer, and even CNN distillation without labeled data. The framework shows marked improvements on multiple benchmarks and offers practical benefits for industrial gaze estimation and cross-domain robustness, with ablations supporting the design choices.
Abstract
Appearance-based supervised methods with full-face image input have made tremendous advances in recent gaze estimation tasks. However, the intensive human annotation they require prevents current methods from achieving industrial-level accuracy and robustness. Although current unsupervised pre-training frameworks have succeeded in many image recognition tasks, the deep coupling between facial and eye features leaves such frameworks deficient in extracting useful gaze features from full-face images. To alleviate these limitations, this work proposes a novel unsupervised/self-supervised gaze pre-training framework, which forces the full-face branch to learn a low-dimensional gaze embedding without gaze annotations, through collaborative feature contrast and squeeze modules. At the heart of this framework is an alternating eye-attended/unattended masking training scheme, which squeezes gaze-related information from the full-face branch into an eye-masked auto-encoder through an injection bottleneck design. This design encourages the model to pay more attention to gaze direction rather than to facial textures only, while still adopting the eye self-reconstruction objective. At the same time, a novel eye/gaze-related information contrastive loss is designed to further boost the learned representation by forcing the model to focus on eye-centered regions. Extensive experimental results on several gaze benchmarks demonstrate that the proposed scheme achieves superior performance over the unsupervised state of the art.
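As a rough illustration of two ingredients named in the abstract, the sketch below masks eye regions of an image and evaluates a contrastive loss on small embeddings. It is not the authors' implementation. The function names, the box format (top, left, height, width), and the toy 2-D embeddings are all assumptions introduced here, and a generic InfoNCE loss stands in for the paper's eye/gaze-related information contrastive loss, whose exact form the abstract does not give. The injection bottleneck itself is not modeled.

```python
import math


def mask_eye_regions(face, eye_boxes, fill=0.0):
    """Return a copy of a grayscale image (list of rows) with each box filled.

    Each box is (top, left, height, width). The boxes are hypothetical
    inputs, e.g. produced by a facial landmark detector.
    """
    masked = [row[:] for row in face]  # copy so the input stays untouched
    for top, left, height, width in eye_boxes:
        for r in range(top, top + height):
            for c in range(left, left + width):
                masked[r][c] = fill
    return masked


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss (a stand-in, not the paper's loss).

    anchors[i] is pulled toward positives[i] and pushed away from
    positives[j] for every j != i.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy usage: a 6x8 "face" with two assumed eye boxes.
face = [[1.0] * 8 for _ in range(6)]
masked = mask_eye_regions(face, [(1, 1, 2, 2), (1, 5, 2, 2)])

# Toy 2-D embeddings: matched pairs should give a lower loss than mismatched ones.
gaze = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_loss = info_nce(gaze, gaze)
shuffled_loss = info_nce(gaze, gaze[1:] + gaze[:1])
```

In this toy setup the matched pairing yields a lower loss than the shuffled pairing, which is the property any contrastive objective of this kind relies on. How the actual framework chooses positives and negatives among eye and full-face features is left to the paper's method section.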
