GazeMoE: Perception of Gaze Target with Mixture-of-Experts

Zhuangzhuang Dai, Zhongxi Lu, Vincent G. Zakka, Luis J. Manso, Jose M Alcaraz Calero, Chen Li

TL;DR

GazeMoE is a novel end-to-end framework that selectively leverages gaze-target-related cues from a frozen foundation model through MoE modules. It incorporates a class-balancing auxiliary loss alongside strategic data augmentations, including region-specific cropping and photometric transformations.

Abstract

Estimating human gaze target from visible images is a critical task for robots to understand human attention, yet the development of generalizable neural architectures and training paradigms remains challenging. While recent advances in pre-trained vision foundation models offer promising avenues for locating gaze targets, the integration of multi-modal cues -- including eyes, head poses, gestures, and contextual features -- demands adaptive and efficient decoding mechanisms. Inspired by Mixture-of-Experts (MoE) for adaptive domain expertise in large vision-language models, we propose GazeMoE, a novel end-to-end framework that selectively leverages gaze-target-related cues from a frozen foundation model through MoE modules. To address class imbalance in gaze target classification (in-frame vs. out-of-frame) and enhance robustness, GazeMoE incorporates a class-balancing auxiliary loss alongside strategic data augmentations, including region-specific cropping and photometric transformations. Extensive experiments on benchmark datasets demonstrate that our GazeMoE achieves state-of-the-art performance, outperforming existing methods on challenging gaze estimation tasks. The code and pre-trained models are released at https://huggingface.co/zdai257/GazeMoE
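
To make the class-imbalance point concrete, here is a minimal sketch of a class-weighted binary cross-entropy for the in-frame vs. out-of-frame decision. The abstract does not give the form of the class-balancing auxiliary loss, so the inverse-frequency weighting below is an assumption chosen for illustration, and the names `class_weights` and `balanced_bce` are hypothetical rather than taken from the released code.

```python
import math

def class_weights(labels):
    """Inverse-frequency weights for binary labels (1 = in-frame, 0 = out-of-frame).

    Assumption: this weighting scheme is illustrative, not the paper's definition.
    """
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        return 1.0, 1.0  # only one class present: nothing to balance
    # Each class contributes equally in total: weight = n / (2 * class count).
    return n / (2 * n_pos), n / (2 * n_neg)

def balanced_bce(probs, labels, eps=1e-7):
    """Class-weighted binary cross-entropy, averaged over the batch."""
    w_pos, w_neg = class_weights(labels)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        if y == 1:
            total += -w_pos * math.log(p)
        else:
            total += -w_neg * math.log(1 - p)
    return total / len(labels)
```

With three in-frame samples and one out-of-frame sample, the single out-of-frame sample receives three times the weight of each in-frame sample, so both classes contribute equally to the loss.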

Paper Structure (12 sections, 11 equations, 3 figures, 8 tables)

This paper contains 12 sections, 11 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: GazeMoE data flow diagram. A frozen DINOv2 backbone extracts fine-grained scene representations, and a Mixture-of-Experts (MoE) module selectively routes gaze-related cues.
  • Figure 2: GazeMoE architecture. Given an input image, the model predicts whether a person's gaze target is in-frame or out-of-frame and where the target is. A frozen DINOv2 ViT-L backbone extracts fine-grained scene representations. The GazeMoE decoder, with three Mixture-of-Experts blocks, specializes in selectively routing gaze-target-related features.
  • Figure 3: Qualitative results. We use the dlib face detector to automatically generate head bounding-box prompts, shown as green frames.
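
The selective routing mentioned in the Figure 1 and Figure 2 captions can be illustrated with a minimal sketch of sparse top-k expert routing for a single token feature. The captions do not specify the gating scheme, the number of experts, or the value of k, so everything here is an assumption: the function `moe_forward` is hypothetical, the router scores are passed in directly (in a trained model they would come from a learned layer), and the experts are plain Python callables standing in for small networks.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, gate_logits, experts, k=2):
    """Route one token feature vector through the top-k experts.

    token: list of floats.
    gate_logits: one router score per expert (assumed given here).
    experts: callables mapping a vector to a vector of the same length.
    Returns the gated mixture and the indices of the selected experts.
    """
    # Pick the k experts with the highest router scores.
    top = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Renormalise the gate over the selected experts only.
    weights = softmax([gate_logits[i] for i in top])
    out = [0.0] * len(token)
    for w, i in zip(weights, top):
        y = experts[i](token)
        out = [o + w * v for o, v in zip(out, y)]
    return out, top

# Usage: three toy experts; the third has a low score and is skipped.
experts = [
    lambda v: [2 * x for x in v],
    lambda v: [x + 1 for x in v],
    lambda v: [0.0 for _ in v],
]
mixed, chosen = moe_forward([1.0, 2.0], [1.0, 1.0, -5.0], experts, k=2)
```

Only the selected experts are evaluated, which is what makes this style of routing cheaper than running every expert on every token.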