Humanizing Robot Gaze Shifts: A Framework for Natural Gaze Shifts in Humanoid Robots

Jingchao Wei; Jingkai Qin; Yuxiao Cao; Jingcheng Huang; Xiangrui Zeng; Min Li; Zhouping Yin

TL;DR

This paper proposes the Robot Gaze-Shift (RGS) framework, which combines a vision--language model (VLM)-based gaze reasoning pipeline with a conditional Vector Quantized-Variational Autoencoder (VQ-VAE) model for eye--head coordinated gaze-shift motion generation, producing diverse and human-like gaze-shift behaviors.

Abstract

Leveraging auditory and visual feedback for attention reorientation is essential for natural gaze shifts in social interaction. However, enabling humanoid robots to perform natural and context-appropriate gaze shifts in unconstrained human--robot interaction (HRI) remains challenging, as it requires the coupling of cognitive attention mechanisms and biomimetic motion generation. In this work, we propose the Robot Gaze-Shift (RGS) framework, which integrates these two components into a unified pipeline. First, RGS employs a vision--language model (VLM)-based gaze reasoning pipeline to infer context-appropriate gaze targets from multimodal interaction cues, ensuring consistency with human gaze-orienting regularities. Second, RGS introduces a conditional Vector Quantized-Variational Autoencoder (VQ-VAE) model for eye--head coordinated gaze-shift motion generation, producing diverse and human-like gaze-shift behaviors. Experiments validate that RGS effectively replicates human-like target selection and generates realistic, diverse gaze-shift motions.

Paper Structure (14 sections, 8 equations, 6 figures, 1 table)

Introduction
Related Work
  Robot Gaze Reasoning
  Gaze-Shift Motion Generation
Methodology
  Interaction Scenario Perception
  Gaze Reasoning
  Gaze-Shift Motion Generation
Experiments
  Experimental Setup
  Evaluation of Gaze Reasoning Pipeline
  Training Analysis of Gaze-Shift Motion Generation Model
  Inference-Time Diversity Evaluation of Gaze-Shift Motion Generation
Conclusion

Figures (6)

Figure 1: Overview of our RGS framework.
Figure 2: VLM-Based Gaze Reasoning Pipeline. Notation: superscripts $f$, $m$ and $*$ denote the final frame of a cycle, instance masking, and mark indexing, respectively.
Figure 3: Overview of the proposed gaze-shift motion generation model, trained in two stages. Top: a conditional VQ-VAE is trained to reconstruct eye--head rotation increments from ground-truth motions and conditioning inputs. Bottom: a conditional prior predicts a distribution over codebook entries from the conditioning inputs. During training, the prior's maximum-probability code is used to compute the loss; at inference time, codes are sampled stochastically to enable behavioral diversity.
Figure 4: Qualitative examples of the proposed VLM-based gaze reasoning pipeline across four gaze-orienting regularities (H1--H4). Each row corresponds to one regularity and shows three representative inference cycles (T1--T3) selected at different times. The three cycles are temporally ordered but not necessarily consecutive. The red semi-transparent overlay indicates the gaze target selected by the pipeline at each illustrated inference cycle.
Figure 5: Validation-set MGD curves for the two-stage training. (a) Training Stage-1: conditional VQ-VAE reconstruction errors. (b) Training Stage-2: conditional prior inference errors. Both stages exhibit consistent error reduction and stable convergence.
...and 1 more figure
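
The code-selection behavior described in the Figure 3 caption (maximum-probability code during training, stochastic sampling at inference) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `softmax`, `quantize`, and `select_code`, the toy codebook, and the logit values are all hypothetical, and the nearest-entry lookup in `quantize` is the standard VQ-VAE quantization step, assumed here rather than taken from the paper.

```python
import math
import random


def softmax(logits):
    """Convert raw prior scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def quantize(z, codebook):
    """Standard VQ step (assumed): index of the codebook entry nearest to z."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))


def select_code(prior_logits, training, rng=None):
    """Pick a codebook index from the prior's predicted distribution.

    training=True  -> maximum-probability code (deterministic)
    training=False -> sample from the distribution (diverse outputs)
    """
    probs = softmax(prior_logits)
    if training:
        return max(range(len(probs)), key=probs.__getitem__)
    rng = rng or random.Random()
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    # Hypothetical 3-entry codebook of 2-D latent vectors.
    codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
    # Hypothetical prior scores for one conditioning input.
    logits = [0.1, 2.0, -1.0]

    print("nearest code for encoder output:", quantize([0.9, 1.1], codebook))
    print("training-time code:", select_code(logits, training=True))
    rng = random.Random(0)
    print("inference-time codes:",
          [select_code(logits, training=False, rng=rng) for _ in range(10)])
```

Under these assumptions, repeated inference calls with the same conditioning input can return different codebook indices, which is the mechanism the caption credits for behavioral diversity, while the training-time choice stays deterministic.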