Humanizing Robot Gaze Shifts: A Framework for Natural Gaze Shifts in Humanoid Robots

Jingchao Wei, Jingkai Qin, Yuxiao Cao, Jingcheng Huang, Xiangrui Zeng, Min Li, Zhouping Yin

TL;DR

This paper proposes the Robot Gaze-Shift (RGS) framework, which combines a vision–language model (VLM)-based gaze reasoning pipeline with a conditional Vector Quantized-Variational Autoencoder (VQ-VAE) model for eye–head coordinated gaze-shift motion generation, producing diverse and human-like gaze-shift behaviors.

Abstract

Leveraging auditory and visual feedback for attention reorientation is essential for natural gaze shifts in social interaction. However, enabling humanoid robots to perform natural and context-appropriate gaze shifts in unconstrained human–robot interaction (HRI) remains challenging, as it requires the coupling of cognitive attention mechanisms and biomimetic motion generation. In this work, we propose the Robot Gaze-Shift (RGS) framework, which integrates these two components into a unified pipeline. First, RGS employs a vision–language model (VLM)-based gaze reasoning pipeline to infer context-appropriate gaze targets from multimodal interaction cues, ensuring consistency with human gaze-orienting regularities. Second, RGS introduces a conditional Vector Quantized-Variational Autoencoder (VQ-VAE) model for eye–head coordinated gaze-shift motion generation, producing diverse and human-like gaze-shift behaviors. Experiments validate that RGS effectively replicates human-like target selection and generates realistic, diverse gaze-shift motions.

Paper Structure (14 sections, 8 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Overview of our RGS framework.
  • Figure 2: VLM-Based Gaze Reasoning Pipeline. Notation: superscripts $f$, $m$ and $*$ denote the final frame of a cycle, instance masking, and mark indexing, respectively.
  • Figure 3: Overview of the proposed gaze-shift motion generation model, trained in two stages. Top: a conditional VQ-VAE is trained to reconstruct eye–head rotation increments from ground-truth motions and conditioning inputs. Bottom: a conditional prior predicts a distribution over codebook entries from the conditioning inputs. The prior selects the maximum-probability code to compute the loss during training, while using stochastic sampling at inference time to enable behavioral diversity.
  • Figure 4: Qualitative examples of the proposed VLM-based gaze reasoning pipeline across four gaze-orienting regularities (H1–H4). Each row corresponds to one regularity and shows three representative inference cycles (T1–T3) selected at different times. The three cycles are temporally ordered but not necessarily consecutive. The red semi-transparent overlay indicates the gaze target selected by the pipeline at each illustrated inference cycle.
  • Figure 5: Validation-set MGD curves for the two-stage training. (a) Training Stage-1: conditional VQ-VAE reconstruction errors. (b) Training Stage-2: conditional prior inference errors. Both stages exhibit consistent error reduction and stable convergence.
  • ...and 1 more figures
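
The code-selection behavior described in the Figure 3 caption (maximum-probability code during training, stochastic sampling at inference) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `quantize`, `softmax`, and `select_code`, the two-dimensional codebook, and the logit values are all hypothetical, and the nearest-neighbor lookup is the generic VQ-VAE quantization step rather than anything specific to RGS.

```python
import math
import random


def quantize(latent, codebook):
    """Return the index of the codebook entry nearest to `latent` (squared L2).

    Generic VQ-VAE quantization step; the codebook here is a toy assumption.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda k: sq_dist(latent, codebook[k]))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_code(logits, stochastic, rng=None):
    """Pick a codebook index from (hypothetical) prior logits.

    stochastic=False: take the maximum-probability code.
    stochastic=True:  sample from the categorical distribution, so repeated
                      calls can yield different codes for the same input.
    """
    probs = softmax(logits)
    if not stochastic:
        return max(range(len(probs)), key=probs.__getitem__)
    rng = rng or random.Random()
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy data (assumed values, for illustration only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
nearest = quantize([0.9, 0.2], codebook)

prior_logits = [0.1, 2.0, 1.5]
greedy = select_code(prior_logits, stochastic=False)
samples = [select_code(prior_logits, True, random.Random(seed)) for seed in range(200)]
```

With a fixed conditioning input, the greedy choice always returns the same code, whereas the sampled choices spread over several codes in proportion to the prior probabilities, which is the source of behavioral diversity the caption refers to.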