Prime and Reach: Synthesising Body Motion for Gaze-Primed Object Reach

Masashi Hatano, Saptarshi Sinha, Jacob Chalk, Wei-Hong Li, Hideo Saito, Dima Damen

TL;DR

This work addresses the gap in synthesising gaze-primed object reach by curating the 23.7K-sequence Prime & Reach (P&R) dataset from five publicly available datasets and training a diffusion-based motion model conditioned on text, the initial state, and a goal (either the full goal pose or the object location). It introduces a new Prime Success metric to evaluate gaze priming and reports improvements over baselines across multiple datasets, with the strongest gains when conditioning on the full goal pose and meaningful gains when conditioning on the object location. The approach pairs egocentric gaze data with full-body motion, using EgoAllo for pose estimation, and pretrains on Nymeria to learn fine-grained everyday motion priors; ablations examine the impact of the conditioning choices and of pretraining. The resulting P&R model targets realistic gaze-primed motion synthesis, with potential applications in animation, AR/VR, and robotics, where anticipatory, gaze-guided actions are important.

Abstract

Human motion generation is a challenging task that aims to create realistic motion imitating natural human behaviour. We focus on the well-studied behaviour of priming an object/location for pick up or put down -- that is, the spotting of an object/location from a distance, known as gaze priming, followed by the motion of approaching and reaching the target location. To that end, we curate, for the first time, 23.7K gaze-primed human motion sequences for reaching target object locations from five publicly available datasets, i.e., HD-EPIC, MoGaze, HOT3D, ADT, and GIMO. We pre-train a text-conditioned diffusion-based motion generation model, then fine-tune it conditioned on goal pose or location, on our curated sequences. Importantly, we evaluate the ability of the generated motion to imitate natural human movement through several metrics, including the 'Reach Success' and a newly introduced 'Prime Success' metric. On the largest dataset, HD-EPIC, our model achieves 60% prime success and 89% reach success when conditioned on the goal object location.
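
The abstract names the 'Prime Success' and 'Reach Success' metrics without defining them here. As a rough illustration only, the sketch below assumes that priming counts as successful when the gaze ray intersects a sphere around the target in some frame before the reach, and that a reach counts as successful when the hand ends within a distance threshold of the target. The function names, the sphere radius, the threshold, and the "any frame before reach" rule are all assumptions made for this example, not the paper's definitions.

```python
import math

def gaze_hits_sphere(origin, direction, centre, radius):
    """True if the ray from `origin` along `direction` intersects the sphere."""
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0:
        return False
    d = [x / norm for x in direction]
    oc = [c - o for c, o in zip(centre, origin)]
    # Distance along the ray to the point closest to the sphere centre.
    t = sum(a * b for a, b in zip(oc, d))
    dist_sq = sum(a * a for a in oc)
    if t < 0:
        # Target lies behind the viewer; only a hit if the eye is inside the sphere.
        return dist_sq <= radius * radius
    # Squared perpendicular distance from the sphere centre to the ray.
    return dist_sq - t * t <= radius * radius

def prime_success(gaze_origins, gaze_dirs, target, radius, reach_frame):
    """Assumed rule: gaze intersects the target in any frame before `reach_frame`."""
    frames = zip(gaze_origins[:reach_frame], gaze_dirs[:reach_frame])
    return any(gaze_hits_sphere(o, d, target, radius) for o, d in frames)

def reach_success(hand_position, target, threshold):
    """Assumed rule: final hand position lies within `threshold` of the target."""
    return math.dist(hand_position, target) <= threshold

# Toy sequence: the gaze turns towards the target in the third frame.
target = (2.0, 0.0, 0.6)
origins = [(0.0, 0.0, 1.6)] * 3
dirs = [(0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, -0.5)]
primed = prime_success(origins, dirs, target, radius=0.15, reach_frame=3)
reached = reach_success((1.95, 0.0, 0.62), target, threshold=0.1)
```

Under these assumptions, a dataset-level score such as the 60% prime success and 89% reach success quoted above would be the fraction of generated sequences for which the corresponding function returns `True`.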

Paper Structure

This paper contains 35 sections, 9 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Prime & Reach sequences from HD-EPIC (Perrett et al., 2025), using full-body pose from EgoAllo (Yi et al., 2025). (Left) A sequence starting with the intention to reach the container (cyan sphere). Gaze priming is evident (gaze intersecting the object) during the approach before reaching the object. (Right) Similar behaviour is noted for priming and picking up the scale (cyan sphere). Darker colours indicate later time.
  • Figure 2: Examples of curated P&R motion sequences from five different datasets.
  • Figure 3: P&R motion diffusion model for goal-conditioned motion generation. We concatenate the initial state of the human body and the goal pose/goal object as conditions, along with a text condition describing the type of action the motion is expected to perform. This accumulated condition is injected into the transformer decoder layers, which then outputs an $N$-length motion sequence over multiple diffusion steps.
  • Figure 4: P&R performance for pick vs. put.
  • Figure 5: Qualitative results on 3 datasets: ground truth sequence in light green, goal-pose conditioned prediction in translucent yellow, and target-location conditioned generation in brown. We show the pose at the initial, prime, and reach timesteps. The prime direction for both ground truth and predictions is shown using arrows, and the target object location is shown as a sphere.
  • ...and 4 more figures