Learning Predictive Visuomotor Coordination

Wenqi Jia, Bolin Lai, Miao Liu, Danfei Xu, James M. Rehg

TL;DR

The paper tackles predicting future visuomotor coordination by forecasting head pose $H$, gaze $G$, and upper-body joints $U$ from past states $S$ and egocentric video, formalizing $S=\ldots$

Abstract

Understanding and predicting human visuomotor coordination is crucial for applications in robotics, human-computer interaction, and assistive technologies. This work introduces a forecasting-based task for visuomotor modeling, where the goal is to predict head pose, gaze, and upper-body motion from egocentric visual and kinematic observations. We propose a Visuomotor Coordination Representation (VCR) that learns structured temporal dependencies across these multimodal signals. We extend a diffusion-based motion modeling framework that integrates egocentric vision and kinematic sequences, enabling temporally coherent and accurate visuomotor predictions. Our approach is evaluated on the large-scale EgoExo4D dataset, demonstrating strong generalization across diverse real-world activities. Our results highlight the importance of multimodal integration in understanding visuomotor coordination, contributing to research in visuomotor learning and human behavior modeling.

Paper Structure

This paper contains 25 sections, 5 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: We represent the Human Visuomotor System as a joint encoding of Head Pose, 3D Gaze Direction, and Upper-Body Joints. Given a sequence of visuomotor inputs and egocentric frames, our goal is to predict how the system coordinates its movements in the future. As forecasted egocentric frames are unavailable, predicted 2D gaze is mapped by finding the intersection of the gaze ray and the environment.
  • Figure 2: Visualizing the Visuomotor Coordination Representation by mapping it onto a human mesh for better interpretability.
  • Figure 3: Illustration of the canonicalization process for visuomotor states, which helps mitigate the effects of absolute head motion and viewpoint variations.
  • Figure 4: Model Architecture.
  • Figure 5: Visualization of predicted visuomotor coordination across diverse real-world scenes. Each row corresponds to a different scene, while columns illustrate the temporal evolution of visuomotor coordination. The first column represents the observed states, while subsequent columns display predictions from time step 2 to 10. Predicted poses and ground truth poses are overlaid for interpretability. View in color for best results.
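
The canonicalization described in the Figure 3 caption can be illustrated with a small sketch. This is not the paper's implementation: it assumes one common choice, namely expressing every world-frame quantity relative to a reference head pose (for example the last observed one), so that p' = R^T (p - t) for points and d' = R^T d for directions. The function names `canonicalize_points`, `canonicalize_direction`, and `yaw_rotation` are hypothetical and introduced only for illustration.

```python
from math import cos, sin, pi


def transpose(m):
    """Transpose a 3x3 matrix given as nested lists."""
    return [list(row) for row in zip(*m)]


def matvec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def canonicalize_points(points, ref_rotation, ref_translation):
    """Express world-frame 3D points in the reference head frame.

    Assumption (not taken from the paper): the reference head pose is a
    rotation R (head axes in world coordinates) and a translation t (head
    position in world coordinates), so each point maps to R^T (p - t).
    """
    r_inv = transpose(ref_rotation)  # inverse of a rotation is its transpose
    return [
        matvec(r_inv, [p[k] - ref_translation[k] for k in range(3)])
        for p in points
    ]


def canonicalize_direction(direction, ref_rotation):
    """Rotate a world-frame direction (e.g. a gaze ray) into the head frame.

    Directions are unaffected by translation, so only R^T is applied.
    """
    return matvec(transpose(ref_rotation), direction)


def yaw_rotation(theta):
    """Rotation by theta radians about the z (up) axis."""
    return [
        [cos(theta), -sin(theta), 0.0],
        [sin(theta), cos(theta), 0.0],
        [0.0, 0.0, 1.0],
    ]


# Example: a head located at (1, 2, 0) and turned 90 degrees about the up axis.
R = yaw_rotation(pi / 2)
t = [1.0, 2.0, 0.0]

# A joint one unit along the world y axis from the head ends up one unit
# along the head's own x axis, regardless of where the head is in the world.
joints_local = canonicalize_points([[1.0, 3.0, 0.0]], R, t)
gaze_local = canonicalize_direction([0.0, 1.0, 0.0], R)
```

Under this assumption, two sequences that differ only in where the wearer stands or which way they initially face map to the same canonical coordinates, which is the kind of invariance to absolute head motion and viewpoint that the caption refers to.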