Learning Predictive Visuomotor Coordination

Wenqi Jia; Bolin Lai; Miao Liu; Danfei Xu; James M. Rehg

TL;DR

The paper addresses forecasting of visuomotor coordination: it predicts future head pose $H$, gaze $G$, and upper-body joints $U$ from past states $S$ and egocentric video.

Abstract

Understanding and predicting human visuomotor coordination is crucial for applications in robotics, human-computer interaction, and assistive technologies. This work introduces a forecasting-based task for visuomotor modeling, where the goal is to predict head pose, gaze, and upper-body motion from egocentric visual and kinematic observations. We propose a Visuomotor Coordination Representation (VCR) that learns structured temporal dependencies across these multimodal signals. We extend a diffusion-based motion modeling framework that integrates egocentric vision and kinematic sequences, enabling temporally coherent and accurate visuomotor predictions. Our approach is evaluated on the large-scale EgoExo4D dataset, demonstrating strong generalization across diverse real-world activities. Our results highlight the importance of multimodal integration in understanding visuomotor coordination, contributing to research in visuomotor learning and human behavior modeling.
