EgoMoD: Predicting Global Maps of Dynamics from Local Egocentric Observations

Iacopo Catalano, David Morilla-Cabello, Jorge Pena-Queralta, Eduardo Montijano

TL;DR

This work introduces EgoMoD, the first approach that learns to predict future MoDs directly from short egocentric video clips collected during robot operation, and offers the capacity to forecast future motion dynamics over the whole environment rather than merely extend past patterns in the robot's field of view.

Abstract

Efficient navigation in dynamic environments requires anticipating how motion patterns evolve beyond the robot's immediate perceptual range, enabling preemptive rather than purely reactive planning in crowded scenes. Maps of Dynamics (MoDs) offer a structured representation of motion tendencies in space useful for long-term global planning, but constructing them traditionally requires global environment observations over extended periods of time. We introduce EgoMoD, the first approach that learns to predict future MoDs directly from short egocentric video clips collected during robot operation. Our method learns to infer environment-wide motion tendencies from local dynamic cues using a video- and pose-conditioned architecture trained with MoDs computed from external observations as privileged supervision, allowing local observations to serve as predictive signals of global motion structure. Thanks to this, we offer the capacity to forecast future motion dynamics over the whole environment rather than merely extend past patterns in the robot's field of view. Experiments in large simulated environments show that EgoMoD accurately predicts future MoDs under limited observability, while evaluation with real images showcases its zero-shot transferability to real systems.

Paper Structure (23 sections, 10 equations, 7 figures, 3 tables)


Figures (7)

  • Figure 1: EgoMoD learns to predict global maps of dynamics (bottom) from local egocentric video observations captured by the robot (top), leveraging priors from a privileged expert.
  • Figure 2: EgoMoD architecture overview. A short egocentric video sequence is processed by a frozen foundation video encoder to extract spatio-temporal features. These features are combined with a learned pose embedding via a Transformer encoder that applies self-attention over a combined sequence of visual patches and robot pose. The resulting representations are reshaped to a spatial grid and decoded through convolutional and upsampling layers to produce allocentric Maps of Dynamics.
  • Figure 3: Example of construction of our Maps of Dynamics
  • Figure 4: Overview of the hospital simulation environment shown from a top-down view (top) and a perspective view (bottom).
  • Figure 5: Photorealistic office simulated environment
  • ...and 2 more figures
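
The data flow described in the Figure 2 caption can be illustrated with a toy, standard-library-only sketch. This is not the authors' implementation, and every name and size in it is an assumption made for illustration: random vectors stand in for the frozen video encoder's patch features, a random matrix stands in for the learned pose embedding, a single-head self-attention with identity projections stands in for the Transformer encoder, and a per-cell linear readout followed by nearest-neighbour upsampling stands in for the convolutional decoder. The output here is a single-channel grid; the actual content of a Map of Dynamics cell is not specified in this summary.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (toy stand-in)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

def upsample_nearest(grid, factor):
    """Nearest-neighbour upsampling of a 2D grid of scalars."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

def predict_map(patch_feats, pose, W_pose, w_out, grid_hw, factor):
    """Hypothetical forward pass: patches + pose token -> attention -> grid -> upsample."""
    h, w = grid_hw
    assert len(patch_feats) == h * w
    pose_token = matvec(W_pose, pose)        # pose embedding (random weights here)
    tokens = patch_feats + [pose_token]      # combined sequence of patches and pose
    mixed = self_attention(tokens)
    patch_out = mixed[:-1]                   # keep patch tokens, drop the pose token
    # Per-cell linear readout, standing in for the convolutional decoder.
    cells = [sum(a * b for a, b in zip(w_out, t)) for t in patch_out]
    grid = [cells[r * w:(r + 1) * w] for r in range(h)]   # reshape to a spatial grid
    return upsample_nearest(grid, factor)

random.seed(0)
D, H, W = 8, 2, 3                            # assumed feature size and token grid
patch_feats = [[random.gauss(0, 1) for _ in range(D)] for _ in range(H * W)]
W_pose = [[random.gauss(0, 1) for _ in range(3)] for _ in range(D)]
w_out = [random.gauss(0, 1) for _ in range(D)]
pose = [1.0, 2.0, 0.5]                       # assumed (x, y, heading) robot pose

mod_map = predict_map(patch_feats, pose, W_pose, w_out, (H, W), factor=4)
print(len(mod_map), len(mod_map[0]))
```

Because the pose token takes part in attention alongside the visual patches, changing the pose changes every output cell even when the video features are held fixed, which is the sense in which the prediction is pose-conditioned in this sketch.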