EgoNav: Egocentric Scene-aware Human Trajectory Prediction

Weizhuo Wang, C. Karen Liu, Monroe Kennedy

TL;DR

This work tackles egocentric trajectory prediction for wearable robots by conditioning future motion on both the past trajectory and a rich egocentric scene representation. It introduces a diffusion-based predictor that operates on a compact Visual Memory embedding derived from aligned RGBD and semantic data, enabling multimodal future trajectory sampling at real-time rates. Key contributions include the Visual Memory representation, a hybrid DDIM-DDPM sampling scheme for fast yet high-fidelity inference, and a comprehensive egocentric navigation dataset with diverse indoor and outdoor scenarios. The results show improved collision avoidance and mode coverage over baselines, supporting the approach for safer, scene-aware human-robot collaboration and informing downstream planning and imitation learning tasks.

Abstract

Wearable collaborative robots stand to assist human wearers who need fall prevention assistance or wear exoskeletons. Such a robot needs to constantly adapt to the surrounding scene based on egocentric vision and predict the ego motion of the wearer. In this work, we leverage body-mounted cameras and sensors to anticipate the trajectory of human wearers through complex surroundings. To facilitate research in ego-motion prediction, we have collected a comprehensive walking scene navigation dataset centered on the user's perspective. We then present a method to predict human motion conditioned on the surrounding static scene. Our method leverages a diffusion model to produce a distribution of potential future trajectories, taking into account the user's observation of the environment. To that end, we introduce a compact representation to encode the user's visual memory of the surroundings, as well as an efficient sample-generating technique to speed up real-time inference of a diffusion model. We ablate our model and compare it to baselines, and the results show that our model outperforms existing methods on key metrics of collision avoidance and trajectory mode coverage.

Paper Structure

This paper contains 21 sections, 4 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Visual Representation of the Problem: Given the past in blue, and learning from single-modal ground truth in green, what are the likely future paths and their likelihood? Each red ribbon illustrates a different possible mode in the scene, and the size of the ribbon denotes its likelihood.
  • Figure 2: Overview of the proposed method: We maintain a 5-second buffer of logs that is most relevant to the prediction and organize them into a visual memory frame. All input and output of the prediction module are in the ego-centric frame.
  • Figure 3: Channels in Visual Memory: The visual memory integrates frames from various time steps into a single panorama. It consists of a depth channel, color channel, and intensity-encoded 8-class semantic channel. 4 channels are shown in the figure.
  • Figure 4: Comparing depth frame with visual memory: A raw depth frame from a stereo camera has only a narrow 90-degree FOV and often misses important scene information. In the figure, the depth frame sees only the open space in front and does not capture the stairs, the right-turn path, or the wall directly to the left. The black regions are areas that remain undiscovered after stitching frames from different time steps.
  • Figure 5: Diffusion Model: Architecture and hybrid generation details. The black and white stripes represent the time-stepping schemes of the inference. In default mode, all steps are diffused sequentially. In our hybrid mode, we first perform stripped DDIM steps from 1000 to 0, then DDPM steps from n to 0. Conditions, including the compressed VM and past states, are passed to the embedding layer and concatenated to the block output.
  • ...and 4 more figures
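
The hybrid time-stepping scheme described in the Figure 5 caption can be illustrated with a small schedule builder. This is only a sketch under stated assumptions, not the authors' implementation: the function name `hybrid_schedule`, the stride of 50, and the hand-off value n = 20 are hypothetical, timesteps are taken as 0-indexed, and the coarse DDIM phase is assumed to hand over to DDPM at step n (the caption itself states the DDIM range as 1000 to 0). The sketch builds only the list of timesteps and the sampler used at each one; the denoising network and update rules are omitted.

```python
def hybrid_schedule(total_steps=1000, ddim_stride=50, ddpm_tail=20):
    """Return (timestep, sampler) pairs for one reverse pass, noisiest first.

    All parameter names and defaults are illustrative assumptions:
      total_steps : number of diffusion steps used in training
      ddim_stride : gap between consecutive timesteps in the coarse DDIM phase
      ddpm_tail   : number of final timesteps (the caption's n) run with DDPM
    """
    if ddim_stride < 1:
        raise ValueError("ddim_stride must be at least 1")
    if not 0 <= ddpm_tail <= total_steps:
        raise ValueError("ddpm_tail must lie in [0, total_steps]")

    # Phase 1: coarse DDIM jumps over the high-noise timesteps (>= ddpm_tail).
    ddim = [(t, "ddim")
            for t in range(total_steps - 1, ddpm_tail - 1, -ddim_stride)]

    # Phase 2: DDPM at every remaining low-noise timestep, down to 0.
    ddpm = [(t, "ddpm") for t in range(ddpm_tail - 1, -1, -1)]

    return ddim + ddpm


schedule = hybrid_schedule()
default_mode = hybrid_schedule(ddpm_tail=1000)  # every step run sequentially

print(len(schedule), "network evaluations in hybrid mode")
print(len(default_mode), "network evaluations in default mode")
```

With these assumed settings the hybrid pass visits 40 timesteps instead of 1000. Setting `ddpm_tail=0` yields a pure strided DDIM pass, which makes the speed and fidelity trade-off between the two phases easy to explore.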