EgoCogNav: Cognition-aware Human Egocentric Navigation

Zhiwen Qiu, Ziang Liu, Wenqian Niu, Tapomayukh Bhattacharjee, Saleh Kalantari

TL;DR

EgoCogNav tackles cognition-aware egocentric navigation by jointly forecasting future body-frame trajectories, head motion, and moment-to-moment perceived uncertainty from multimodal first-person data. The model fuses scene features from a pre-trained backbone with motion cues, employing adaptive goal conditioning and dual decoders, while regularizing with auxiliary tasks to improve robustness. A new CEN dataset of 6 hours across diverse indoor/outdoor scenes with cognitive annotations enables training and evaluation of cognition-informed forecasts, showing that predicted uncertainty aligns with real navigation challenges like hesitation and backtracking. The approach advances safe, socially aware navigation and assistive wayfinding by modeling internal cognitive states alongside motion, and it opens avenues for richer 3D/contextual reasoning and multi-hypothesis planning in future work.

Abstract

Modeling the cognitive and experiential factors of human navigation is central to deepening our understanding of human-environment interaction and to enabling safe social navigation and effective assistive wayfinding. Most existing methods focus on forecasting motions in fully observed scenes and often neglect human factors that capture how people feel and respond to space. To address this gap, we propose EgoCogNav, a multimodal egocentric navigation framework that predicts perceived path uncertainty as a latent state and jointly forecasts trajectories and head motion by fusing scene features with sensory cues. To facilitate research in the field, we introduce the Cognition-aware Egocentric Navigation (CEN) dataset, consisting of 6 hours of real-world egocentric recordings that capture diverse navigation behaviors. Experiments show that EgoCogNav learns a perceived uncertainty that correlates strongly with human-like behaviors such as scanning, hesitation, and backtracking, while generalizing to unseen environments.
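
The data flow described above (fuse scene and sensory features, condition on the goal, estimate perceived uncertainty, then decode trajectory and head motion) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the feature sizes, the random untrained weights, the sigmoid gate used as one possible reading of "adaptive goal conditioning", the 3-value head pose, and the function name `forward` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Random weights standing in for a trained layer (illustrative only)."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim)), np.zeros(out_dim)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes; none of these values come from the paper.
SCENE_DIM, MOTION_DIM, GOAL_DIM, HIDDEN = 32, 16, 2, 24
HORIZON = 5  # number of future steps to forecast

W_fuse, b_fuse = linear(SCENE_DIM + MOTION_DIM, HIDDEN)
W_goal, b_goal = linear(GOAL_DIM, HIDDEN)
W_gate, b_gate = linear(2 * HIDDEN, HIDDEN)
W_unc, b_unc = linear(HIDDEN, 1)
W_traj, b_traj = linear(HIDDEN + 1, HORIZON * 2)
W_head, b_head = linear(HIDDEN + 1, HORIZON * 3)

def forward(scene_feat, motion_feat, goal):
    """Return (trajectory, head_motion, uncertainty) for one past window."""
    # 1. Fuse scene features with motion/sensory cues into a shared state.
    fused = np.tanh(np.concatenate([scene_feat, motion_feat]) @ W_fuse + b_fuse)

    # 2. Goal conditioning through a gate (assumed form): the gate decides,
    #    per hidden unit, how much of the goal embedding to mix in.
    goal_emb = np.tanh(goal @ W_goal + b_goal)
    gate = sigmoid(np.concatenate([fused, goal_emb]) @ W_gate + b_gate)
    state = fused + gate * goal_emb

    # 3. Perceived uncertainty as a scalar in (0, 1) read from the state.
    uncertainty = sigmoid(state @ W_unc + b_unc)

    # 4. Two decoders share the state and the uncertainty estimate.
    dec_in = np.concatenate([state, uncertainty])
    trajectory = (dec_in @ W_traj + b_traj).reshape(HORIZON, 2)   # (x, y) per step
    head_motion = (dec_in @ W_head + b_head).reshape(HORIZON, 3)  # assumed 3 angles per step
    return trajectory, head_motion, float(uncertainty[0])

traj, head, unc = forward(np.ones(SCENE_DIM), np.ones(MOTION_DIM), np.array([1.0, 0.0]))
print(traj.shape, head.shape)
```

Feeding the uncertainty estimate into both decoders is one simple way to let the motion forecasts depend on the latent cognitive state; the paper's actual coupling between the two may differ.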

Paper Structure

This paper contains 16 sections, 5 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Given a past window of motion, head rotations, gaze, and navigational goal, our model jointly predicts a future body-frame trajectory, head poses, and the current state of perceived uncertainty. This setting reflects real-world navigation challenges in environments where the model must anticipate the internal cognitive state and make decisions about subsequent head motion and movement.
  • Figure 2: EgoCogNav architecture. Given a past egocentric video and sensory inputs, we fuse them into a shared representation with adaptive goal conditioning to jointly predict the current state of perceived uncertainty, future trajectory and head motion.
  • Figure 3: Qualitative visualizations. The top row presents BEV examples under high-uncertainty and behavior-eliciting scenarios. For each scenario on the bottom row, the left panel shows a BEV overlay with past trajectory (gray), ground-truth future, predicted future, and overlaid path uncertainty. The right panels show time-aligned egocentric frames (t+1 to t+3s) with ground-truth (red) and predicted (green) head positions. Red dots mark the parts of the environment that trigger uncertain behaviors.
  • Figure 4: Failure cases. Two failure cases highlight limits in long-horizon scene memory and the lack of multi-hypothesis futures.