Reasoning in visual navigation of end-to-end trained agents: a dynamical systems approach
Steeven Janny, Hervé Poirier, Leonid Antsfeld, Guillaume Bono, Gianluca Monaci, Boris Chidlovskii, Francesco Giuliari, Alessio Del Bue, Christian Wolf
TL;DR
This work investigates what end-to-end visual navigation policies learn when they are trained with realistic robot motion and deployed on real hardware. Using 262 real-robot episodes and a simulator augmented with a second-order dynamical model, the authors show that such policies develop a latent dynamical system that supports open-loop prediction corrected by sensing. This is complemented by an episodic memory of scene structure and exploration, while long-horizon value signals suggest planning tendencies. Probing analyses reveal short-to-medium-horizon pose prediction from the latent state and a Kalman-like correction mechanism, although explicit long-range planning is not strongly evidenced. The study also demonstrates that memory and motion-model realism improve sim2real transfer, and it introduces diagnostic tools, such as the distance-to-belief metric and probing networks, to study dynamics and memory content. Overall, the work advances the understanding of how grounding in real motion shapes reasoning, planning, and control in embodied agents, and it provides tools to analyze these emergent capabilities.
Abstract
Progress in Embodied AI has made it possible for end-to-end-trained agents to navigate in photo-realistic environments with high-level reasoning and zero-shot or language-conditioned behavior, but benchmarks are still dominated by simulation. In this work, we focus on the fine-grained behavior of fast-moving real robots and present a large-scale experimental study involving 262 navigation episodes in a real environment with a physical robot, where we analyze the type of reasoning emerging from end-to-end training. In particular, we study the presence of realistic dynamics which the agent learned for open-loop forecasting, and their interplay with sensing. We analyze the way the agent uses latent memory to hold elements of the scene structure and information gathered during exploration. We probe the planning capabilities of the agent, and find in its memory evidence for somewhat precise plans over a limited horizon. Furthermore, we show in a post-hoc analysis that the value function learned by the agent relates to long-term planning. Put together, our experiments paint a new picture of how using tools from computer vision and sequential decision making has led to new capabilities in robotics and control. An interactive tool is available at europe.naverlabs.com/research/publications/reasoning-in-visual-navigation-of-end-to-end-trained-agents.
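To make the interplay between open-loop forecasting and sensing more concrete, the following is a minimal sketch of a classical predict/correct loop on a one-dimensional position estimate. It is a textbook scalar Kalman-style filter, not the agent's learned mechanism: the names `Belief`, `predict`, `correct`, and `run_demo`, as well as all numeric values, are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Belief:
    """Hypothetical scalar belief: position estimate and its variance."""
    mean: float
    var: float


def predict(b: Belief, velocity: float, dt: float, process_var: float) -> Belief:
    """Open-loop step: advance the belief using the motion model only.

    Uncertainty grows because no observation is used.
    """
    return Belief(b.mean + velocity * dt, b.var + process_var)


def correct(b: Belief, observation: float, obs_var: float) -> Belief:
    """Sensing step: pull the belief toward a noisy observation.

    The gain weighs the prediction against the observation
    according to their variances.
    """
    gain = b.var / (b.var + obs_var)
    return Belief(b.mean + gain * (observation - b.mean), (1.0 - gain) * b.var)


def run_demo():
    """Run three open-loop steps, then one correction (illustrative values)."""
    b = Belief(mean=0.0, var=1.0)
    for _ in range(3):
        b = predict(b, velocity=1.0, dt=0.5, process_var=0.1)
    open_loop = b
    corrected = correct(open_loop, observation=1.8, obs_var=0.2)
    return open_loop, corrected


if __name__ == "__main__":
    open_loop, corrected = run_demo()
    print("open-loop belief:", open_loop)
    print("corrected belief:", corrected)
```

In this sketch, the variance increases during the open-loop steps and shrinks once an observation is incorporated, which is the qualitative behavior one would look for when probing a learned latent state for a similar correction.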
