Reasoning in visual navigation of end-to-end trained agents: a dynamical systems approach

Steeven Janny, Hervé Poirier, Leonid Antsfeld, Guillaume Bono, Gianluca Monaci, Boris Chidlovskii, Francesco Giuliari, Alessio Del Bue, Christian Wolf

TL;DR

This work investigates what end-to-end visual navigation policies learn when trained with realistic robot motion and deployed on real hardware. Through 262 real-robot episodes and simulator augmentation with a second-order dynamical model, the authors show that end-to-end policies develop a latent dynamical system supporting open-loop prediction corrected by sensing and augmented by episodic memory of scene structure and exploration, with long-horizon value signals suggesting planning tendencies. Probing analyses reveal short-to-medium horizon pose prediction from the latent state and a Kalman-like correction mechanism, though explicit long-range planning is not strongly evidenced. The study also demonstrates that memory and motion-model realism improve sim2real transfer, and introduces diagnostic tools such as the distance-to-belief metric and probing networks to study dynamics and memory content. Overall, the work advances understanding of how grounding in real motion shapes reasoning, planning, and control in embodied agents, and provides tools to analyze these emergent capabilities.

Abstract

Progress in Embodied AI has made it possible for end-to-end-trained agents to navigate in photo-realistic environments with high-level reasoning and zero-shot or language-conditioned behavior, but benchmarks are still dominated by simulation. In this work, we focus on the fine-grained behavior of fast-moving real robots and present a large-scale experimental study involving 262 navigation episodes in a real environment with a physical robot, where we analyze the type of reasoning emerging from end-to-end training. In particular, we study the presence of realistic dynamics, which the agent learned for open-loop forecasting, and their interplay with sensing. We analyze the way the agent uses latent memory to hold elements of the scene structure and information gathered during exploration. We probe the planning capabilities of the agent, and find in its memory evidence for somewhat precise plans over a limited horizon. Furthermore, we show in a post-hoc analysis that the value function learned by the agent relates to long-term planning. Put together, our experiments paint a new picture of how using tools from computer vision and sequential decision making has led to new capabilities in robotics and control. An interactive tool is available at europe.naverlabs.com/research/publications/reasoning-in-visual-navigation-of-end-to-end-trained-agents.
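To make the idea of open-loop forecasting corrected by sensing concrete, the following is a minimal one-dimensional sketch of a Kalman-style predict/correct loop. It is only an illustration of the general mechanism, not the agent's learned latent update: the function names, the scalar state, and all noise values are assumptions chosen for the example.

```python
# Illustrative 1-D predict/correct loop (assumed toy setup, not the paper's model).

def predict(mean, var, velocity, dt, process_var):
    """Open-loop step: integrate the commanded velocity; uncertainty grows."""
    return mean + velocity * dt, var + process_var


def correct(mean, var, observation, obs_var):
    """Correction step: blend prediction and observation by their variances."""
    gain = var / (var + obs_var)  # scalar Kalman gain, between 0 and 1
    return mean + gain * (observation - mean), (1.0 - gain) * var


def demo():
    # Hypothetical values: start at position 0 with small uncertainty.
    mean, var = 0.0, 0.01
    # Ten open-loop steps at 0.5 m/s with dt = 0.1 s, no sensing.
    for _ in range(10):
        mean, var = predict(mean, var, velocity=0.5, dt=0.1, process_var=0.01)
    open_loop = (mean, var)
    # A single noisy position observation pulls the belief back.
    corrected = correct(mean, var, observation=0.45, obs_var=0.05)
    return open_loop, corrected
```

Calling `demo()` returns the belief after pure forecasting and after one correction. In this toy setting the variance grows during the open-loop phase and shrinks once an observation arrives, and the corrected mean lies between the forecast and the observation.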

Paper Structure

This paper contains 20 sections, 7 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: In a large-scale analysis of 262 episodes of a real robot in a real environment, we report on the type of reasoning emerging after end-to-end training agents with realistic motion: they learn a dynamical motion model exploited with open-loop forecasting and corrected by sensing, latent scene structure, exploration information, and long-term value estimates.
  • Figure 2: We build upon bono2024learning, with improvements that raise the success rate (SR) in the real setup by +50%.
  • Figure 3: Comparison of agents: (a) 4 motion commands $\{$FORWARD 25cm, TURN_LEFT $10^{\circ}$, TURN_RIGHT $10^{\circ}$, STOP$\}$, no dynamical model; (b) 28 pairs of instant+constant velocities (no dynamical model); (c) 28 pairs of velocities+identified realistic dynamical model. This agent has been evaluated in the real scenario 4 times with 20 episodes each; we report the mean and std. dev. over these 4 experiments. *training only.
  • Figure 4: Input vs. model sensitivity of two trained agents under disturbance scenarios on HM3D/250. Left: agent "D28-dynamics" (Table \ref{tab:variants}(c)), trained with the dynamical model, is robust to changes in the dynamical model but highly sensitive to odometry. Right: agent "D28-instant" (Table \ref{tab:variants}(b)), trained without the dynamical model, seems to overfit to the simulated "teleportation" behavior. The corrupted environments are the same for both agents and are chosen as disturbances w.r.t. the real dynamics, but $\Delta E$ and $D_{\text{belief}}$ are calculated relative to each agent's own training environment, so they are larger for D28-instant (cf. right figure). An interactive tool with a dynamical model playground is available at europe.naverlabs.com/research/publications/reasoning-in-visual-navigation-of-end-to-end-trained-agents.
  • Figure 5: The causal relationship between environment shifts and the performance of agent $\pi$.
  • ...and 10 more figures
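
The contrast drawn in Figures 3 and 4 between "instant" velocities and a realistic dynamical model can be illustrated with a small sketch. The code below compares an instant response, where the velocity equals the command immediately, with a generic damped second-order response. This is an assumed textbook form, not the model identified in the paper, and the parameters `omega_n` and `zeta` are hypothetical values picked for illustration.

```python
# Illustrative comparison of instant vs. second-order velocity response.
# The second-order form and its parameters are assumptions, not the paper's model.

def instant_velocity(v_cmd, steps):
    """'Teleportation'-style response: velocity equals the command at once."""
    return [v_cmd] * steps


def second_order_velocity(v_cmd, steps, dt=0.01, omega_n=6.0, zeta=0.8):
    """Integrate v'' = omega_n^2 (v_cmd - v) - 2 zeta omega_n v'
    with semi-implicit Euler, starting from rest."""
    v, a = 0.0, 0.0  # velocity and its time derivative
    trace = []
    for _ in range(steps):
        a_dot = omega_n ** 2 * (v_cmd - v) - 2.0 * zeta * omega_n * a
        a += a_dot * dt  # update acceleration first
        v += a * dt      # then velocity, using the new acceleration
        trace.append(v)
    return trace


def demo(v_cmd=1.0, steps=300):
    return instant_velocity(v_cmd, steps), second_order_velocity(v_cmd, steps)
```

In this sketch the second-order velocity lags the command during the first few steps and settles to it only later, whereas the instant model matches the command from the first step. A policy trained only on the instant model never observes that lag.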