Embodiment-Agnostic Navigation Policy Trained with Visual Demonstrations

Nimrod Curtis, Osher Azulay, Avishai Sintov

TL;DR

ViDEN is an embodiment-agnostic navigation framework trained on depth-based visual demonstrations for collision-free pursuit of dynamic targets. It combines a diffusion-based behavior cloning policy with a compact depth-driven state representation, so it can track a target without robot-specific topologies or pre-defined target RGB images. The approach is data efficient (≈1.5 hours of robot-independent demonstrations) and generalizes well, including zero-shot transfer and effective fine-tuning with modest additional data. The result is a practical, scalable navigation policy for diverse robots in indoor and outdoor environments, with open-source code for benchmarking and extending the method.

Abstract

Learning to navigate in unstructured environments is a challenging task for robots. While reinforcement learning can be effective, it often requires extensive data collection and can pose safety risks. Learning from expert demonstrations, on the other hand, offers a more efficient approach. However, many existing methods rely on specific robot embodiments and pre-specified target images, and require large datasets. We propose Visual Demonstration-based Embodiment-agnostic Navigation (ViDEN), a novel framework that leverages visual demonstrations to train embodiment-agnostic navigation policies. ViDEN uses depth images to reduce input dimensionality and relies on relative target positions, making it more adaptable to diverse environments. By training a diffusion-based policy on task-centric and embodiment-agnostic demonstrations, ViDEN can generate collision-free and adaptive trajectories in real time. Our experiments on human reaching and tracking demonstrate that ViDEN outperforms existing methods, requiring only a small amount of data while achieving superior performance in various indoor and outdoor navigation scenarios. Project website: https://nimicurtis.github.io/ViDEN/.

Paper Structure

This paper contains 12 sections, 2 equations, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: (Top) Task-centric and embodiment-agnostic demonstrations are collected without any robot using a hand-held depth camera. The demonstration data is used to train the ViDEN policy for collision-free navigation to a target object. (Bottom) The trained policy can be deployed on any robot to navigate between obstacles and reach the target.
  • Figure 2: Architecture of the proposed Visual Demonstration-based Embodiment-agnostic Navigation (ViDEN) framework. The observed RGB image $\mathbf{I}_t$ is used to detect an object of interest and derive the target $\mathbf{s}_t$ to reach with respect to the object. The depth information $\mathbf{J}_t$ from the camera is used by the diffusion policy to generate an action trajectory $\tilde{\tau}_t$ to reach the target. The trajectory is conditioned on intermediate goals $\mathbf{g}_{t^*}$, easing the tracking of dynamic targets.
  • Figure 3: Policy deployment in a hard-level outdoor environment. The robot avoids the obstacles and reaches the human target.
  • Figure 4: Policy deployment in a hard-level outdoor environment with a dynamic human target.
  • Figure 5: Demonstrations of various challenging scenarios imposed on the robot, including (a) a low-light condition, (b) a physical perturbation of the robot and (c) an obstacle pushed into the robot's path.
  • ...and 2 more figures
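
The data flow described in the Figure 2 caption (a diffusion policy that turns depth information and a relative target into an action trajectory) can be illustrated with a minimal sketch of conditional DDPM-style sampling. This is not the authors' implementation. The linear noise schedule, the horizon, the step count, the 16-dimensional depth feature and the function names are all assumptions made for illustration, and `toy_noise_predictor` is a hypothetical stand-in for the trained network: it assumes the clean trajectory is a straight line to the target and ignores the depth feature entirely, so it performs no obstacle avoidance.

```python
import numpy as np

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an assumed choice, not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)

def toy_noise_predictor(noisy_traj, k, alpha_bars, depth_feat, target):
    """Hypothetical stand-in for the trained denoising network.

    It assumes the clean trajectory is a straight line from the camera
    origin to the target and returns the noise implied by that guess.
    depth_feat is accepted to mirror the conditioning in Figure 2 but is
    ignored here; a real network would use it to steer around obstacles.
    """
    horizon = noisy_traj.shape[0]
    clean = np.linspace(0.0, 1.0, horizon)[:, None] * target[None, :]
    return (noisy_traj - np.sqrt(alpha_bars[k]) * clean) / np.sqrt(1.0 - alpha_bars[k])

def sample_trajectory(depth_feat, target, horizon=8, num_steps=50, seed=0):
    """Start from Gaussian noise and denoise it into a waypoint trajectory."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    traj = rng.standard_normal((horizon, target.shape[0]))
    for k in reversed(range(num_steps)):
        eps = toy_noise_predictor(traj, k, alpha_bars, depth_feat, target)
        # Standard DDPM posterior mean for the previous (less noisy) step.
        traj = (traj - betas[k] / np.sqrt(1.0 - alpha_bars[k]) * eps) / np.sqrt(alphas[k])
        if k > 0:
            # Inject noise on every step except the final one.
            traj = traj + np.sqrt(betas[k]) * rng.standard_normal(traj.shape)
    return traj

depth_feat = np.zeros(16)        # placeholder for an encoded depth image
target = np.array([2.0, 0.5])    # relative target position in the camera frame
trajectory = sample_trajectory(depth_feat, target)
```

With this stand-in predictor, the sampled waypoints end at the relative target. In a trained policy, the same loop would call a learned network conditioned on the depth encoding, the target and the intermediate goals.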