SwarmDiffusion: End-To-End Traversability-Guided Diffusion for Embodiment-Agnostic Navigation of Heterogeneous Robots

Iana Zhura, Sausar Karaf, Faryal Batool, Nipun Dhananjaya Weerakkodi Mudalige, Valerii Serpiva, Ali Alridha Abdulkarim, Aleksey Fedoseev, Didar Seyidov, Hajira Amjad, Dzmitry Tsetserukou

TL;DR

SwarmDiffusion is an end-to-end diffusion-based framework that jointly infers traversability and generates feasible trajectories directly from a single RGB image, without demonstrations or explicit planning. By coupling a ViT-based traversability predictor with a diffusion trajectory generator conditioned on embodiment cues, it achieves embodiment-agnostic navigation across quadrupeds and aerial robots, with cross-embodiment transfer requiring only modest additional data. Key contributions include a planner-free trajectory construction pipeline, FiLM-based conditioning, and a lightweight adaptation mechanism that lets new robots reuse learned priors. Results in simulation and real-world tests show 80-100% success rates and real-time inference (~0.09 s), along with generalization to unseen environments and limited-data regimes. The approach offers a scalable, prompt-free pathway to unified traversability reasoning and trajectory synthesis for heterogeneous robotic platforms.

Abstract

Visual traversability estimation is critical for autonomous navigation, but existing VLM-based methods rely on hand-crafted prompts, generalize poorly across embodiments, and output only traversability maps, leaving trajectory generation to slow external planners. We propose SwarmDiffusion, a lightweight end-to-end diffusion model that jointly predicts traversability and generates a feasible trajectory from a single RGB image. To remove the need for annotated or planner-produced paths, we introduce a planner-free trajectory construction pipeline based on randomized waypoint sampling, Bézier smoothing, and regularization enforcing connectivity, safety, directionality, and path thinness. This enables learning stable motion priors without demonstrations. SwarmDiffusion leverages VLM-derived supervision without prompt engineering and conditions the diffusion process on a compact embodiment state, producing physically consistent, traversable paths that transfer across different robot platforms. Across indoor environments and two embodiments (quadruped and aerial), the method achieves 80-100% navigation success and 0.09 s inference, and adapts to a new robot using only ~500 additional visual samples. It generalizes reliably to unseen environments in simulation and real-world trials, offering a scalable, prompt-free approach to unified traversability reasoning and trajectory generation.
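
The planner-free construction step named in the abstract (randomized waypoint sampling followed by Bézier smoothing) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' pipeline: the function names, the 2D setting, the jitter range, and the use of simple rejection against a hypothetical `traversable(x, y)` predicate in place of the paper's regularization terms (connectivity, directionality, and thinness are not modeled).

```python
import random
from math import comb


def sample_control_points(start, goal, n_mid=3, jitter=0.5, rng=None):
    """Place n_mid control points along the start-goal segment and jitter them.

    Hypothetical helper: the abstract does not say how waypoints are sampled.
    """
    rng = rng or random.Random()
    points = [start]
    for i in range(1, n_mid + 1):
        t = i / (n_mid + 1)
        x = start[0] + t * (goal[0] - start[0]) + rng.uniform(-jitter, jitter)
        y = start[1] + t * (goal[1] - start[1]) + rng.uniform(-jitter, jitter)
        points.append((x, y))
    points.append(goal)
    return points


def bezier(control_points, n_samples=32):
    """Evaluate a Bezier curve in Bernstein form at n_samples values of t in [0, 1]."""
    n = len(control_points) - 1
    curve = []
    for k in range(n_samples):
        t = k / (n_samples - 1)
        x = y = 0.0
        for i, (px, py) in enumerate(control_points):
            weight = comb(n, i) * (t ** i) * ((1 - t) ** (n - i))
            x += weight * px
            y += weight * py
        curve.append((x, y))
    return curve


def construct_trajectory(start, goal, traversable, max_tries=100, seed=0):
    """Sample smoothed candidates until every point is traversable.

    Returns None if no candidate passes within max_tries.
    """
    rng = random.Random(seed)
    for _ in range(max_tries):
        curve = bezier(sample_control_points(start, goal, rng=rng))
        if all(traversable(x, y) for x, y in curve):
            return curve
    return None
```

A Bézier curve always passes through its first and last control points, so each candidate is guaranteed to connect start and goal; only the interior shape varies between samples.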

Paper Structure

This paper contains 55 sections, 32 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Cross-embodiment traversability transfer and feasible trajectory generation. The legged robot (source) explores the environment and builds a traversability prior (blue path), which the aerial robot (target) uses for safe navigation through the shared workspace (green volume).
  • Figure 2: The proposed model consists of two interconnected components: (1) Traversability Student model, where a frozen visual encoder and state encoder jointly modulate features via FiLM to produce a traversability prediction distilled from a vision–language model (VLM); and (2) Diffusion-based Trajectory Generation, where the UNet progressively denoises a random trajectory $x_t$ conditioned on the modulated visual features and start–goal vector, yielding feasible and safe paths $x_0$. The process repeats for $N$ denoising steps.
  • Figure 3: UAV simulation architecture.
  • Figure 4: Qualitative comparison between baseline and our method across drone and quadruped platforms. Rows show (i) baseline trajectories, (ii) baseline traversability, (iii) our trajectories, and (iv) our predicted traversability maps. Our model exhibits smoother, safer, and embodiment-aware path generation.
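
The FiLM modulation mentioned in the Figure 2 caption can be sketched as follows. FiLM (feature-wise linear modulation) scales and shifts each feature channel by values predicted from a conditioning input. In this sketch the conditioning input is the embodiment state vector; the single linear layer used as a state encoder, the weight shapes, and all function names are assumptions for illustration and are not taken from the paper.

```python
def linear(weights, bias, x):
    """Affine map: one output per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(features, state, w_gamma, b_gamma, w_beta, b_beta):
    """Apply gamma(state) * feature + beta(state) to every token, channel by channel.

    features: list of tokens, each a list of C channel values.
    state:    embodiment state vector (hypothetical layout).
    """
    gamma = linear(w_gamma, b_gamma, state)  # per-channel scale
    beta = linear(w_beta, b_beta, state)     # per-channel shift
    return [[g * f + b for f, g, b in zip(token, gamma, beta)] for token in features]


# Two tokens with two channels each, modulated by a one-dimensional state.
features = [[1.0, 2.0], [3.0, 4.0]]
state = [0.5]
modulated = film(
    features, state,
    w_gamma=[[2.0], [0.0]], b_gamma=[0.0, 1.0],  # gamma = [1.0, 1.0]
    w_beta=[[0.0], [2.0]], b_beta=[0.0, 0.0],    # beta  = [0.0, 1.0]
)
```

Because the visual encoder is frozen, the scale and shift are the only path through which the embodiment state changes the visual features, which is one plausible reason adaptation to a new robot can be lightweight.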