SOUS VIDE: Cooking Visual Drone Navigation Policies in a Gaussian Splatting Vacuum

JunEn Low, Maximilian Adang, Javier Yu, Keiko Nagami, Mac Schwager

TL;DR

The paper tackles the sim-to-real gap in end-to-end visuomotor drone navigation by introducing SOUS VIDE, which combines FiGS, a GSplat-based photorealistic simulator, with an MPC expert to generate large-scale demonstrations and distill them into SV-Net, an onboard policy that maps images and observable state to low-level thrust and body-rate commands. A Rapid Motor Adaptation (RMA) module adapts online to changes in flight dynamics such as mass variations and wind, and the trained policies also tolerate lighting variations. The approach achieves zero-shot sim-to-real transfer, validated over 105 hardware flights across diverse scenes and conditions, and demonstrates strong performance on extended and novel trajectories, with limited degradation under challenging scenarios. The work highlights the potential of GSplat-based data synthesis combined with lightweight, adaptive onboard policies to bridge simulation and real-world robotic autonomy, while outlining directions for generalist navigation and semantic goal understanding in future work.

Abstract

We propose a new simulator, training approach, and policy architecture, collectively called SOUS VIDE, for end-to-end visual drone navigation. Our trained policies exhibit zero-shot sim-to-real transfer with robust real-world performance using only onboard perception and computation. Our simulator, called FiGS, couples a computationally simple drone dynamics model with a high visual fidelity Gaussian Splatting scene reconstruction. FiGS can quickly simulate drone flights producing photorealistic images at up to 130 fps. We use FiGS to collect 100k-300k image/state-action pairs from an expert MPC with privileged state and dynamics information, randomized over dynamics parameters and spatial disturbances. We then distill this expert MPC into an end-to-end visuomotor policy with a lightweight neural architecture, called SV-Net. SV-Net processes color image, optical flow and IMU data streams into low-level thrust and body rate commands at 20 Hz onboard a drone. Crucially, SV-Net includes a learned module for low-level control that adapts at runtime to variations in drone dynamics. In a campaign of 105 hardware experiments, we show SOUS VIDE policies to be robust to 30% mass variations, 40 m/s wind gusts, 60% changes in ambient brightness, shifting or removing objects from the scene, and people moving aggressively through the drone's visual field. Code, data, and experiment videos can be found on our project page: https://stanfordmsl.github.io/SousVide/.
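As a rough illustration of the data flow the abstract describes (image and state streams in, thrust and body-rate commands out, with a learned module that summarizes recent states), the following is a minimal NumPy sketch. It is not the authors' SV-Net: the layer sizes, the use of plain fully connected layers in place of a convolutional backbone, the random weights, and the names `mlp`, `forward`, and `policy_step` are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """Random weights for a small fully connected network (illustration only)."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Apply ReLU on hidden layers and a linear output layer."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

# All dimensions are hypothetical, chosen only to keep the example small.
IMG_FEAT, STATE_DIM, HIST_LEN, LATENT_DIM = 32, 10, 5, 4

# Stand-in for an image feature extractor (a real one would be convolutional).
feature_extractor = mlp([16 * 16 * 3, 64, IMG_FEAT])
# Maps a short window of observable states to a latent describing the dynamics.
history_network = mlp([HIST_LEN * STATE_DIM, 32, LATENT_DIM])
# Fuses visual features, the latent, and the current state into a command.
command_network = mlp([IMG_FEAT + LATENT_DIM + STATE_DIM, 64, 4])

def policy_step(image, state_history):
    """Map one image and a window of states to [thrust, wx, wy, wz]."""
    visual = forward(feature_extractor, image.reshape(-1))
    latent = forward(history_network, state_history.reshape(-1))
    current_state = state_history[-1]
    fused = np.concatenate([visual, latent, current_state])
    return forward(command_network, fused)

image = rng.random((16, 16, 3))
state_history = rng.random((HIST_LEN, STATE_DIM))
command = policy_step(image, state_history)
print(command.shape)  # (4,)
```

In a trained policy the weights would come from imitating the expert's commands on simulated data; here they are random, so only the shapes and the wiring are meaningful.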

Paper Structure

This paper contains 11 sections, 5 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: SOUS VIDE overview: We train our FiGS simulator from a handheld camera. We use FiGS to generate flight demonstrations (image/state-action pairs) from an MPC expert with privileged information, randomized over dynamics parameters and positional disturbances. We use this data to train our policy, SV-Net, which operates solely with onboard observations.
  • Figure 2: Dynamic rollout of 50 data samples. At each time step, the update function $f_d$ simulates the solution from the MPC expert, while the transform $T_\mathcal{C}^\mathcal{B}$ is used to extract the corresponding camera image $I$ from the GSplat.
  • Figure 3: SV-Net consists of three components: a feature extractor that processes visual information from color images, a history network that uses an RMA technique to adapt to variations in dynamics through a history of observable states, and a command network that integrates the outputs of these components with observable states to generate body-rate commands.
  • Figure 4: Clockwise from top left: 1) Desired trajectory in the scene's GSplat with corresponding real-world First-Person-View (FPV) of key objects. 2) Drone hardware and frames $(\mathcal{W},\mathcal{B},\mathcal{C})$. We use an Orin Nano and PixRacer Pro for control, while sensing is handled by the PixRacer's IMU, an ARK Flow sensor, and the D435's monocular camera. Motion capture markers provide ground truth. 3) 3D position and velocity performance of the policies in Section \ref{ssec:architecture_experiments}.
  • Figure 5: SV-Net history network's estimate of $\hat{c}$ with mean $\mu_{\hat{c}}$ overlaid for the Section \ref{ssec:architecture_experiments} flights in simulation (left) and in the real world (right).
  • ...and 4 more figures
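
The rollout described in the Figure 2 caption (a dynamics update $f_d$ driven by the expert, with a body-to-camera transform used to place the camera for each rendered image) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the FiGS implementation: the point-mass dynamics, the PD controller standing in for the MPC expert, the identity attitude, the 10 cm camera offset, the mass-scale range, and the stub `render` function are all hypothetical.

```python
import numpy as np

GRAVITY = np.array([0.0, 0.0, -9.81])

def f_d(pos, vel, accel, dt):
    """Explicit Euler update of a point mass (a simplification of a drone model)."""
    return pos + vel * dt, vel + (accel + GRAVITY) * dt

def expert(pos, vel, goal, kp=2.0, kd=2.5):
    """PD stand-in for the MPC expert: gravity-compensated acceleration command."""
    return kp * (goal - pos) - kd * vel - GRAVITY

def pose_matrix(position):
    """Homogeneous transform with identity rotation (attitude ignored here)."""
    T = np.eye(4)
    T[:3, 3] = position
    return T

# Fixed camera pose in the body frame: a hypothetical 10 cm forward offset.
T_BODY_CAM = pose_matrix(np.array([0.10, 0.0, 0.0]))

def render(T_world_cam):
    """Placeholder for a GSplat render call; returns a blank image."""
    return np.zeros((4, 4, 3))

def rollout(start, goal, steps=50, dt=0.05, seed=0):
    """Collect (image, state, action) samples from one randomized flight."""
    rng = np.random.default_rng(seed)
    mass_scale = rng.uniform(0.7, 1.3)  # randomized dynamics parameter
    pos, vel = start.astype(float).copy(), np.zeros(3)
    samples = []
    for _ in range(steps):
        action = expert(pos, vel, goal)
        # Compose world-from-body with body-from-camera to pose the camera.
        image = render(pose_matrix(pos) @ T_BODY_CAM)
        samples.append((image, np.concatenate([pos, vel]), action))
        # The expert assumes nominal mass; the simulated vehicle does not.
        pos, vel = f_d(pos, vel, action / mass_scale, dt)
    return samples, pos

samples, final_pos = rollout(np.array([0.0, 0.0, 1.0]), np.array([2.0, 0.0, 1.0]))
print(len(samples))  # 50
```

Each sample pairs what the policy would observe (the image and state) with what the expert commanded, which is the form of data a behavior-cloning loss needs. Varying `seed` changes the sampled mass scale, giving a crude version of the randomization over dynamics parameters mentioned in the Figure 1 caption.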