SOUS VIDE: Cooking Visual Drone Navigation Policies in a Gaussian Splatting Vacuum
JunEn Low, Maximilian Adang, Javier Yu, Keiko Nagami, Mac Schwager
TL;DR
The paper tackles the sim-to-real gap in end-to-end visuomotor drone navigation by introducing SOUS VIDE. SOUS VIDE pairs FiGS, a GSplat-based photorealistic simulator, with an MPC expert to generate large-scale demonstrations, then distills them into SV-Net, an onboard policy that maps images and observable state to low-level thrust and body-rate commands. A Rapid Motor Adaptation module lets the policy adjust online to changing flight dynamics, which improves robustness to disturbances such as mass changes and wind; the policies also tolerate lighting variations. The approach achieves zero-shot sim-to-real transfer, validated in 105 hardware flights across diverse scenes and conditions, and performs well on extended and novel trajectories, with limited degradation under challenging scenarios. The work highlights the potential of combining GSplat-based data synthesis with lightweight, adaptive onboard policies to bridge simulation and real-world robotic autonomy, and it outlines future directions toward generalist navigation and semantic goal understanding.
Abstract
We propose a new simulator, training approach, and policy architecture, collectively called SOUS VIDE, for end-to-end visual drone navigation. Our trained policies exhibit zero-shot sim-to-real transfer with robust real-world performance using only onboard perception and computation. Our simulator, called FiGS, couples a computationally simple drone dynamics model with a high visual fidelity Gaussian Splatting scene reconstruction. FiGS can quickly simulate drone flights, producing photorealistic images at up to 130 fps. We use FiGS to collect 100k to 300k image/state-action pairs from an expert MPC with privileged state and dynamics information, randomized over dynamics parameters and spatial disturbances. We then distill this expert MPC into an end-to-end visuomotor policy with a lightweight neural architecture, called SV-Net. SV-Net processes color image, optical flow, and IMU data streams into low-level thrust and body rate commands at 20 Hz onboard a drone. Crucially, SV-Net includes a learned module for low-level control that adapts at runtime to variations in drone dynamics. In a campaign of 105 hardware experiments, we show SOUS VIDE policies to be robust to 30% mass variations, 40 m/s wind gusts, 60% changes in ambient brightness, shifting or removing objects from the scene, and people moving aggressively through the drone's visual field. Code, data, and experiment videos can be found on our project page: https://stanfordmsl.github.io/SousVide/.
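The collect-then-distill recipe described in the abstract can be illustrated with a deliberately tiny sketch. It is not the SOUS VIDE code: a 1-D altitude model stands in for the FiGS dynamics, a PD controller with access to the true mass stands in for the privileged MPC expert, and a least-squares linear map stands in for SV-Net. All function names and parameter values below are hypothetical, chosen only to show the shape of the pipeline: randomize the dynamics, roll out the expert, record observation/action pairs that omit the privileged information, and fit a student on them.

```python
import numpy as np

rng = np.random.default_rng(0)
G = 9.81  # gravitational acceleration, m/s^2


def expert_action(state, mass, goal=1.0, kp=4.0, kd=3.0):
    """Privileged expert: PD altitude controller that knows the true mass.
    Hypothetical stand-in for the MPC expert mentioned in the abstract."""
    z, vz = state
    accel_cmd = kp * (goal - z) - kd * vz
    return mass * (G + accel_cmd)  # thrust in newtons


def step(state, thrust, mass, dt=0.05):
    """Explicit-Euler step of a 1-D point-mass altitude model (toy dynamics)."""
    z, vz = state
    az = thrust / mass - G
    return np.array([z + vz * dt, vz + az * dt])


def collect_rollouts(n_rollouts=200, horizon=40, nominal_mass=1.0, mass_spread=0.3):
    """Roll out the expert under randomized mass and initial state,
    recording (observation, action) pairs for the student."""
    obs, acts = [], []
    for _ in range(n_rollouts):
        # Domain randomization: mass varies by +/-30% around nominal (assumed range).
        mass = nominal_mass * rng.uniform(1 - mass_spread, 1 + mass_spread)
        state = np.array([rng.uniform(0.0, 2.0), rng.uniform(-0.5, 0.5)])
        for _ in range(horizon):
            u = expert_action(state, mass)
            # The student observes state plus a bias term, but never the mass.
            obs.append([state[0], state[1], 1.0])
            acts.append(u)
            state = step(state, u, mass)
    return np.array(obs), np.array(acts)


def fit_student(obs, acts):
    """Behavior cloning by linear least squares (stand-in for training SV-Net)."""
    weights, *_ = np.linalg.lstsq(obs, acts, rcond=None)
    return weights


obs, acts = collect_rollouts()
weights = fit_student(obs, acts)
hover_thrust = float(np.array([1.0, 0.0, 1.0]) @ weights)  # student output at the goal
```

Because this student never sees the mass, it can only reproduce the expert's behavior for an average vehicle. That limitation is the motivation for a runtime adaptation module of the kind the abstract describes, which this sketch does not attempt to model.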
