Flying in Clutter on Monocular RGB by Learning in 3D Radiance Fields with Domain Adaptation

Xijie Huang, Jinhan Li, Tianyue Wu, Xin Zhou, Zhichao Han, Fei Gao

TL;DR

This work tackles autonomous UAV navigation in clutter using only monocular RGB input by learning policies in photorealistic 3D Gaussian Splatting (3DGS) environments and bridging the sim-to-real gap with adversarial domain adaptation and domain randomization. It introduces an end-to-end RGB-based RL framework with an actor-critic architecture and a depth-privileged critic, paired with accelerated 3DGS rendering via pruning. The method demonstrates zero-shot transfer to real-world flights under varying obstacle layouts and illumination, supported by ablations and latent-space analyses that clarify the roles of DA and DR in reducing domain shift. The results indicate a practical pathway for monocular RGB navigation on lightweight UAVs and point toward scaling 3DGS-based training to diverse, large-scale datasets and ecosystem-level deployment.

Abstract

Modern autonomous navigation systems predominantly rely on lidar and depth cameras. However, a fundamental question remains: Can flying robots navigate in clutter using solely monocular RGB images? Given the prohibitive costs of real-world data collection, learning policies in simulation offers a promising path. Yet, deploying such policies directly in the physical world is hindered by the significant sim-to-real perception gap. Thus, we propose a framework that couples the photorealism of 3D Gaussian Splatting (3DGS) environments with Adversarial Domain Adaptation. By training in high-fidelity simulation while explicitly minimizing feature discrepancy, our method ensures the policy relies on domain-invariant cues. Experimental results demonstrate that our policy achieves robust zero-shot transfer to the physical world, enabling safe and agile flight in unstructured environments with varying illumination.
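
The abstract does not say how the feature discrepancy is minimized. As an illustration only, the sketch below assumes one common form of adversarial domain adaptation: a domain discriminator is trained to tell simulated (3DGS) features from real ones, and the encoder receives the negated gradient of that loss (gradient reversal), which pushes it toward features the discriminator cannot separate. The linear discriminator, the function names, and the `lam` weight are all hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def domain_bce(logit, label):
    """Binary cross-entropy on a domain logit (label 1 = simulation, 0 = real)."""
    # softplus(logit) - label * logit equals -[y log p + (1 - y) log(1 - p)]
    softplus = max(logit, 0.0) + math.log1p(math.exp(-abs(logit)))
    return softplus - label * logit


def adversarial_step(feature, label, w, b, lam=1.0):
    """One forward/backward pass of a linear domain discriminator (assumed form).

    Returns the discriminator loss, the gradients for the discriminator
    parameters (w, b), and the reversed gradient handed to the encoder.
    """
    logit = sum(wi * fi for wi, fi in zip(w, feature)) + b
    loss = domain_bce(logit, label)
    dlogit = sigmoid(logit) - label  # d(loss) / d(logit)
    grad_w = [dlogit * fi for fi in feature]  # discriminator descends on these
    grad_b = dlogit
    # Gradient reversal: the encoder sees the negated, lam-scaled gradient,
    # so descending on it increases the discriminator's loss.
    grad_feature = [-lam * dlogit * wi for wi in w]
    return loss, grad_w, grad_b, grad_feature
```

Calling `adversarial_step(feature, label, w, b)` once per simulated or real sample yields both update directions from a single pass. In a full training loop the reversed feature gradient would be backpropagated into the image encoder alongside the RL objective.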

Paper Structure

This paper contains 26 sections, 9 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Overview of our RGB-based navigation framework. The top panels demonstrate that our method effectively mitigates the sim-to-real gap. The bottom panel illustrates the pipeline of 3D environment construction and domain adaptation.
  • Figure 2: Pipeline for constructing the 3DGS-based simulation environment. To accelerate rendering, we employ Speedy Splat for model pruning and utilize gsplat as a parallelized rasterization backend. The aligned point cloud maps are then imported into Isaac Sim to enable depth rendering and collision detection.
  • Figure 3: Overview of the proposed RL training framework. The architecture features an asymmetric actor-critic structure and employs an adversarial domain adaptation module to bridge the sim-to-real gap.
  • Figure 4: Training curves for ablation studies. We compare the mean reward convergence across different policy inputs and sim-to-real strategies. Abbreviations: DA (Domain Adaptation), DR (Visual Domain Randomization), and DY (Dynamic Domain Randomization). Note: DY is applied as a fundamental component to mitigate the dynamics gap between simulation and reality. The DA+DY, DR+DY, and DY variants are trained based on the Proposed configuration, while DR+DA+DY is trained based on DR+DY.
  • Figure 5: t-SNE visualization of the latent feature space. The left column evaluates domain alignment between 3DGS and Real-world inputs (measured by GSI), while the right column illustrates feature discriminability across different data-augmented environments (measured by Classification Accuracy).
  • ...and 2 more figures