UAVTwin: Neural Digital Twins for UAVs using Gaussian Splatting

Jaehoon Choi, Dongki Jung, Yonghan Lee, Sungmin Eum, Dinesh Manocha, Heesung Kwon

TL;DR

UAVTwin builds photorealistic digital twins from real-world UAV footage to improve UAV perception. It uses Multi-sequence Gaussian Splatting (MsGS) to reconstruct backgrounds with large appearance variation, and a mask refinement step to handle dynamic objects. Foreground humans are inserted via Blender with synthetic trajectories and motion from AMASS/SynBody, which enables realistic scene composition and rich ground-truth annotations, while a two-stage training strategy aligns Gaussians with geometry and improves novel-view rendering. Quantitatively, the method reports a 1.23 dB PSNR improvement over recent methods and a 2.5% to 13.7% mAP improvement in human detection when the augmented data are used for training, though a domain gap remains between synthetic humans and real data. The framework offers a practical way to generate diverse, labeled UAV data for perception tasks; future work aims to narrow the remaining domain gap through more realistic avatars and insertion techniques.

Abstract

We present UAVTwin, a method for creating digital twins from real-world environments and facilitating data augmentation for training downstream models embedded in unmanned aerial vehicles (UAVs). Specifically, our approach focuses on synthesizing foreground components, such as various human instances in motion within complex scene backgrounds, from UAV perspectives. This is achieved by integrating 3D Gaussian Splatting (3DGS) for reconstructing backgrounds along with controllable synthetic human models that display diverse appearances and actions in multiple poses. To the best of our knowledge, UAVTwin is the first approach for UAV-based perception that is capable of generating high-fidelity digital twins based on 3DGS. The proposed work significantly enhances downstream models through data augmentation for real-world environments with multiple dynamic objects and significant appearance variations, both of which typically introduce artifacts in 3DGS-based modeling. To tackle these challenges, we propose a novel appearance modeling strategy and a mask refinement module to enhance the training of 3D Gaussian Splatting. We demonstrate the high quality of neural rendering by achieving a 1.23 dB improvement in PSNR compared to recent methods. Furthermore, we validate the effectiveness of data augmentation by showing a 2.5% to 13.7% improvement in mAP for the human detection task.
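
To give a sense of scale for the reported PSNR gain, the short sketch below converts a decibel improvement into the corresponding ratio of mean squared errors. It uses only the standard definition PSNR = 10 log10(MAX^2 / MSE); the helper names and the baseline error value are illustrative assumptions, not taken from the paper. It also assumes both methods are scored against the same peak value and that the gain applies to a single MSE figure, which is only approximately true when PSNR is averaged over many images.

```python
import math


def psnr(mse: float, max_val: float = 1.0) -> float:
    """Standard PSNR in dB for a given mean squared error and peak value."""
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Ratio new_mse / old_mse implied by a PSNR gain of gain_db decibels."""
    return 10.0 ** (-gain_db / 10.0)


# Hypothetical baseline error, chosen only so the arithmetic is easy to follow.
baseline_mse = 0.01
ratio = mse_ratio_for_gain(1.23)
improved_mse = baseline_mse * ratio

print(f"MSE ratio for +1.23 dB: {ratio:.3f}")
print(f"PSNR before: {psnr(baseline_mse):.2f} dB, after: {psnr(improved_mse):.2f} dB")
```

Under these assumptions, a 1.23 dB gain corresponds to roughly a quarter less mean squared error at the same peak value.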

Paper Structure

This paper contains 28 sections, 12 equations, 13 figures, 10 tables, 1 algorithm.

Figures (13)

  • Figure 1: UAVTwin processes video captured by the UAV as input, enabling data generation for training UAV-based human recognition methods.
  • Figure 2: UAVTwin framework. In Section \ref{sec:buildingdigialtwin}, our approach first constructs a digital twin using UAV-based images captured at different times. We introduce MsGS, a novel 3DGS method to analyze varying appearance images and reconstruct a clean mesh, Gaussian splats, and an MLP for novel-view synthesis. Then, in Section \ref{sec:neuraldatageneration}, our method generates data by compositing foreground humans rendered in Blender with backgrounds rendered using trained Gaussian splats.
  • Figure 3: Mask Refinement. (a) shows an example training image with dynamic objects. (b) shows its segmentation masks $\overline{M}$ from GroundingSAM [ren2024grounded]. (c) shows the entity segmentation masks $\hat{M}$ [otonari2024entity]. Based on the SAM masks in (b), we add entity masks with high photometric loss (red dotted boxes) and remove those with low photometric loss (blue dotted boxes), resulting in the refined masks shown in (e).
  • Figure 4: The core components of MsGS. From a video sequence captured at different times with varying appearances, we define a sequence embedding $q_{i}$. For novel-view image rendering, we select a specific sequence. Our color MLP $f$ takes as input the sequence embedding, viewing direction embedding, per-Gaussian embedding, and base color. The output color is modulated to account for appearance variations.
  • Figure 5: Qualitative Results of Synthesized Data. The first row illustrates the camera trajectory, while the second row presents a sample training image from the sequence used as a per-sequence embedding input for rendering our MsGS. Data is generated using different trajectories with sequence embeddings extracted from training data: (a)+(e) orbit (Noon 1.2.2), (b)+(f) altitude varying (Morning 2.1.1), (c)+(g) yaw rotation (Morning 2.1.7), and (d)+(h) translational (Noon 2.2.2).
  • ...and 8 more figures
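
The mask refinement rule described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `refine_mask`, the thresholds `tau_high` and `tau_low`, and the use of the mean per-pixel photometric loss inside each entity are all assumptions made for the example. The caption only states that entity masks with high photometric loss are added to the SAM masks and those with low photometric loss are removed.

```python
import numpy as np


def refine_mask(sam_mask, entity_masks, photometric_loss,
                tau_high=0.3, tau_low=0.05):
    """Refine a dynamic-object mask using per-entity photometric loss.

    sam_mask:         (H, W) bool array, initial mask (True = masked out).
    entity_masks:     list of (H, W) bool arrays, one per entity segment.
    photometric_loss: (H, W) float array, per-pixel rendering error.
    tau_high/tau_low: hypothetical thresholds, not taken from the paper.
    """
    refined = sam_mask.copy()
    for entity in entity_masks:
        if not entity.any():
            continue  # skip empty segments
        mean_loss = photometric_loss[entity].mean()
        if mean_loss > tau_high:
            refined |= entity       # likely dynamic: add to the mask
        elif mean_loss < tau_low:
            refined &= ~entity      # likely static: remove from the mask
    return refined


# Toy 4x4 example with three 2x2 entities.
loss = np.full((4, 4), 0.1)
loss[:2, :2] = 0.01   # well-reconstructed region (low loss)
loss[2:, 2:] = 0.5    # poorly reconstructed region (high loss)

sam = np.zeros((4, 4), dtype=bool)
sam[:2, :2] = True    # initial mask covers the low-loss region

entities = []
for rows, cols in [(slice(0, 2), slice(0, 2)),
                   (slice(2, 4), slice(2, 4)),
                   (slice(0, 2), slice(2, 4))]:
    e = np.zeros((4, 4), dtype=bool)
    e[rows, cols] = True
    entities.append(e)

refined = refine_mask(sam, entities, loss)
print(refined.astype(int))
```

In this toy case the low-loss entity is dropped from the mask, the high-loss entity is added, and the entity with intermediate loss is left unchanged.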