SAGE: Semantic-Driven Adaptive Gaussian Splatting in Extended Reality

Chiara Schiavo, Elena Camuffo, Leonardo Badia, Simone Milani

TL;DR

SAGE addresses XR rendering under tight memory and compute budgets by fusing semantic segmentation with 3D Gaussian Splatting, performing per-semantic-category level-of-detail optimization guided by a target $\text{SSIM}_t$. The method maps 2D semantic labels onto a Structure-from-Motion point cloud and solves the constrained optimization $\min_{i} \sum_l N_l(i) \;\text{s.t.}\; \text{SSIM}_{l,i}(d_{\min,l}) \ge \text{SSIM}_t$, using a distance-aware, piecewise exponential model for $\text{SSIM}_{l,i}(d_{\min,l})$ whose parameters are fitted per label. Evaluations on the Mip-NeRF360 dataset show that SAGE substantially reduces the number of Gaussians and the memory footprint while maintaining comparable visual quality, with per-label iterations that transfer across scenes and robust cross-view performance. The approach enables scalable, real-time XR rendering by prioritizing resources where semantic importance and viewpoint proximity demand higher fidelity.
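The constrained optimization above can be illustrated with a minimal sketch. It reads $N_l(i)$ as the number of Gaussians for label $l$ at iteration $i$, which is an interpretation of the summary rather than a stated definition. Because the objective is a sum over labels and the constraint applies to each label separately, the problem decouples: each label can independently keep its cheapest checkpoint that meets the target. The function names, the checkpoint tuple layout, and all numbers below are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical record: (iteration, number of Gaussians, SSIM at the label's minimum distance).
Checkpoint = Tuple[int, int, float]


def select_checkpoints(
    curves: Dict[str, List[Checkpoint]], ssim_target: float
) -> Dict[str, Optional[Checkpoint]]:
    """For each label, keep the feasible checkpoint with the fewest Gaussians.

    Returns None for a label when no checkpoint reaches the target.
    """
    chosen: Dict[str, Optional[Checkpoint]] = {}
    for label, checkpoints in curves.items():
        # Feasible means the per-label SSIM constraint is satisfied.
        feasible = [c for c in checkpoints if c[2] >= ssim_target]
        # Minimize the Gaussian count; break ties by the earlier iteration.
        chosen[label] = min(feasible, key=lambda c: (c[1], c[0])) if feasible else None
    return chosen


def total_gaussians(chosen: Dict[str, Optional[Checkpoint]]) -> int:
    """Sum the Gaussian counts of the selected checkpoints."""
    return sum(c[1] for c in chosen.values() if c is not None)


# Made-up measurements, for illustration only.
curves = {
    "bench": [(1000, 500, 0.70), (3000, 900, 0.82), (7000, 1500, 0.90)],
    "sky-other-merged": [(1000, 200, 0.88), (3000, 350, 0.93)],
}
chosen = select_checkpoints(curves, ssim_target=0.80)
print(chosen, total_gaussians(chosen))
```

In this toy input the smooth label meets the target at its first checkpoint while the more detailed one needs a later checkpoint, which is the kind of per-category saving the summary describes.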

Abstract

3D Gaussian Splatting (3DGS) has significantly improved the efficiency and realism of three-dimensional scene visualization in several applications, ranging from robotics to eXtended Reality (XR). This work presents SAGE (Semantic-Driven Adaptive Gaussian Splatting in Extended Reality), a novel framework designed to enhance the user experience by dynamically adapting the Level of Detail (LOD) of different 3DGS objects identified via semantic segmentation. Experimental results demonstrate that SAGE effectively reduces memory and computational overhead while maintaining a desired target visual quality, thus providing a powerful optimization for interactive XR applications.

Paper Structure

This paper contains 8 sections, 2 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: SAGE pipeline. Starting from a set of 2D views $V$, SAGE retrieves the 2D semantics using DeepLabV2. In parallel, it constructs the Structure-from-Motion (SfM) point cloud, as in standard 3DGS, and then processes the SfM point cloud at increasing resolution as the iteration $i$ advances. Unlike standard 3DGS, SAGE uses the semantic masks of the 2D views to partition the 3D point cloud and optimize each semantic category selectively. Given a target quality (SSIM$_t$), the optimization of each semantic category stops once that target value is reached. The final render from a selected viewpoint $v$ is obtained by composing the scene categories, each optimized separately for the target quality.
  • Figure 2: Mean SSIM over training iterations for individual scene components of scene "bicycle". Variations in optimization performance across semantic categories are visible. Highly textured content (e.g., grass-merged) shows lower overall quality than smooth areas (e.g., sky-other-merged).
  • Figure 3: SSIM as a function of the minimum distance for the semantic labels bench, bicycle and pavement-merged. Each curve represents data collected at a different iteration $i$ of 3DGS, with experimental data (dots) and fitted trends (lines). The SSIM generally increases with distance, reaching a peak before stabilizing or declining, with higher iterations showing improved reconstruction quality and smoother trends.
  • Figure 4: Qualitative results on view "_DSC8719" of scene "bicycle".
  • Figure 5: Qualitative cross-view results on "bicycle" and cross-scene results on "garden".
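
The Figure 1 caption says the 2D semantic masks are used to partition the 3D point cloud. One plausible way to do this, sketched below, is to project each SfM point into every view with a pinhole camera and take a majority vote over the labels it lands on. The pinhole model, the voting rule, and every name here (`View`, `project`, `label_points`) are assumptions made for illustration; the summary does not specify how SAGE performs the mapping, and this sketch ignores occlusion.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class View:
    """Hypothetical container for one calibrated view and its semantic mask."""
    rotation: Sequence[Sequence[float]]  # 3x3 world-to-camera rotation
    translation: Vec3                    # world-to-camera translation
    focal: float                         # focal length in pixels
    cx: float                            # principal point, x
    cy: float                            # principal point, y
    mask: List[List[str]]                # mask[row][col] holds a label name


def project(point: Vec3, view: View) -> Optional[Tuple[int, int]]:
    """Return the (col, row) pixel hit by a 3D point, or None if not visible."""
    x, y, z = (
        sum(view.rotation[r][k] * point[k] for k in range(3)) + view.translation[r]
        for r in range(3)
    )
    if z <= 0:  # behind the camera
        return None
    u = int(round(view.focal * x / z + view.cx))
    v = int(round(view.focal * y / z + view.cy))
    if 0 <= v < len(view.mask) and 0 <= u < len(view.mask[0]):
        return u, v
    return None  # outside the image


def label_points(points: Sequence[Vec3], views: Sequence[View]) -> List[Optional[str]]:
    """Assign each point the label it receives most often across all views."""
    labels: List[Optional[str]] = []
    for p in points:
        votes: Counter = Counter()
        for view in views:
            pixel = project(p, view)
            if pixel is not None:
                u, v = pixel
                votes[view.mask[v][u]] += 1
        labels.append(votes.most_common(1)[0][0] if votes else None)
    return labels


# Toy example: one identity camera looking down +z with a 3x3 mask.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mask = [
    ["sky-other-merged", "sky-other-merged", "sky-other-merged"],
    ["grass-merged", "bench", "grass-merged"],
    ["grass-merged", "grass-merged", "grass-merged"],
]
view = View(identity, (0.0, 0.0, 0.0), focal=1.0, cx=1.0, cy=1.0, mask=mask)
points = [(0.0, 0.0, 1.0), (0.0, -1.0, 1.0), (0.0, 0.0, -1.0), (5.0, 0.0, 1.0)]
labels = label_points(points, [view])
print(labels)
```

Points that fall behind the camera or outside every image stay unlabeled (`None`); a real pipeline would need a policy for those, such as nearest-neighbor label propagation, which is not covered here.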