CUS-GS: A Compact Unified Structured Gaussian Splatting Framework for Multimodal Scene Representation

Yuhang Ming, Chenxin Fang, Xingyuan Yu, Fan Zhang, Weichen Dai, Wanzeng Kong, Guofeng Zhang

TL;DR

The paper addresses the gap between geometry-driven, structured 3D scene representations and semantics-driven multimodal understanding. It introduces CUS-GS, a compact unified framework that couples a voxelized anchor scaffold with a multimodal memory, uses hierarchical query adaptation, and applies feature-aware pruning to fuse appearance, geometry, and semantics from multiple foundation models. The approach achieves competitive rendering quality with a model size of around 20 MB and strong multimodal feature alignment, outperforming several larger baselines on perceptual and semantic metrics and performing robustly on downstream tasks. The work points to a scalable path toward semantically grounded, structure-preserving 3D scene representations suited to real-time rendering and downstream robotics and vision applications, with far fewer parameters.

Abstract

Recent advances in Gaussian Splatting-based 3D scene representation have shown two major trends: semantics-oriented approaches that focus on high-level understanding but lack explicit 3D geometry modeling, and structure-oriented approaches that capture spatial structures yet provide limited semantic abstraction. To bridge this gap, we present CUS-GS, a compact unified structured Gaussian Splatting representation, which connects multimodal semantic features with structured 3D geometry. Specifically, we design a voxelized anchor structure that constructs a spatial scaffold, while extracting multimodal semantic features from a set of foundation models (e.g., CLIP, DINOv2, SEEM). Moreover, we introduce a multimodal latent feature allocation mechanism to unify appearance, geometry, and semantics across heterogeneous feature spaces, ensuring a consistent representation across multiple foundation models. Finally, we propose a feature-aware significance evaluation strategy to dynamically guide anchor growing and pruning, effectively removing redundant or invalid anchors while maintaining semantic integrity. Extensive experiments show that CUS-GS achieves competitive performance compared to state-of-the-art methods using as few as 6M parameters - an order of magnitude smaller than the closest rival at 35M - highlighting the excellent trade-off between performance and model efficiency of the proposed framework.
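
To make the "voxelized anchor structure" concrete, here is a minimal sketch of one common way to build such a scaffold: snap an input point cloud to a regular grid and keep one anchor per occupied voxel. The abstract does not give the construction details, so the function name `voxelize_anchors`, the `voxel_size` parameter, and the choice of voxel centres as anchor positions are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import Iterable, List, Tuple

Point = Tuple[float, float, float]


def voxelize_anchors(points: Iterable[Point], voxel_size: float) -> List[Point]:
    """Return one anchor per occupied voxel (hypothetical helper, not from the paper)."""
    if voxel_size <= 0:
        raise ValueError("voxel_size must be positive")
    # Integer grid index of the voxel containing each point; the set removes duplicates.
    occupied = {
        tuple(math.floor(c / voxel_size) for c in p)
        for p in points
    }
    # Assumption: place each anchor at the centre of its voxel.
    # Sorting only makes the output order deterministic.
    return [
        tuple((i + 0.5) * voxel_size for i in idx)
        for idx in sorted(occupied)
    ]


# Four points, two of which fall into the same unit voxel.
points = [(0.1, 0.2, 0.3), (0.4, 0.1, 0.2), (1.2, 0.1, 0.0), (-0.3, 0.0, 0.9)]
anchors = voxelize_anchors(points, voxel_size=1.0)
print(anchors)
```

In this toy input the first two points share a voxel, so four points yield three anchors. A larger `voxel_size` merges more points per anchor, which is the basic lever for trading scene detail against model size in a scaffold of this kind.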

Paper Structure

This paper contains 18 sections, 15 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: CUS-GS is the first framework to unify structured 3DGS with multimodal semantic modeling. The voxel-anchored structured design produces a geometry-aware and multimodally aligned 3D feature field while maintaining high efficiency, achieving competitive performance with as few as 6M parameters, compared with 35M for the closest rival.
  • Figure 2: Architecture Overview. Our CUS-GS bridges structured 3DGS with multimodal scene understanding through a voxelized scaffold and a unified multimodal memory bank. Each voxel maintains a latent feature that, together with view-dependent appearance cues, is decoded into hierarchical queries and Gaussian attributes. Multimodal features extracted from six foundation models are first compressed via PSC and are then attended to by the learned queries to retrieve aligned semantic features. The resulting Gaussian attributes drive differentiable splatting, while the queried semantics enrich the representation, yielding compact, spatially consistent, and multimodally expressive 3D scene representations.
  • Figure 3: Qualitative results: Comparison of example feature maps reconstructed from different foundation models. Our CUS-GS produces smoother and more spatially coherent representations than M3, with clearer structures in language-driven features (e.g., LLaMA3) and slightly smoother texture features (e.g., DINOv2).
  • Figure 4: Examples of the Image Rendering Results. For each scene, the upper row shows the rendered image, and the lower row presents the residual between the rendered image and the ground truth image.
  • Figure 5: Additional Feature Rendering Results. These results provide complementary evidence that CUS-GS outperforms M3 in producing cleaner and more structured multimodal feature fields.
  • ...and 3 more figures
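
The Figure 2 caption describes learned queries attending to a compressed memory bank to retrieve aligned semantic features. The sketch below illustrates that idea with standard scaled dot-product attention for a single query. The summary does not state the paper's exact attention form, so the `attend` function, the separate key and value lists, and the toy numbers are assumptions chosen only to show the mechanism.

```python
import math
from typing import List, Sequence, Tuple


def attend(
    query: Sequence[float],
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Scaled dot-product attention for one query over a small memory bank."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    # Softmax over the scores; subtracting the max keeps exp() numerically stable.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Retrieved feature: attention-weighted sum of the memory values.
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return out, weights


# Toy memory bank with three entries (hypothetical values).
memory_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
memory_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
retrieved, weights = attend([4.0, 0.0], memory_keys, memory_values)
print(weights)
```

Here the query aligns equally with the first and third keys, so those two entries receive equal weight and the second receives much less. In a full system the query would be decoded from a voxel's latent feature, and the values would be the compressed foundation-model features.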