CUS-GS: A Compact Unified Structured Gaussian Splatting Framework for Multimodal Scene Representation

Yuhang Ming; Chenxin Fang; Xingyuan Yu; Fan Zhang; Weichen Dai; Wanzeng Kong; Guofeng Zhang

TL;DR

The paper addresses the gap between geometry-driven structure and semantics-driven multimodal understanding in 3D scene representations. It introduces CUS-GS, a compact unified framework that couples a voxelized anchor scaffold with a multimodal memory, uses hierarchical query adaptation, and applies feature-aware pruning to fuse appearance, geometry, and semantics from multiple foundation models. The approach achieves competitive rendering quality and strong multimodal feature alignment with a model size of around 20 MB, outperforming several larger baselines on perceptual and semantic metrics and performing robustly on downstream tasks. The work points to a scalable path toward semantically grounded, structure-preserving 3D scene representations that suit real-time rendering and downstream robotics and vision applications while keeping parameter counts low.

Abstract

Recent advances in Gaussian Splatting-based 3D scene representation have shown two major trends: semantics-oriented approaches that focus on high-level understanding but lack explicit 3D geometry modeling, and structure-oriented approaches that capture spatial structures yet provide limited semantic abstraction. To bridge this gap, we present CUS-GS, a compact unified structured Gaussian Splatting representation that connects multimodal semantic features with structured 3D geometry. Specifically, we design a voxelized anchor structure that constructs a spatial scaffold, while extracting multimodal semantic features from a set of foundation models (e.g., CLIP, DINOv2, SEEM). Moreover, we introduce a multimodal latent feature allocation mechanism to unify appearance, geometry, and semantics across heterogeneous feature spaces, ensuring a consistent representation across multiple foundation models. Finally, we propose a feature-aware significance evaluation strategy to dynamically guide anchor growing and pruning, effectively removing redundant or invalid anchors while maintaining semantic integrity. Extensive experiments show that CUS-GS achieves competitive performance compared to state-of-the-art methods using as few as 6M parameters, an order of magnitude smaller than the closest rival at 35M, highlighting the proposed framework's excellent trade-off between performance and model efficiency.
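
To make the idea of a voxelized anchor scaffold with significance-guided pruning more concrete, the following is a minimal sketch, not the authors' implementation. The function names `voxelize_anchors` and `prune_anchors`, the voxel size, and the threshold are all hypothetical. The significance score used here (the number of points falling in each voxel) is only a stand-in for the feature-aware score described in the abstract, whose exact form is not given here.

```python
import numpy as np

def voxelize_anchors(points, voxel_size):
    """Snap 3D points to a voxel grid; return one anchor per occupied voxel.

    Hypothetical helper: illustrates a generic voxel scaffold, not the
    paper's exact construction.
    """
    # Integer voxel index of each point.
    idx = np.floor(points / voxel_size).astype(np.int64)
    # Keep each occupied voxel once; `inverse` maps each point to its voxel.
    voxels, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # flatten for consistency across NumPy versions
    # Place the anchor at the centre of its voxel.
    anchors = (voxels + 0.5) * voxel_size
    return anchors, inverse

def prune_anchors(anchors, significance, threshold):
    """Drop anchors whose significance score falls below `threshold`."""
    keep = significance >= threshold
    return anchors[keep], keep

# Toy example: four points, three of which share one voxel.
points = np.array([
    [0.1, 0.1, 0.1],
    [0.2, 0.3, 0.1],
    [1.2, 0.1, 0.1],
    [0.4, 0.4, 0.4],
])
anchors, inverse = voxelize_anchors(points, voxel_size=0.5)

# Stand-in significance (assumption): points per voxel. The paper's score is
# feature-aware and would replace this line.
significance = np.bincount(inverse, minlength=len(anchors))
kept_anchors, keep_mask = prune_anchors(anchors, significance, threshold=2)
```

In this toy case the four points occupy two voxels, and the sparsely supported voxel is pruned. In a full system the score would presumably also reflect rendering contribution and semantic features, so that pruning preserves semantic integrity as the abstract describes.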

Paper Structure

Table of Contents

Figures (8)