MuSASplat: Efficient Sparse-View 3D Gaussian Splats via Lightweight Multi-Scale Adaptation
Muyu Xu, Fangneng Zhan, Xiaoqin Zhang, Ling Shao, Shijian Lu
TL;DR
MuSASplat tackles the challenges of pose-free sparse-view 3D reconstruction by introducing a lightweight Multi-Scale Adapter for ViT fine-tuning and a batch-wise Feature Fusion Aggregator for efficient multi-view feature fusion. The approach preserves high-quality novel-view rendering while dramatically reducing training cost and memory usage, outperforming both pose-free and some pose-required baselines, especially in 5-view scenarios. Key contributions include the spatially aware MuSA, the memory-efficient FFA, and a point-cloud viewpoint augmentation strategy that enhances performance in sparse or borderline view configurations. Overall, MuSASplat demonstrates state-of-the-art rendering quality with substantially fewer trainable parameters and lower computational demands, offering practical benefits for scalable, unposed 3D reconstruction.
Abstract
Sparse-view 3D Gaussian splatting seeks to render high-quality novel views of 3D scenes from a limited set of input images. While recent pose-free feed-forward methods leveraging pre-trained 3D priors have achieved impressive results, most of them rely on full fine-tuning of large Vision Transformer (ViT) backbones and incur substantial GPU costs. In this work, we introduce MuSASplat, a novel framework that dramatically reduces the computational burden of training pose-free feed-forward 3D Gaussian splatting models with little compromise in rendering quality. Central to our approach is a lightweight Multi-Scale Adapter that enables efficient fine-tuning of ViT-based architectures with only a small fraction of trainable parameters. This design avoids the prohibitive GPU overhead of previous full-model adaptation techniques while maintaining high fidelity in novel view synthesis, even with very sparse input views. In addition, we introduce a Feature Fusion Aggregator that integrates features across input views effectively and efficiently. Unlike widely adopted memory banks, the Feature Fusion Aggregator ensures consistent geometric integration across input views while significantly reducing memory usage, training complexity, and computational cost. Extensive experiments across diverse datasets show that MuSASplat achieves state-of-the-art rendering quality while requiring significantly fewer trainable parameters and training resources than existing methods.
