MuSASplat: Efficient Sparse-View 3D Gaussian Splats via Lightweight Multi-Scale Adaptation

Muyu Xu, Fangneng Zhan, Xiaoqin Zhang, Ling Shao, Shijian Lu

TL;DR

MuSASplat tackles the challenges of pose-free sparse-view 3D reconstruction by introducing a lightweight Multi-Scale Adapter for ViT fine-tuning and a batch-wise Feature Fusion Aggregator for efficient multi-view feature fusion. The approach preserves high-quality novel-view rendering while dramatically reducing training cost and memory usage, outperforming both pose-free and some pose-required baselines, especially in 5-view scenarios. Key contributions include the spatially aware MuSA, the memory-efficient FFA, and a point-cloud viewpoint augmentation strategy that enhances performance in sparse or borderline view configurations. Overall, MuSASplat demonstrates state-of-the-art rendering quality with substantially fewer trainable parameters and lower computational demands, offering practical benefits for scalable, unposed 3D reconstruction.

Abstract

Sparse-view 3D Gaussian splatting seeks to render high-quality novel views of 3D scenes from a limited set of input images. While recent pose-free feed-forward methods leveraging pre-trained 3D priors have achieved impressive results, most of them rely on full fine-tuning of large Vision Transformer (ViT) backbones and incur substantial GPU costs. In this work, we introduce MuSASplat, a novel framework that dramatically reduces the computational burden of training pose-free feed-forward 3D Gaussian splatting models with little compromise in rendering quality. Central to our approach is a lightweight Multi-Scale Adapter that enables efficient fine-tuning of ViT-based architectures while training only a small fraction of the parameters. This design avoids the prohibitive GPU overhead associated with previous full-model adaptation techniques while maintaining high fidelity in novel view synthesis, even with very sparse input views. In addition, we introduce a Feature Fusion Aggregator that integrates features across input views effectively and efficiently. Unlike widely adopted memory banks, the Feature Fusion Aggregator ensures consistent geometric integration across input views while significantly reducing memory usage, training complexity, and computational cost. Extensive experiments across diverse datasets show that MuSASplat achieves state-of-the-art rendering quality while requiring significantly fewer parameters and training resources than existing methods.

Paper Structure

This paper contains 17 sections, 12 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Unlike state-of-the-art sparse-view 3D rendering approaches that require either camera poses (like pixelSplat) or computationally intensive full-model fine-tuning (like NoPoSplat), the proposed MuSASplat achieves superior reconstruction performance from unposed images with far fewer network parameters and much lower computation cost. The plot shows SSIM versus model size (in millions of parameters) on the RE10K dataset. We also report MuSASplat (LoRA3D), a variant in which our adapter is replaced with LoRA3D [lu2024lora3d], showing that the proposed Multi-Scale Adapter provides a clear advantage in accuracy.
  • Figure 2: Overview of the MuSASplat architecture. Given unposed multi-view input images, we extract per-view features using a ViT encoder. The ViT consists of multiple stacked ViT blocks, the basic computational units, each containing self-attention and feed-forward sublayers. Our proposed Multi-Scale Adapter (MuSA) modules are inserted into the blocks to enhance spatial awareness while introducing minimal extra parameters. The resulting features from different views are fused in a single forward pass by the Feature Fusion Aggregator (FFA), which adaptively integrates geometric information using a view-specific quality estimator and a boundary detector, as elaborated in Section 3.4. The fused feature is then decoded by a lightweight ViT decoder and passed to a point head and a Gaussian head to generate the parameters of 3D Gaussian primitives for rendering.
  • Figure 3: Comparison between LoRA and our proposed MuSA layer. Top: LoRA injects a low-rank residual update into the frozen pre-trained model via two linear layers, operating purely in the token space without spatial awareness. Bottom: MuSA reconstructs the spatial layout of tokens into a feature map, applies depth-wise convolutions at multiple kernel sizes to capture local structure, and projects the adapted features back into the token stream. This design enables spatial reasoning while maintaining parameter efficiency.
  • Figure 4: Qualitative comparison on RE10K. Compared to NoPoSplat and pixelSplat, our MuSASplat yields more complete geometry and fewer artifacts in occluded regions. The major differences are highlighted with red boxes.
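
The contrast drawn in the Figure 3 caption can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`lora_linear`, `depthwise_conv2d`, `musa_like_update`), the bottleneck down-projection, the averaging of the multi-scale branches, and the kernel sizes (1, 3, 5) are all hypothetical choices made here for clarity. Only the overall pattern (reshape tokens to a feature map, apply depth-wise convolutions at several kernel sizes, project back into the token stream) comes from the caption.

```python
import numpy as np

def lora_linear(x, w_frozen, a, b):
    """LoRA-style layer: frozen weight plus a low-rank residual.
    x: (N, D_in), w_frozen: (D_in, D_out), a: (D_in, r), b: (r, D_out).
    Operates purely on tokens; no spatial layout is used."""
    return x @ w_frozen + (x @ a) @ b

def depthwise_conv2d(fmap, kernel):
    """Depth-wise 2D convolution (cross-correlation) with zero 'same' padding.
    fmap: (H, W, C), kernel: (k, k, C) with odd k; each channel is filtered
    independently by its own k x k filter."""
    k = kernel.shape[0]
    p = k // 2
    h, w, _ = fmap.shape
    padded = np.pad(fmap, ((p, p), (p, p), (0, 0)))
    out = np.zeros_like(fmap, dtype=float)
    for i in range(k):
        for j in range(k):
            out += padded[i:i + h, j:j + w, :] * kernel[i, j, :]
    return out

def musa_like_update(tokens, grid_hw, down, kernels, up):
    """Hypothetical multi-scale adapter in the spirit of the Figure 3 caption.
    tokens: (H*W, D) patch tokens in row-major order.
    down: (D, r) bottleneck projection (an assumption, not from the source).
    kernels: list of (k, k, r) depth-wise filters, one per scale.
    up: (r, D) projection back into the token stream."""
    h, w = grid_hw
    fmap = (tokens @ down).reshape(h, w, -1)      # restore spatial layout
    branches = [depthwise_conv2d(fmap, k) for k in kernels]
    mixed = sum(branches) / len(branches)         # assumed fusion: plain average
    return tokens + mixed.reshape(h * w, -1) @ up  # residual update

# Usage: a 4x4 token grid with 8-dim tokens and a rank-2 bottleneck.
rng = np.random.default_rng(0)
H, W, D, R = 4, 4, 8, 2
tokens = rng.normal(size=(H * W, D))
down = rng.normal(size=(D, R))
kernels = [rng.normal(size=(k, k, R)) for k in (1, 3, 5)]
up = np.zeros((R, D))  # zero-initialised output projection
adapted = musa_like_update(tokens, (H, W), down, kernels, up)
```

With the output projection initialised to zero, as is common for adapters, the adapted tokens equal the input tokens, so the frozen backbone's behaviour is unchanged at the start of fine-tuning. In this sketch the trainable parameters are `down`, `kernels`, and `up`; the depth-wise filters add only k·k·r values per scale.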