LeanGaussian: Breaking Pixel or Point Cloud Correspondence in Modeling 3D Gaussians

Jiamin Wu, Kenkun Liu, Han Gao, Xiaoke Jiang, Yao Yuan, Lei Zhang

TL;DR

LeanGaussian eliminates one-to-one pixel/point correspondences by modeling each 3D Gaussian as a learnable query within a deformable Transformer, using the projected 3D Gaussian centers as 2D reference points to guide attention. The method employs a depth-aware feature extractor, layerwise refinement, and a GaussianDFA cross-attention mechanism to iteratively update 3D Gaussian ellipsoids directly from a single RGB image. It achieves state-of-the-art performance on ShapeNet-SRN and Google Scanned Objects with fast reconstruction and rendering speeds (7.2 FPS and 500 FPS, respectively) without requiring dense 3D supervision. The work demonstrates that explicit 3D Gaussians learned through queries can outperform traditional pixel- or point-aligned Gaussian representations, enabling more efficient and geometrically accurate novel view synthesis with practical inference efficiency.

Abstract

Recently, Gaussian splatting has demonstrated significant success in novel view synthesis. Current methods often regress Gaussians with pixel or point cloud correspondence, linking each Gaussian to a pixel or a 3D point. As a result, redundant Gaussians are spent overfitting the correspondence rather than the objects they are meant to represent, which wastes resources and yields inaccurate geometry or texture. In this paper, we introduce LeanGaussian, a novel approach that treats each query in a deformable Transformer as one 3D Gaussian ellipsoid, breaking the pixel or point cloud correspondence constraints. We leverage a deformable decoder to iteratively refine the Gaussians layer by layer, with the image features as keys and values. Notably, the center of each 3D Gaussian serves as a 3D reference point, which is then projected onto the image for deformable attention in 2D space. On both the ShapeNet SRN dataset (category level) and the Google Scanned Objects dataset (open-category level, trained with the Objaverse dataset), our approach outperforms prior methods by approximately 6.1%, achieving PSNRs of 25.44 and 22.36, respectively. Additionally, our method achieves a 3D reconstruction speed of 7.2 FPS and a rendering speed of 500 FPS. Code is available at https://github.com/jwubz123/LeanGaussian.
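The projection-then-sample step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`project_centers`, `bilinear_sample`, `deformable_sample`), the pinhole camera model, and the single-head, single-scale setup are assumptions made for illustration. In an actual deformable attention layer the offsets and weights would be predicted from each query; here they are passed in directly.

```python
import numpy as np

def project_centers(centers, world_to_cam, K):
    """Project 3D Gaussian centers (N, 3) to pixel coordinates (N, 2).

    Assumes a pinhole camera: world_to_cam is a 4x4 extrinsic matrix and
    K is a 3x3 intrinsic matrix (both hypothetical inputs).
    """
    n = centers.shape[0]
    homo = np.concatenate([centers, np.ones((n, 1))], axis=1)  # (N, 4)
    cam = (world_to_cam @ homo.T).T[:, :3]                     # camera frame
    uvw = (K @ cam.T).T                                        # (N, 3)
    return uvw[:, :2] / uvw[:, 2:3]                            # divide by depth

def bilinear_sample(feat, xy):
    """Bilinearly sample feat (H, W, C) at pixel coordinates xy (N, 2) as (x, y)."""
    h, w, _ = feat.shape
    x = np.clip(xy[:, 0], 0, w - 1)
    y = np.clip(xy[:, 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bot = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bot * wy

def deformable_sample(feat, ref_xy, offsets, weights):
    """Aggregate features around each reference point.

    ref_xy:  (N, 2) projected Gaussian centers (the 2D reference points)
    offsets: (N, P, 2) sampling offsets per query (given here, learned in practice)
    weights: (N, P) attention weights per query, assumed to sum to 1
    """
    n, p, _ = offsets.shape
    pts = ref_xy[:, None, :] + offsets                         # (N, P, 2)
    sampled = bilinear_sample(feat, pts.reshape(-1, 2)).reshape(n, p, -1)
    return (weights[:, :, None] * sampled).sum(axis=1)         # (N, C)

if __name__ == "__main__":
    K = np.array([[100.0, 0.0, 64.0], [0.0, 100.0, 64.0], [0.0, 0.0, 1.0]])
    centers = np.array([[0.0, 0.0, 2.0], [0.5, 0.0, 2.0]])
    ref = project_centers(centers, np.eye(4), K)
    feat = np.random.rand(128, 128, 16)                        # toy feature map
    offsets = np.random.randn(2, 4, 2)                         # 4 sampling points
    weights = np.full((2, 4), 0.25)
    print(deformable_sample(feat, ref, offsets, weights).shape)
```

Because each query only reads a handful of sampled locations around its projected center, the cost per Gaussian in this sketch does not depend on the number of pixels, which is the property the abstract contrasts with pixel-aligned regression.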

Paper Structure (36 sections, 5 equations, 12 figures, 9 tables)

Figures (12)

  • Figure 1: Comparison between our method and previous approaches. (a) Splatter Image directly models Gaussians for each pixel from image features, often resulting in an overabundance of background Gaussians. Triplane-Gaussian models 3D Gaussians using a dense point cloud and triplane representation, leading to intricate and inefficient modeling and computation. In contrast, our method models 3D Gaussians through queries, maintaining a streamlined representation. The centers of these Gaussian ellipsoids are shown as a point cloud for visualization, and the dashed line denotes the projection from a 3D Gaussian onto the 2D plane. (b) Comparison of memory usage between our model and previous work. As resolution increases, memory cost grows quadratically for Splatter Image but not for our approach. Triplane-Gaussian has a much larger memory cost.
  • Figure 2: (a) Overview of LeanGaussian. The initial Gaussians are calculated from random queries using the splatter head. $\dashrightarrow$: steps used only during training. (b) Detailed structure for feature fusion in the feature extractor. (c) Detailed structure of one decoder layer. Queries are updated at each layer and serve as input for the next layer, while the reference points are updated based on the new centers of the Gaussians and projected onto the image feature plane. GaussianDFA: deformable cross-attention layer; FFN: Feed Forward Network; $\bigoplus$: update of the 3D Gaussians. $q^l$ denotes the query of the $l$-th layer.
  • Figure 3: The centers of the 3D Gaussians are projected onto the image feature maps. With learned sampling offsets, deformable attention is performed between the queries and the features at the sampling points.
  • Figure 4: Comparison of NVS results between models trained on Objaverse LVIS and tested on GSO dataset.
  • Figure 5: Comparison of NVS results between models on ShapeNet SRN. In the chairs dataset, certain objects reconstructed by Splatter Image exhibit subpar geometry and color accuracy. For the cars dataset, the white car shows an incorrect seat color, whereas the blue and red cars exhibit incorrect geometry.
  • ...and 7 more figures
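
The scaling trend described for Figure 1(b) can be illustrated with simple arithmetic. The sketch below counts only the output Gaussian parameters, not total model memory, so it does not reproduce the figure's measurements. The per-Gaussian layout (14 floats: 3 position, 3 scale, 4 rotation, 1 opacity, 3 color) and the query count of 4096 are assumptions chosen for illustration and are not taken from the paper.

```python
FLOATS_PER_GAUSSIAN = 14   # assumed layout: 3 pos + 3 scale + 4 rot + 1 opacity + 3 color
BYTES_PER_FLOAT = 4        # float32

def pixel_aligned_gaussians(resolution: int) -> int:
    """One Gaussian per pixel of a square image: grows quadratically."""
    return resolution * resolution

def query_based_gaussians(num_queries: int, resolution: int) -> int:
    """One Gaussian per query: independent of the image resolution."""
    return num_queries

def output_bytes(num_gaussians: int) -> int:
    """Bytes needed to store the Gaussian parameters alone."""
    return num_gaussians * FLOATS_PER_GAUSSIAN * BYTES_PER_FLOAT

if __name__ == "__main__":
    num_queries = 4096     # hypothetical query count
    for res in (64, 128, 256, 512):
        pix = output_bytes(pixel_aligned_gaussians(res))
        qry = output_bytes(query_based_gaussians(num_queries, res))
        print(f"{res:4d} px  pixel-aligned: {pix:>10d} B  query-based: {qry:>8d} B")
```

Doubling the resolution quadruples the pixel-aligned count, while the query-based count stays fixed under these assumptions.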