UniGS: Modeling Unitary 3D Gaussians for Novel View Synthesis from Sparse-view Images

Jiamin Wu, Kenkun Liu, Yukai Shi, Xiaoke Jiang, Yuan Yao, Lei Zhang

TL;DR

UniGS introduces a unitary 3D Gaussian representation for sparse-view novel view synthesis. It uses a DETR-like encoder–decoder with multi-view deformable cross-attention (MVDFA) to update a fixed set of world-space Gaussians, aided by camera-modulated, view-specific queries and a spatially efficient self-attention (SESA) module. The approach mitigates ghosting, allocates more Gaussians to complex regions, and supports an arbitrary number of input views at inference without retraining. When trained on Objaverse and tested on the GSO benchmark, it improves PSNR by 4.2 dB over existing methods and also produces strong qualitative results. The method is validated through ablations on initialization, cross-view attention, SESA, and view-count generalization, and it can be applied in 3D generation and text-to-3D pipelines.

Abstract

In this work, we introduce UniGS, a novel 3D Gaussian reconstruction and novel view synthesis model that predicts a high-fidelity representation of 3D Gaussians from an arbitrary number of posed sparse-view images. Previous methods often regress 3D Gaussians locally on a per-pixel basis for each view, then transfer them to world space and merge them through point concatenation. In contrast, our approach models unitary 3D Gaussians in world space and updates them layer by layer. To leverage information from multi-view inputs for updating the unitary 3D Gaussians, we develop a DETR (DEtection TRansformer)-like framework, which treats 3D Gaussians as queries and updates their parameters by performing multi-view deformable cross-attention (MVDFA) across multiple input images, which are treated as keys and values. This approach effectively avoids the 'ghosting' issue and allocates more 3D Gaussians to complex regions. Moreover, since the number of 3D Gaussians used as decoder queries is independent of the number of input views, our method accepts an arbitrary number of multi-view images as input without causing memory explosion or requiring retraining. Extensive experiments validate the advantages of our approach, showing superior performance over existing methods both quantitatively (improving PSNR by 4.2 dB when trained on Objaverse and tested on the GSO benchmark) and qualitatively. The code will be released at https://github.com/jwubz123/UNIG.
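The claim that the query count is independent of the number of input views can be illustrated with a small arithmetic sketch. It assumes a per-pixel baseline that regresses one Gaussian per input pixel and concatenates them across views; the function names and the query count of 16384 are hypothetical values chosen for illustration, not figures from the paper.

```python
def per_pixel_gaussian_count(num_views: int, height: int, width: int) -> int:
    """Gaussians produced when one is regressed per pixel per view and all are concatenated."""
    return num_views * height * width


def unitary_gaussian_count(num_queries: int, num_views: int) -> int:
    """Gaussians produced by a fixed set of decoder queries; num_views is deliberately unused."""
    return num_queries


# Hypothetical settings: 128x128 inputs and 16384 queries (illustrative, not from the paper).
for views in (2, 4, 8):
    print(views, per_pixel_gaussian_count(views, 128, 128), unitary_gaussian_count(16384, views))
```

Under these assumptions the per-pixel count doubles each time the number of views doubles, while the unitary count stays fixed.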

Paper Structure

This paper contains 63 sections, 6 equations, 20 figures, 12 tables.

Figures (20)

  • Figure 1: (a) Previous methods such as LGM first predict 3D Gaussians for each pixel of each view and then merge them to obtain the final 3D Gaussians, resulting in a 'ghosting' issue. Moreover, the 3D Gaussians are distributed evenly across simple and complex regions, whereas complex regions should receive more of them. (b) In contrast, our approach uses a unitary set of 3D Gaussians, projecting them onto each view and gathering information across views through a global optimization strategy. Our model effectively avoids the 'ghosting' problem and assigns more 3D Gaussians to complex areas (such as the 'door' in the image). (c) Our approach supports an arbitrary number of input views without retraining, and performance does not degrade (trained on 4 input views and tested on 2 to 8 views).
  • Figure 2: UniGS: Queries are updated by $L$ decoder layers with multi-view image features extracted by the feature extractor. 3D Gaussians are regressed from the queries by an MLP in each layer. Subsequently, they are passed into the next layer and projected onto each view to derive reference points. MVDFA: multi-view deformable attention (see the decoder section). SESA: spatially efficient self-attention (see the space-efficient self-attention section). The dashed arrow indicates that the centers of the 3D Gaussians are projected onto the multi-view feature maps to retrieve the most relevant features.
  • Figure 3: MVDFA: $\mathbf{Q}_n$ denotes the $n$-th unitary query, while $\mathbf{q}_{ni}$ denotes the $n$-th query on the $i$-th view, modulated by the $i$-th camera $\mathbf{Cam}_i$. Sampling offsets $\Delta\mathbf{s}_{ni}$ and attention scores $\boldsymbol{\alpha}_{ni}$ are derived by applying a linear transformation to $\mathbf{q}_{ni}$. The offsets define the sampling points $\mathbf{s}_{ni} = \mathbf{P}_{ni} + \Delta\mathbf{s}_{ni}$, where $\mathbf{P}_{ni}$ is the reference point obtained by projecting the $n$-th 3D Gaussian onto the $i$-th view. Image features sampled at $\mathbf{s}_{ni}$ serve as values $\mathbf{v}_{ni}$, which update the view-specific queries according to the attention scores $\boldsymbol{\alpha}_{ni}$. The unitary queries are refined by the weighted sum of the updated view-specific queries $\mathbf{q}'_{ni}$, where $w_i$ is the weight computed by a linear layer on $\mathbf{q}'_{ni}$. $B$ is the batch size, $I$ is the number of views, $C$ is the hidden dimension, $N$ is the number of Gaussians, and $\textit{pinhole\_proj}$ is the projection from 3D to 2D with the pinhole model. $\mathbf{F}$ is the image feature map with height $H$ and width $W$. $\mathbf{K}$ and $\boldsymbol{\pi}$ are the camera intrinsics and extrinsics, respectively.
  • Figure 4: Novel views on the GSO dataset with 4 input views at resolution 128.
  • Figure 5: 3D Gaussian centers shown as point clouds on the GSO dataset with 4 input views.
  • ...and 15 more figures
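
The MVDFA update described in the Figure 3 caption can be sketched in NumPy as follows. This is a minimal, unbatched illustration under stated assumptions, not the authors' implementation. The additive camera embedding `cam_emb`, the 4x4 world-to-camera matrices `w2cs` standing in for the extrinsics, the softmax normalization of the view weights, the residual form of the query update, and all weight shapes (`W_off`, `W_attn`, `w_view`) are assumptions introduced here for illustration.

```python
import numpy as np


def pinhole_proj(centers, K, w2c):
    """Project world-space points (N, 3) to pixel coordinates (N, 2)."""
    homo = np.concatenate([centers, np.ones((len(centers), 1))], axis=1)  # (N, 4)
    cam = (w2c @ homo.T).T[:, :3]      # camera-space points
    uvw = (K @ cam.T).T                # homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]    # perspective divide


def bilinear_sample(feat, pts):
    """Bilinearly sample feat (H, W, C) at pixel coords pts (..., 2) given as (x, y)."""
    H, W, _ = feat.shape
    x = np.clip(pts[..., 0], 0, W - 1)
    y = np.clip(pts[..., 1], 0, H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = (x - x0)[..., None]
    wy = (y - y0)[..., None]
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bot = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bot * wy


def softmax(z, axis):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def mvdfa(Q, centers, feats, Ks, w2cs, cam_emb, W_off, W_attn, w_view):
    """One simplified multi-view deformable attention update of unitary queries Q (N, C)."""
    N, _ = Q.shape
    S = W_attn.shape[1]                                    # sampling points per query
    updated, scores = [], []
    for F, K, w2c, emb in zip(feats, Ks, w2cs, cam_emb):
        q = Q + emb                                        # view-specific query (assumed additive modulation)
        P = pinhole_proj(centers, K, w2c)                  # reference points, (N, 2)
        offsets = (q @ W_off).reshape(N, S, 2)             # sampling offsets
        alpha = softmax(q @ W_attn, axis=1)                # attention scores, (N, S)
        v = bilinear_sample(F, P[:, None, :] + offsets)    # values, (N, S, C)
        q_new = q + (alpha[..., None] * v).sum(axis=1)     # updated view-specific query
        updated.append(q_new)
        scores.append(q_new @ w_view)                      # per-view score, (N,)
    w = softmax(np.stack(scores), axis=0)                  # normalize across views (assumption)
    return (w[..., None] * np.stack(updated)).sum(axis=0)  # refined unitary queries, (N, C)


# Toy usage with random data: the output shape does not depend on the number of views.
rng = np.random.default_rng(0)
N, C, S, H, W = 5, 8, 4, 16, 16
K = np.array([[20.0, 0.0, 8.0], [0.0, 20.0, 8.0], [0.0, 0.0, 1.0]])
w2c = np.eye(4)
w2c[2, 3] = 3.0                                            # scene sits 3 units in front of the camera
Q = rng.normal(size=(N, C))
centers = rng.uniform(-0.5, 0.5, size=(N, 3))
W_off = 0.1 * rng.normal(size=(C, S * 2))
W_attn = rng.normal(size=(C, S))
w_view = rng.normal(size=C)
for num_views in (2, 4, 8):
    feats = [rng.normal(size=(H, W, C)) for _ in range(num_views)]
    embs = list(rng.normal(size=(num_views, C)))
    out = mvdfa(Q, centers, feats, [K] * num_views, [w2c] * num_views, embs, W_off, W_attn, w_view)
    print(num_views, out.shape)
```

Because the views are consumed in a loop and reduced by a weighted sum, the same weights handle 2, 4, or 8 views and always return an array of shape `(N, C)`, which mirrors the view-count independence the captions describe.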