Omni-Scene: Omni-Gaussian Representation for Ego-Centric Sparse-View Scene Reconstruction

Dongxu Wei, Zhiqi Li, Peidong Liu

TL;DR

Omni-Scene introduces the Omni-Gaussian representation, which unifies pixel-based and volume-based Gaussians for ego-centric sparse-view reconstruction. The Volume Builder (Triplane Transformer + Volume Decoder) and Pixel Decorator (Multi-View U-Net + Pixel Decoder) produce complementary Gaussian fields, which are combined through Projection-Based Feature Fusion and Depth-Guided Training Decomposition into the full Omni-Gaussians used for novel-view rendering. Experiments show substantial gains over pixelSplat and MVSplat on ego-centric reconstruction and competitive performance on RealEstate10K; ablations support the value of cross-representation collaboration, depth initialization, and efficient 3D feature encoding. The approach enables fast, high-fidelity 3D scene reconstruction from single-frame surround views and, when integrated with diffusion-based 2D models, supports multi-modal 3D scene generation, with applications in autonomous driving and 3D content creation.

Abstract

Prior works employing pixel-based Gaussian representation have demonstrated efficacy in feed-forward sparse-view reconstruction. However, such representation necessitates cross-view overlap for accurate depth estimation, and is challenged by object occlusions and frustum truncations. As a result, these methods require scene-centric data acquisition to maintain cross-view overlap and complete scene visibility to circumvent occlusions and truncations, which limits their applicability to scene-centric reconstruction. In contrast, in autonomous driving scenarios, a more practical paradigm is ego-centric reconstruction, which is characterized by minimal cross-view overlap and frequent occlusions and truncations. The limitations of pixel-based representation thus hinder the utility of prior works in this task. In light of this, this paper conducts an in-depth analysis of different representations, and introduces Omni-Gaussian representation with tailored network design to complement their strengths and mitigate their drawbacks. Experiments show that our method significantly surpasses state-of-the-art methods, pixelSplat and MVSplat, in ego-centric reconstruction, and achieves comparable performance to prior works in scene-centric reconstruction.

Paper Structure

This paper contains 22 sections, 4 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Illustration of our Omni-Gaussian representation. Omni-Gaussian incorporates two representations, pixel-based and volume-based Gaussians. In (a), we illustrate failure cases that arise when relying solely on one representation (i.e., Cases 1 and 2 for pixel-based, Cases 3 and 4 for volume-based), and how the other representation compensates for these shortcomings. In (b)-(e), we present examples of these cases in ego-centric driving scene reconstruction. Green dashed lines denote areas plausibly rendered in novel views, while red ones highlight undesirable artifacts caused by weaknesses of pixel-based or volume-based Gaussians. By leveraging the complementary nature of the two representations, Omni-Gaussian achieves the best results in all cases.
  • Figure 2: Overview. (a) We obtain images $\{\boldsymbol{I}^i\}^K_{i=1}$ from surrounding cameras with minimal overlap (e.g., adjacent image areas enclosed by green rectangles) in a single frame, and extract 2D features using an image backbone. (b) In the Volume Builder, we first use the Triplane Transformer to lift the 2D features $\{\boldsymbol{F}^i\}^K_{i=1}$ into a 3D volume space compressed into three orthogonal planes, where cross-image and cross-plane deformable attention enhances feature encoding. The Volume Decoder then takes voxels as anchors and predicts nearby Gaussians $\mathcal{G}_V$ for each voxel from features sampled from the three planes through bilinear interpolation. (c) In the Pixel Decorator, we use a Multi-View U-Net to propagate information across views and extract multiple 2D features, from which the Pixel Decoder predicts pixel-based Gaussians $\mathcal{G}_P$ along rays. Through Volume-Pixel Collaborations, including Projection-Based Feature Fusion and Depth-Guided Training Decomposition, $\mathcal{G}_V$ and $\mathcal{G}_P$ complement each other, yielding the full Omni-Gaussians $\mathcal{G}$ for novel-view rendering.
  • Figure 3: Comparisons on nuScenes [nusc2020]. Images of input views (Inputs) and ground-truth novel views (GTs) are outlined by orange and blue rectangles, respectively. The remaining images are generated novel views and depth maps (warmer colors denote greater distance, cooler colors shorter distance). The red dashed circles denote undesirable artifacts, while the green ones denote plausibly rendered areas.
  • Figure 4: Comparisons on RealEstate10K [re10k2018]. The red dashed circles denote undesirable artifacts, while the green ones denote plausibly rendered areas.
  • Figure 5: Multi-modal 3D scene generation. We accept multi-modal conditions (i.e., 3D boxes, BEV map, textual descriptions) as inputs, and generate the corresponding 3D driving scenes in a feed-forward manner. For better visualization, we render 360-degree rotation videos of the generated 3D scenes, and stitch frames into panoramic images as shown in (a) and (b). The styles of the generated scenes closely match the textual conditions. Moreover, although appearance varies with the random seed, spatial consistency with the conditional 3D boxes (denoted by colored rectangles in (a) and (b)) is well preserved. Compared to the per-scene optimization-based method MagicDrive3D [md3d2024], which produces the artifacts highlighted by red dashed lines in (c), we achieve higher quality with better visual details. Please consult our supplementary material for comparisons in video format, where the differences in visual quality are easier to observe.
  • ...and 7 more figures
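
The Figure 2 caption says the Volume Decoder treats voxels as anchors and gathers features from the three orthogonal planes by bilinear interpolation. The following is a minimal NumPy sketch of that sampling step only, not the authors' implementation. The function names, the dictionary keys `xy`/`xz`/`yz`, the square plane resolution, the scalar scene bounds, and summing the three per-plane features are all assumptions made for illustration; the source does not state how the sampled features are aggregated.

```python
import numpy as np


def bilinear_sample(plane, u, v):
    """Bilinearly sample an (H, W, C) feature plane at continuous coordinates.

    u indexes the W axis and v the H axis, both in pixel units with shape (N,).
    Coordinates outside the plane are clipped to its border.
    """
    H, W, _ = plane.shape
    u = np.clip(u, 0.0, W - 1.0)
    v = np.clip(v, 0.0, H - 1.0)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, W - 1)
    v1 = np.minimum(v0 + 1, H - 1)
    wu = (u - u0)[:, None]  # fractional offset along W
    wv = (v - v0)[:, None]  # fractional offset along H
    top = (1 - wu) * plane[v0, u0] + wu * plane[v0, u1]
    bottom = (1 - wu) * plane[v1, u0] + wu * plane[v1, u1]
    return (1 - wv) * top + wv * bottom


def sample_triplane(planes, points, bounds):
    """Gather one feature vector per 3D anchor point from three planes.

    planes: dict with keys 'xy', 'xz', 'yz' (hypothetical names), each (R, R, C).
    points: (N, 3) anchor positions, e.g. voxel centers.
    bounds: (lo, hi) scalars giving the scene extent on every axis (assumed cubic).
    Returns (N, C) features; summation across planes is an assumption.
    """
    lo, hi = bounds
    R = planes["xy"].shape[0]
    grid = (points - lo) / (hi - lo) * (R - 1)  # world coords -> plane pixels
    x, y, z = grid[:, 0], grid[:, 1], grid[:, 2]
    return (
        bilinear_sample(planes["xy"], x, y)
        + bilinear_sample(planes["xz"], x, z)
        + bilinear_sample(planes["yz"], y, z)
    )
```

In a full pipeline, a decoder network would map each sampled feature vector to the parameters of the Gaussians near its voxel; that part is omitted here because the caption does not specify it.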