F3D-Gaus: Feed-forward 3D-aware Generation on ImageNet with Cycle-Aggregative Gaussian Splatting

Yuxin Wang; Qianyi Wu; Dan Xu

TL;DR

This work tackles generalizable 3D-aware generation from monocular data with F3D-Gaus, a feed-forward framework that predicts pixel-aligned 3D Gaussian Splatting (GS) primitives from RGB-D inputs. A cycle-aggregative self-supervised strategy fuses the representations predicted from the canonical and novel views through complementary aggregation and cycle supervision, enforcing cross-view consistency without multi-view supervision. To further polish wide-angle renders, the method applies geometry-guided texture refinement: a video in-painting model repaints the regions flagged by artifact masks derived from alpha and normal maps. Across ImageNet and other datasets, F3D-Gaus achieves state-of-the-art realism (FID/NFS) and competitive depth quality while offering faster training and inference, demonstrating strong generalization and practical applicability for scalable 3D content generation.
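As a rough illustration of what "pixel-aligned" means geometrically, the sketch below back-projects a depth map into one 3D center per pixel. It is only a sketch under stated assumptions: the pinhole `Intrinsics` class, the function name `unproject_depth`, and the toy values are hypothetical and not taken from the paper, and the actual model predicts the remaining Gaussian attributes (and possibly refined positions) with a network.

```python
from dataclasses import dataclass


@dataclass
class Intrinsics:
    """Hypothetical pinhole intrinsics in pixels (an assumption, not from the paper)."""
    fx: float
    fy: float
    cx: float
    cy: float


def unproject_depth(depth, K):
    """Return one 3D center per pixel, in row-major order.

    depth[v][u] is the z-depth at pixel column u, row v.
    """
    centers = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            # Invert the pinhole projection u = fx * x / z + cx (same for v).
            x = (u - K.cx) / K.fx * z
            y = (v - K.cy) / K.fy * z
            centers.append((x, y, z))
    return centers


# Toy 3x3 depth map with the principal point at the center pixel.
K = Intrinsics(fx=2.0, fy=2.0, cx=1.0, cy=1.0)
depth = [[1.0, 1.0, 1.0],
         [1.0, 2.0, 1.0],
         [1.0, 1.0, 1.0]]
centers = unproject_depth(depth, K)
```

Each input pixel yields exactly one primitive, which is why a single view leaves holes once the camera moves away from the input viewpoint.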

Abstract

This paper tackles the problem of generalizable 3D-aware generation from monocular datasets, e.g., ImageNet. The key challenge of this task is learning a robust 3D-aware representation without multi-view or dynamic data, while ensuring consistent texture and geometry across different viewpoints. Although some baseline methods are capable of 3D-aware generation, the quality of their generated images still lags behind state-of-the-art 2D generation approaches, which excel at producing high-quality, detailed images. To address this limitation, we propose a novel feed-forward pipeline based on pixel-aligned Gaussian Splatting, named F3D-Gaus, which can produce more realistic and reliable 3D renderings from monocular inputs. In addition, we introduce a self-supervised cycle-aggregative constraint to enforce cross-view consistency in the learned 3D representation. This training strategy naturally allows aggregation of multiple aligned Gaussian primitives and significantly alleviates the interpolation limitations inherent in single-view pixel-aligned Gaussian Splatting. Furthermore, we incorporate video model priors to perform geometry-aware refinement, enhancing the generation of fine details in wide-viewpoint scenarios and improving the model's capability to capture intricate 3D textures. Extensive experiments demonstrate that our approach not only achieves high-quality, multi-view consistent 3D-aware generation from monocular datasets, but also significantly improves training and inference efficiency.
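The data flow of the cycle constraint can be sketched as follows. This is a minimal illustration, not the paper's implementation: `predict` and `render` are hypothetical stand-ins for the feed-forward network and the splatting renderer, the L1 loss is an assumed choice, the aggregation step is omitted, and the toy "world" is a 1D ring in which a view is just an integer shift.

```python
def cycle_loss(image_0, depth_0, predict, render, view_0, view_1):
    """Mean absolute error after the cycle canonical -> novel -> canonical.

    predict(image, depth, view) -> representation   (stand-in for the network)
    render(representation, view) -> (image, depth)  (stand-in for the renderer)
    """
    gs_0 = predict(image_0, depth_0, view_0)
    image_1, depth_1 = render(gs_0, view_1)   # novel view: no ground truth
    gs_1 = predict(image_1, depth_1, view_1)
    image_back, _ = render(gs_1, view_0)      # canonical view: supervised
    return sum(abs(a - b) for a, b in zip(image_back, image_0)) / len(image_0)


def _rotate(values, shift):
    """Cyclically shift a list; positive shift moves content to the left."""
    n = len(values)
    return [values[(i + shift) % n] for i in range(n)]


def toy_render(rep, view):
    # Toy renderer: viewing the ring from `view` rotates it by `view`.
    return _rotate(rep["color"], view), _rotate(rep["depth"], view)


def toy_predict(image, depth, view):
    # Toy view-consistent predictor: undo the rotation of the input view.
    return {"color": _rotate(image, -view), "depth": _rotate(depth, -view)}


def toy_predict_inconsistent(image, depth, view):
    # Toy predictor that ignores the viewpoint, so the cycle does not close.
    return {"color": list(image), "depth": list(depth)}


image_0 = [0.0, 1.0, 2.0, 3.0]
depth_0 = [1.0, 1.0, 1.0, 1.0]
loss_ok = cycle_loss(image_0, depth_0, toy_predict, toy_render, 0, 1)
loss_bad = cycle_loss(image_0, depth_0, toy_predict_inconsistent, toy_render, 0, 1)
```

In this toy setting the loss vanishes only for the predictor whose outputs agree across views, which is the property the constraint is meant to encourage.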

Paper Structure (19 sections, 13 equations, 14 figures, 5 tables)

Introduction
Related Works
Novel view synthesis for single input
3D Gaussian Splatting
The Proposed Framework: F3D-Gaus
Preliminary
Cycle-aggregative Self-supervised Strategy
Geometry-guided Texture Refinement
Experiments
Experimental Setup
Main Results
Ablation Study
Conclusion
Additional Experiments
More Qualitative Results
...and 4 more sections

Figures (14)

Figure 1: Illustration of our motivation for cycle self-supervised training. For monocular datasets: (a) supervision is naturally available for the canonical view. (b) For novel views, where supervision is absent, we use the rendered novel-view image as input to obtain its 3D representation. This 3D representation is then re-rendered from the canonical view, where supervision is available. Red arrows indicate feed-forward 3D representation prediction from a monocular image, while blue arrows represent the rendering processes from 3D representations at different specific viewpoints.
Figure 2: Illustration of our overall framework. Given a single RGB image $I_0$ and depth map $D_0$, our model predicts, in a single feed-forward pass, the pixel-aligned Gaussian Splatting representation $GS_0$, which can be used for novel view synthesis. From $GS_0$, we render the image $\tilde{I}_1$ and depth map $\tilde{D}_1$ for the novel view, and feed them through the model again to obtain the corresponding 3DGS $GS_1$. These two 3DGS representations are subsequently aggregated to produce the images for supervision. This novel self-supervised training strategy enforces cycle-aggregative 3D representation learning across different views, allowing the generalized 3DGS representations to reinforce each other, thereby collaboratively enhancing the overall 3D representation capability.
Figure 3: Illustration of the proposed cycle-aggregative self-supervised strategy. We guide complementary aggregation by leveraging the differences between the alpha maps of the two 3DGS from different viewpoints.
Figure 4: Illustration of geometry-guided texture refinement. (a) illustrates artifact localization in novel views, while (b) shows geometry mask-guided sequence in-painting.
Figure 5: Qualitative visualization of rendered images and depth maps on the ImageNet dataset. Our method can generate novel view images along with corresponding depth maps for input images across various categories.
...and 9 more figures
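One plausible reading of the alpha-guided complementary aggregation in Figure 3 is sketched below. Everything here is an assumption made for illustration: the function names, the threshold of 0.5, and the rule "add a primitive from the second representation wherever its alpha exceeds the first representation's alpha by more than the threshold" are hypothetical and are not confirmed by the captions.

```python
def complementary_mask(alpha_0, alpha_1, threshold=0.5):
    """Mark pixels that the second representation covers and the first misses.

    alpha_0 and alpha_1 are alpha maps (lists of rows) rendered at the same view.
    """
    return [
        [(a1 - a0) > threshold for a0, a1 in zip(row_0, row_1)]
        for row_0, row_1 in zip(alpha_0, alpha_1)
    ]


def aggregate(gs_0, gs_1, mask):
    """Keep every primitive of gs_0 and add the masked primitives of gs_1.

    gs_1 is assumed pixel-aligned with the mask: one primitive per pixel,
    in row-major order.
    """
    flat_mask = [m for row in mask for m in row]
    return list(gs_0) + [g for g, m in zip(gs_1, flat_mask) if m]


# Toy 2x2 example: the first representation leaves holes in the right column.
alpha_0 = [[1.0, 0.1],
           [0.9, 0.0]]
alpha_1 = [[1.0, 0.9],
           [1.0, 1.0]]
mask = complementary_mask(alpha_0, alpha_1)
merged = aggregate(["a0", "a1", "a2", "a3"], ["b0", "b1", "b2", "b3"], mask)
```

Selecting only the complementary primitives keeps the merged set from duplicating surfaces that the first representation already explains.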
