F3D-Gaus: Feed-forward 3D-aware Generation on ImageNet with Cycle-Aggregative Gaussian Splatting
Yuxin Wang, Qianyi Wu, Dan Xu
TL;DR
This work tackles generalizable 3D-aware generation from monocular data by introducing F3D-Gaus, a feed-forward framework that predicts pixel-aligned 3D Gaussian Splatting (GS) primitives from RGB-D inputs. A cycle-aggregative self-supervised strategy fuses representations from canonical and novel views via complementary aggregation and cycle supervision, enforcing cross-view consistency without multi-view supervision. To improve wide-angle renders, the method applies geometry-guided texture refinement: a video inpainting model is driven by artifact masks derived from alpha and normal maps. On ImageNet and other datasets, F3D-Gaus reports state-of-the-art FID and NFS scores and competitive depth quality, with faster training and inference than prior methods, indicating strong generalization and practical applicability for scalable 3D content generation.
Abstract
This paper tackles the problem of generalizable 3D-aware generation from monocular datasets, e.g., ImageNet. The key challenge of this task is learning a robust 3D-aware representation without multi-view or dynamic data while keeping texture and geometry consistent across viewpoints. Although some existing methods are capable of 3D-aware generation, the quality of their generated images still lags behind state-of-the-art 2D generation approaches, which produce high-quality, detailed images. To address this limitation, we propose a novel feed-forward pipeline based on pixel-aligned Gaussian Splatting, termed F3D-Gaus, which produces more realistic and reliable 3D renderings from monocular inputs. In addition, we introduce a self-supervised cycle-aggregative constraint to enforce cross-view consistency in the learned 3D representation. This training strategy naturally allows multiple aligned sets of Gaussian primitives to be aggregated and significantly alleviates the interpolation limitations inherent in single-view pixel-aligned Gaussian Splatting. Furthermore, we incorporate video model priors to perform geometry-aware refinement, which improves fine detail under wide viewpoint changes and strengthens the model's ability to capture intricate 3D textures. Extensive experiments demonstrate that our approach not only achieves high-quality, multi-view consistent 3D-aware generation from monocular datasets, but also significantly improves training and inference efficiency.
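
To make the terms "pixel-aligned" and "aggregation" concrete, the following is a minimal illustrative sketch, not the F3D-Gaus implementation. It assumes a pinhole camera with hypothetical intrinsics (fx, fy, cx, cy), places one Gaussian center per pixel by unprojecting a depth map, and merges the centers from two views in a shared world frame. The function names, the rigid-transform convention, and the toy values are all assumptions made for illustration; the full primitives would also carry predicted scale, rotation, opacity, and color, which are omitted here.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def unproject_depth(depth: Sequence[Sequence[float]],
                    fx: float, fy: float, cx: float, cy: float) -> List[Point]:
    """Lift each pixel (u, v) with depth d to a camera-space 3D point.

    Assumes a pinhole model (an illustrative assumption): one Gaussian
    center per pixel, hence "pixel-aligned".
    """
    centers = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            x = (u - cx) * d / fx
            y = (v - cy) * d / fy
            centers.append((x, y, d))
    return centers


def to_world(points: Sequence[Point],
             rotation: Sequence[Sequence[float]],
             translation: Sequence[float]) -> List[Point]:
    """Apply a camera-to-world rigid transform: p_world = R p_cam + t."""
    out = []
    for p in points:
        out.append(tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        ))
    return out


def aggregate(*center_sets: Sequence[Point]) -> List[Point]:
    """Merge per-view Gaussian centers into one set in the world frame."""
    merged: List[Point] = []
    for s in center_sets:
        merged.extend(s)
    return merged


# Toy example: a 2x2 depth map seen from two hypothetical camera poses.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_depth = [[2.0, 2.0], [2.0, 2.0]]
cam_centers = unproject_depth(toy_depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
canonical = to_world(cam_centers, IDENTITY, [0.0, 0.0, 0.0])
novel = to_world(cam_centers, IDENTITY, [1.0, 0.0, 0.0])
merged = aggregate(canonical, novel)
print(len(merged))
```

In this toy setting each view contributes four centers, so the merged set holds eight. The point of the sketch is only that primitives predicted from different views can be concatenated once they share a coordinate frame, which is what lets a second view cover regions the first view leaves sparse.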
