Off The Grid: Detection of Primitives for Feed-Forward 3D Gaussian Splatting
Arthur Moreau, Richard Shaw, Michal Nazarczuk, Jisu Shin, Thomas Tanay, Zhensong Zhang, Songcen Xu, Eduardo Pérez-Pellitero
TL;DR
This work tackles the inefficiency and limited fidelity of pixel-aligned Gaussian primitives in feed-forward 3D Gaussian Splatting. It introduces Off-The-Grid Gaussians: a sub-pixel primitive-detection mechanism, a multi-density decoder that adaptively allocates primitives across image patches, and a self-supervised training loop built on a VGGT backbone, on whose predicted geometry the Gaussians are placed. Key contributions include a detection-based 3D Gaussian decoder, adaptive density control, and a teacher-regularized, self-supervised refinement that improves camera pose estimation while reducing artifacts. The approach achieves state-of-the-art novel view synthesis among feed-forward models while using far fewer primitives than competing methods, and it points to self-supervised improvement of 3D foundation models without annotated data.
Abstract
Feed-forward 3D Gaussian Splatting (3DGS) models enable real-time scene generation but are hindered by suboptimal pixel-aligned primitive placement, which relies on a dense, rigid grid and limits both quality and efficiency. We introduce a new feed-forward architecture that detects 3D Gaussian primitives at a sub-pixel level, replacing the pixel grid with an adaptive, "Off The Grid" distribution. Inspired by keypoint detection, our multi-resolution decoder learns to distribute primitives across image patches. This module is trained end-to-end with a 3D reconstruction backbone using self-supervised learning. Our resulting pose-free model generates photorealistic scenes in seconds, achieving state-of-the-art novel view synthesis for feed-forward models. It outperforms competitors while using far fewer primitives, demonstrating a more accurate and efficient allocation that captures fine details and reduces artifacts. Moreover, we observe that by learning to render 3D Gaussians, our 3D reconstruction backbone improves camera pose estimation, suggesting opportunities to train these foundational models without labels.
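
To make the idea of sub-pixel detection concrete, the sketch below shows a soft-argmax, a standard device from keypoint detection that turns a patch of scores into a continuous (x, y) location instead of snapping to the nearest pixel. It is an illustration only, not the decoder described in the abstract: the function name `soft_argmax_2d`, the `temperature` parameter, and the toy 4x4 patch are all assumptions introduced here.

```python
import math


def soft_argmax_2d(logits, temperature=1.0):
    """Return the softmax-weighted mean (x, y) position over a 2D score patch.

    `logits` is a list of rows of floats. Coordinates are in pixel units,
    with (0, 0) at the centre of the top-left cell. This is a hypothetical
    helper for illustration, not the paper's detection head.
    """
    # Subtract the maximum score before exponentiating, for numerical stability.
    peak = max(v for row in logits for v in row)
    weights = [[math.exp((v - peak) / temperature) for v in row] for row in logits]
    total = sum(sum(row) for row in weights)

    # Expected column index (x) and row index (y) under the softmax weights.
    x = sum(w * j for row in weights for j, w in enumerate(row)) / total
    y = sum(w * i for i, row in enumerate(weights) for w in row) / total
    return x, y


# Toy patch (assumed values): two equally strong responses in neighbouring columns.
patch = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 8.0, 8.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
x, y = soft_argmax_2d(patch)
```

Here the estimated x lands halfway between the two responding columns, a position that no pixel-aligned grid can express. Because the output is a differentiable function of the scores, a detector of this kind can be trained end-to-end through a rendering loss, which is the general property a detection-based decoder relies on.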
