Off The Grid: Detection of Primitives for Feed-Forward 3D Gaussian Splatting

Arthur Moreau, Richard Shaw, Michal Nazarczuk, Jisu Shin, Thomas Tanay, Zhensong Zhang, Songcen Xu, Eduardo Pérez-Pellitero

TL;DR

This work tackles the inefficiency and limited fidelity of pixel-aligned Gaussian primitives in feed-forward 3D Gaussian Splatting. It introduces Off-The-Grid Gaussians, a sub-pixel primitive-detection mechanism coupled with a multi-density decoder that adaptively allocates primitives across image patches, and a self-supervised training loop built on a VGGT backbone to place Gaussians on a predicted geometry. Key contributions include a detection-based 3D Gaussian decoder, adaptive density control, and a teacher-regularized, self-supervised refinement that improves camera pose estimation while reducing artifacts. The approach achieves state-of-the-art novel view synthesis with far fewer primitives and demonstrates potential for self-supervised improvement of 3D foundation models without annotated data.

Abstract

Feed-forward 3D Gaussian Splatting (3DGS) models enable real-time scene generation but are hindered by suboptimal pixel-aligned primitive placement, which relies on a dense, rigid grid and limits both quality and efficiency. We introduce a new feed-forward architecture that detects 3D Gaussian primitives at a sub-pixel level, replacing the pixel grid with an adaptive, "Off The Grid" distribution. Inspired by keypoint detection, our multi-resolution decoder learns to distribute primitives across image patches. This module is trained end-to-end with a 3D reconstruction backbone using self-supervised learning. Our resulting pose-free model generates photorealistic scenes in seconds, achieving state-of-the-art novel view synthesis for feed-forward models. It outperforms competitors while using far fewer primitives, demonstrating a more accurate and efficient allocation that captures fine details and reduces artifacts. Moreover, we observe that by learning to render 3D Gaussians, our 3D reconstruction backbone improves camera pose estimation, suggesting opportunities to train these foundational models without labels.
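The abstract describes detecting primitives at sub-pixel positions inside image patches, in the spirit of keypoint detection. One common way to read a continuous position out of a per-patch heatmap is a soft-argmax (the softmax-weighted mean of the pixel coordinates). The sketch below assumes that operator purely for illustration; nothing here confirms it is the authors' exact choice, and the names `soft_argmax_2d`, `detect_in_patch`, and `temperature` are hypothetical.

```python
import math


def soft_argmax_2d(heatmap, temperature=1.0):
    """Softmax-weighted mean (x, y) of a 2D heatmap, in pixel units.

    Assumption: integer coordinates index pixel centers, x runs along
    columns and y along rows.
    """
    peak = max(max(row) for row in heatmap)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [[math.exp((v - peak) / temperature) for v in row] for row in heatmap]
    total = sum(sum(row) for row in weights)
    x = sum(w * j for row in weights for j, w in enumerate(row)) / total
    y = sum(w * i for i, row in enumerate(weights) for w in row) / total
    return x, y


def detect_in_patch(heatmap, patch_origin, temperature=1.0):
    """Place one primitive inside a patch and return image-space (x, y).

    `patch_origin` is the (x, y) image coordinate of the patch's top-left
    pixel (hypothetical convention for this sketch).
    """
    ox, oy = patch_origin
    x, y = soft_argmax_2d(heatmap, temperature)
    return ox + x, oy + y


if __name__ == "__main__":
    # Two equally strong responses in neighbouring pixels: the detection
    # lands between them rather than snapping to the pixel grid.
    patch = [
        [0.0, 0.0, 0.0],
        [0.0, 5.0, 5.0],
        [0.0, 0.0, 0.0],
    ]
    print(detect_in_patch(patch, patch_origin=(16, 32), temperature=0.1))
```

Because the output is an expectation rather than an arg-max, it is differentiable with respect to the heatmap, which is what would allow such a detector to be trained end-to-end from a rendering loss.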

Paper Structure

This paper contains 32 sections, 1 equation, 9 figures, 5 tables.

Figures (9)

  • Figure 1: 3D Gaussian models obtained through different decoding strategies. By learning the position of primitives instead of using regular grids, our model represents the scene more accurately and uses fewer primitives. Models created from 6 images. Voxel-aligned uses AnySplat [jiang2025anysplat] and Pixel-aligned is an ablated version of our model.
  • Figure 2: Overview of our pose-free 3DGS framework. Our method processes depth and camera parameters from N input images with a large pretrained Multi-View transformer. Then, our 3D Gaussian decoder (described in Figure 3) predicts a local Gaussian model for each image, which is rendered and aggregated with other views. The pipeline is trained end-to-end to reconstruct input images, with geometry consistency and regularization losses.
  • Figure 3: Overview of our 3D Gaussian decoder architecture. Images, depth maps, and latent features are concatenated and fed to a U-Net CNN from which detection and description features are extracted. First, the positions of detected primitives are determined from convolutional heatmaps. Then, image, depth, and description features are bilinearly interpolated at those positions to decode Gaussian parameters through depth unprojection and an MLP.
  • Figure 4: Spatial distribution of detections across image patches. We observe the distribution of the heatmaps $H$ that are used to compute the detected Gaussians. For each density level, we display the average activation of each channel. Most Gaussians appear to operate on a local area of the patch, especially at low density. At high density, some channels specialize in borders or corners, while others have a widespread distribution that lets them be allocated dynamically to highly detailed areas.
  • Figure 5: Confidence maps depending on the number of views. Our model shows multi-view awareness when predicting confidence, removing primitives that are better observed in other views. In the example, one face of the cube is viewed from the side in image 4 and from the front in image 6. When the model sees image 6, Gaussians from image 4 are discarded. Green is high confidence.
  • ...and 4 more figures
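
The Figure 3 caption mentions that depth is bilinearly interpolated at the detected positions and then unprojected. A minimal sketch of those two steps follows, assuming a pinhole camera, z-depth (not ray distance), and integer coordinates at pixel centers. The function names and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative assumptions, not taken from the paper.

```python
def bilinear_sample(grid, x, y):
    """Bilinearly interpolate a 2D grid (list of rows) at continuous (x, y).

    Coordinates are clamped to the grid so border queries stay valid.
    """
    h, w = len(grid), len(grid[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * grid[y0][x0] + ax * grid[y0][x1]
    bottom = (1 - ax) * grid[y1][x0] + ax * grid[y1][x1]
    return (1 - ay) * top + ay * bottom


def unproject(x, y, depth, fx, fy, cx, cy):
    """Lift an image point with z-depth to camera-space (X, Y, Z), pinhole model."""
    return ((x - cx) / fx * depth, (y - cy) / fy * depth, depth)


if __name__ == "__main__":
    depth_map = [
        [1.0, 2.0],
        [3.0, 4.0],
    ]
    # A detection halfway between the four pixel centers.
    px, py = 0.5, 0.5
    z = bilinear_sample(depth_map, px, py)
    print(unproject(px, py, z, fx=1.0, fy=1.0, cx=0.5, cy=0.5))
```

Sampling depth at the continuous location, rather than rounding to the nearest pixel, is what lets a sub-pixel detection translate into a correspondingly fine 3D position.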