Fus3D: Decoding Consolidated 3D Geometry from Feed-forward Geometry Transformer Latents

Laura Fink, Linus Franke, George Kopanas, Marc Stamminger, Peter Hedman

Abstract

We propose a feed-forward method for dense Signed Distance Field (SDF) regression from unstructured image collections in less than three seconds, without camera calibration or post-hoc fusion. Our key insight is that the intermediate feature space of pretrained multi-view feed-forward geometry transformers already encodes a powerful joint world representation; yet existing pipelines discard it, routing features through per-view prediction heads before assembling 3D geometry post-hoc, which loses valuable completeness information and accumulates inaccuracies. We instead perform 3D extraction directly from geometry transformer features via learned volumetric extraction: voxelized canonical embeddings that progressively absorb multi-view geometry information through interleaved cross- and self-attention into a structured volumetric latent grid. A simple convolutional decoder then maps this grid to a dense SDF. We additionally propose a scalable, validity-aware supervision scheme that directly uses SDFs derived from depth maps or 3D assets, tackling practical issues like non-watertight meshes. Our approach yields complete and well-defined distance values across sparse- and dense-view settings and demonstrates geometrically plausible completions. Code and further material can be found at https://lorafib.github.io/fus3d.
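To make the extraction stage concrete, below is a minimal PyTorch sketch of the idea described in the abstract, not the authors' implementation: all class names, the latent dimension, the voxel resolution, and the block count are illustrative assumptions. A learned canonical voxel embedding queries multi-view transformer features via 2D-to-3D cross-attention, spreads the absorbed information with 3D self-attention, and a small convolutional decoder maps the resulting latent grid to a dense SDF.

```python
# Minimal sketch of the volumetric extraction idea, NOT the authors' implementation.
# All names and hyperparameters (dim, grid resolution, block count) are illustrative.
import torch
import torch.nn as nn


class ExtractionBlock(nn.Module):
    """One interleaved block: 2D-to-3D cross-attention, then 3D self-attention."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q, self.norm_s = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z3d: torch.Tensor, z2d: torch.Tensor) -> torch.Tensor:
        # z3d: (B, R^3, C) voxel tokens; z2d: (B, N, C) image tokens from one stage.
        q = self.norm_q(z3d)
        z3d = z3d + self.cross_attn(q, z2d, z2d, need_weights=False)[0]
        s = self.norm_s(z3d)
        z3d = z3d + self.self_attn(s, s, s, need_weights=False)[0]
        return z3d + self.mlp(z3d)


class VolumetricExtractor(nn.Module):
    """Learned canonical voxel embedding -> structured latent grid -> dense SDF."""

    def __init__(self, dim: int = 256, res: int = 8, num_blocks: int = 4):
        super().__init__()
        self.res = res
        self.canonical = nn.Parameter(0.02 * torch.randn(res ** 3, dim))
        self.blocks = nn.ModuleList([ExtractionBlock(dim) for _ in range(num_blocks)])
        self.decoder = nn.Sequential(              # simple conv decoder, 4x upsampling
            nn.ConvTranspose3d(dim, dim // 2, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(dim // 2, dim // 4, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(dim // 4, 1, 3, padding=1))  # one SDF value per output voxel

    def forward(self, z2d_stages: list) -> torch.Tensor:
        # z2d_stages: intermediate 2D features tapped from several stages of the
        # geometry transformer, each of shape (B, N_tokens, C).
        batch = z2d_stages[0].shape[0]
        z3d = self.canonical.unsqueeze(0).expand(batch, -1, -1)
        for block, z2d in zip(self.blocks, z2d_stages):   # interleave over stages
            z3d = block(z3d, z2d)
        grid = z3d.transpose(1, 2).reshape(batch, -1, self.res, self.res, self.res)
        return self.decoder(grid)                         # (B, 1, 4R, 4R, 4R)


# Toy usage with random tensors standing in for the transformer latents:
stages = [torch.randn(1, 1024, 256) for _ in range(4)]
sdf_grid = VolumetricExtractor()(stages)
print(sdf_grid.shape)  # torch.Size([1, 1, 32, 32, 32])
```

How many backbone stages are tapped, at which voxel resolution, and how the decoder is shaped are design choices of the paper that this sketch does not attempt to reproduce; it only mirrors the cross-attention/self-attention interleaving and the convolutional SDF decoding described above.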

Paper Structure

This paper contains 55 sections, 4 equations, 22 figures, 4 tables.

Figures (22)

  • Figure 1: Fus3D directly regresses a 3D representation from the latent space of a multi-view geometry transformer, bypassing per-view prediction and post-hoc fusion. Compared to VGGT wang2025vggt with TSDF fusion, this yields improved surface completeness under sparse views (left, 2 views) and avoids the error accumulation that reduces detail at scale (right, 24 views), which is also reflected in the F-score curves across view counts (center).
  • Figure 2: Juxtaposition of geometric extraction methodologies. Common pipelines like VGGT wang2025vggt (top) route transformer features through per-view 2D decoder heads, discarding the joint multi-view representation before 3D assembly. We instead extract dense 3D features directly from the transformer's intermediate feature space, preserving the full multi-view information.
  • Figure 3: Architecture of Fus3D: The geometry transformer $\mathcal{G}$ (beige) processes tokenized input images, yielding a list of 2D intermediate features $\{z^b_\textrm{2D}\}^B_{b=1}$ extracted from different stages. The extraction transformer $\mathcal{E}$ (orange) leverages 2D-to-3D cross attention (green) to absorb the 3D information into features of the learned canonical embedding $z_\textrm{3D}$, and distributes this information via 3D self-attention (blue) throughout the volumetric latent. The head $\mathcal{H}_{3D}$ decodes the resulting structured latent $\hat{z}_\textrm{3D}$ into a dense SDF grid.
  • Figure 4: Qualitative results on DTU. Beige boxes indicate input views. Numbers in brackets correspond to image indices of "favorable" (23, 24, 33) and "unfavorable" (1, 16, 36) view combinations. (Following the baselines' evaluation protocols, only visible geometry is evaluated.)
  • Figure 5: Comparison against ground-truth TSDF isosurfaces (green) on 170 Objaverse test scenes as a function of the number of input views. Left: F-scores at threshold $0.5\epsilon$. Right: Chamfer distances. (Both metrics are sketched in code after this list.)
  • ...and 17 more figures
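For reference, the metrics reported in Figure 5 can be read with their standard point-cloud definitions. The sketch below uses those standard formulas and is not necessarily the exact evaluation protocol of the paper: the symmetric Chamfer distance sums the mean nearest-neighbor distances in both directions, and the F-score at threshold $\tau$ is the harmonic mean of precision (predicted points within $\tau$ of the ground truth) and recall (ground-truth points within $\tau$ of the prediction).

```python
# Standard point-cloud metrics (Chamfer distance, F-score at threshold tau);
# a reference sketch only, not necessarily the paper's exact evaluation protocol.
import numpy as np
from scipy.spatial import cKDTree


def chamfer_and_fscore(pred: np.ndarray, gt: np.ndarray, tau: float):
    """pred (N, 3), gt (M, 3): point samples of predicted and ground-truth surfaces."""
    d_pred = cKDTree(gt).query(pred)[0]   # distance of each predicted point to GT
    d_gt = cKDTree(pred).query(gt)[0]     # distance of each GT point to the prediction
    chamfer = d_pred.mean() + d_gt.mean()
    precision = (d_pred < tau).mean()     # accuracy: predicted points close to GT
    recall = (d_gt < tau).mean()          # completeness: GT points that are covered
    fscore = 2 * precision * recall / max(precision + recall, 1e-8)
    return chamfer, fscore


# Toy usage with random samples and an arbitrary threshold:
pred_pts, gt_pts = np.random.rand(2048, 3), np.random.rand(2048, 3)
print(chamfer_and_fscore(pred_pts, gt_pts, tau=0.05))
```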