UFO-4D: Unposed Feedforward 4D Reconstruction from Two Images

Junhwa Hur, Charles Herrmann, Songyou Peng, Philipp Henzler, Zeyu Ma, Todd Zickler, Deqing Sun

TL;DR

UFO-4D is a unified feedforward framework that reconstructs a dense, explicit 4D representation from just a pair of unposed images, enabling a self-supervised image synthesis loss while tightly coupling appearance, depth, and motion.

Abstract

Dense 4D reconstruction from unposed images remains a critical challenge, with current methods relying on slow test-time optimization or fragmented, task-specific feedforward models. We introduce UFO-4D, a unified feedforward framework to reconstruct a dense, explicit 4D representation from just a pair of unposed images. UFO-4D directly estimates dynamic 3D Gaussian Splats, enabling the joint and consistent estimation of 3D geometry, 3D motion, and camera pose in a feedforward manner. Our core insight is that differentiably rendering multiple signals from a single Dynamic 3D Gaussian representation offers major training advantages. This approach enables a self-supervised image synthesis loss while tightly coupling appearance, depth, and motion. Since all modalities share the same geometric primitives, supervising one inherently regularizes and improves the others. This synergy overcomes data scarcity, allowing UFO-4D to outperform prior work by up to 3 times in joint geometry, motion, and camera pose estimation. Our representation also enables high-fidelity 4D interpolation across novel views and time. Please visit our project page for visual results: https://ufo-4d.github.io/

Paper Structure (41 sections, 7 equations, 15 figures, 12 tables)

Figures (15)

  • Figure 1: Given a pair of unposed images, the proposed UFO-4D outputs dynamic 3D Gaussians in the canonical space and the relative camera pose in a feedforward manner. This explicit 4D representation supports various downstream tasks, such as estimating 3D geometry (points, depth) and motion (scene flow, optical flow). It can also interpolate images, geometry, and motion at novel views and times.
  • Figure 2: Network architecture. Given a pair of input images $\mathbf{I}_t$ and $\mathbf{I}_{t+1}$ and camera intrinsics $\mathbf{K}$, UFO-4D outputs the parameters of dynamic 3D Gaussians and the relative camera pose in a feedforward manner. Given these estimates, UFO-4D can render images, points, and motion at any interpolated time and view. While the intrinsic token requires the real camera intrinsics, the pose token is a learnable parameter, so no pose input is needed at inference time.
  • Figure 3: (a) Each Gaussian is translated by its motion to represent the 3D scene at time $t+\Delta t$. (b) Points and motion are rasterized together with the image.
  • Figure 4: Qualitative comparison of depth and projected 2D optical flow on Stereo4D, Bonn, and KITTI. On KITTI, motion is visualized relative to the camera, matching how the ground truth is defined. Unlike DynaDUSt3R, ZeroMSF, and St4RTrack, which suffer from residual motion in static regions and inaccurate motion at object boundaries, UFO-4D exhibits clear motion boundaries and a clean separation between moving objects and the background. More qualitative results are in the supplementary material.
  • Figure 5: Opacity as learnable confidence: Opacity maps show the model's behavior in a (dis)occlusion scenario. Our model learns to assign high confidence (opacity) to disoccluded regions, and for mutually-visible regions, it selects only one corresponding Gaussian from the two views, enabling an efficient and compact 4D representation.
  • ...and 10 more figures
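
The captions of Figures 1 and 3 describe translating each Gaussian by its motion to reach time $t+\Delta t$ and reading 2D optical flow off the same primitives. A minimal sketch of that idea follows, and it is not the authors' implementation. It assumes a linear motion model (center plus $\Delta t$ times a per-Gaussian 3D motion vector), a pinhole camera with intrinsics `K`, Gaussian centers already expressed in the camera frame, and a static camera (identity relative pose). All function names and the numeric values are hypothetical.

```python
import numpy as np

def translate_gaussians(means, motions, dt):
    """Move Gaussian centers along their 3D motion vectors (assumed linear in dt)."""
    return means + dt * motions

def project(points, K):
    """Pinhole projection of Nx3 camera-space points to Nx2 pixel coordinates."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def induced_optical_flow(means, motions, K, dt=1.0):
    """2D flow induced by 3D motion, assuming the camera itself does not move."""
    moved = translate_gaussians(means, motions, dt)
    return project(moved, K) - project(means, K)

# Hypothetical intrinsics and two Gaussian centers: one moving, one static.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
means = np.array([[0.0, 0.0, 2.0],
                  [1.0, -0.5, 4.0]])
motions = np.array([[0.2, 0.0, 0.0],
                    [0.0, 0.0, 0.0]])

flow = induced_optical_flow(means, motions, K)          # full step, dt = 1
half_flow = induced_optical_flow(means, motions, K, 0.5)  # interpolated time
```

Because the flow is derived from the same centers that define the geometry, an error in depth shows up as an error in projected motion, which is one way to read the abstract's claim that supervising one modality regularizes the others. A moving camera would additionally require applying the estimated relative pose before projection.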