DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation

Tuan Duc Ngo, Jiahui Huang, Seoung Wug Oh, Kevin Blackburn-Matzen, Evangelos Kalogerakis, Chuang Gan, Joon-Young Lee

TL;DR

DAGE is a dual-stream transformer that disentangles global coherence from fine detail, establishing new state-of-the-art results for video geometry estimation and multi-view reconstruction.

Abstract

Estimating accurate, view-consistent geometry and camera poses from uncalibrated multi-view/video inputs remains challenging, especially at high spatial resolutions and over long sequences. We present DAGE, a dual-stream transformer whose main novelty is to disentangle global coherence from fine detail. A low-resolution stream operates on aggressively downsampled frames with alternating frame/global attention to build a view-consistent representation and estimate cameras efficiently, while a high-resolution stream processes the original images per-frame to preserve sharp boundaries and small structures. A lightweight adapter fuses these streams via cross-attention, injecting global context without disturbing the pretrained single-frame pathway. This design scales resolution and clip length independently, supports inputs up to 2K, and maintains practical inference cost. DAGE delivers sharp depth/pointmaps, strong cross-view consistency, and accurate poses, establishing new state-of-the-art results for video geometry estimation and multi-view reconstruction.
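
To make the scaling argument concrete, the following back-of-the-envelope sketch counts query-key pairs per attention layer when global attention runs only on low-resolution tokens and high-resolution tokens attend only within their own frame. The patch size of 14, the frame sizes, and all function names are illustrative assumptions, not values taken from the paper.

```python
def num_tokens(height, width, patch=14):
    """Patch tokens for one frame, assuming a ViT-style patch grid (patch size is assumed)."""
    return (height // patch) * (width // patch)


def attention_pairs(groups, tokens_per_group):
    """Query-key pairs when self-attention runs separately inside each group."""
    return groups * tokens_per_group ** 2


def dual_stream_pairs(frames, hr_hw, lr_hw, patch=14):
    """Pairs per layer: global attention at low res plus per-frame attention at high res."""
    lr = num_tokens(*lr_hw, patch)
    hr = num_tokens(*hr_hw, patch)
    global_lr = attention_pairs(1, frames * lr)  # all frames attend jointly
    frame_hr = attention_pairs(frames, hr)       # each frame attends only to itself
    return global_lr + frame_hr


def single_stream_pairs(frames, hr_hw, patch=14):
    """Pairs per layer if global attention ran directly on high-res tokens."""
    return attention_pairs(1, frames * num_tokens(*hr_hw, patch))


frames = 100
hr_hw = (1092, 1932)  # hypothetical ~2K frame, both sides multiples of 14
lr_hw = (252, 448)    # hypothetical downsampled frame
dual = dual_stream_pairs(frames, hr_hw, lr_hw)
single = single_stream_pairs(frames, hr_hw)
print(f"dual-stream / single-stream attention pairs: {dual / single:.4f}")
```

Under these assumptions the high-resolution term grows only linearly with the number of frames, and the quadratic global term depends only on the low-resolution token count, which is one way to read the claim that resolution and clip length scale independently.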

Paper Structure (35 sections, 8 equations, 17 figures, 16 tables)

Figures (17)

  • Figure 1: DAGE produces high-resolution, fine-grained, metric-scale and cross-view consistent 3D geometry together with accurate camera poses from visual inputs. It runs substantially faster than prior models (Pi3, VGGT) and scales to long sequences (up to 1000 frames).
  • Figure 2: Overview of DAGE. Given a set of unposed RGB images, the model predicts per-frame pointmaps and camera poses, plus a scene-wise metric scale. The architecture has two parallel streams: (i) a low-resolution stream (lower part) that processes downsampled inputs to aggregate global context and regress poses/scene scale; and (ii) a high-resolution stream (upper part) that processes frames independently at native resolution to preserve fine detail. A lightweight Adapter fuses LR and HR tokens before the dense geometry head.
  • Figure 3: The Global transformer (left) operates on low-resolution inputs with alternating global and frame-wise attention; during training, feature distillation compensates for aggressive downsampling. The Adapter (right) stacks cross- and self-attention blocks to fuse multi-view–consistent LR tokens into the HR stream.
  • Figure 4: Visual comparison of video depth on in-the-wild scenes. We convert the depth map to a disparity map for better visualization, and zoom in (red bounding boxes) to emphasize details. DAGE preserves sharp boundaries and fine-grained detail, especially for thin structures and small or distant objects, outperforming the diffusion-based baseline GeometryCrafter.
  • Figure 5: Visual comparison of 3D reconstruction on in-the-wild scenes. Compared to VGGT and Pi3, DAGE achieves comparable multi-view consistency while preserving markedly finer detail (green boxes).
  • ...and 12 more figures
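
The fusion step described in the Figure 2 and Figure 3 captions can be illustrated with a minimal single-head cross-attention sketch in which high-resolution tokens act as queries over low-resolution tokens. This is a toy under stated assumptions, not the paper's Adapter: learned projections and the self-attention blocks are omitted, and the scalar `gate` is a hypothetical device showing one common way to add context residually so that the original tokens pass through unchanged when the gate is zero.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(hr_tokens, lr_tokens, gate=0.0):
    """Each HR token queries all LR tokens; the attended context is added residually.

    `gate` is a hypothetical scalar: with gate=0.0 the HR tokens are returned unchanged.
    Learned query/key/value projections are omitted for brevity.
    """
    dim = len(hr_tokens[0])
    fused = []
    for query in hr_tokens:
        # Scaled dot-product scores between this HR query and every LR key.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in lr_tokens]
        weights = softmax(scores)
        # Weighted average of LR tokens (used here as both keys and values).
        context = [sum(w * value[d] for w, value in zip(weights, lr_tokens))
                   for d in range(dim)]
        fused.append([q + gate * c for q, c in zip(query, context)])
    return fused


hr = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy high-resolution tokens for one frame
lr = [[2.0, 0.0], [0.0, 2.0]]              # toy low-resolution tokens carrying global context
unchanged = cross_attend(hr, lr, gate=0.0)
fused = cross_attend(hr, lr, gate=1.0)
```

With `gate=0.0` the output equals the input HR tokens, which mirrors the idea of injecting global context without disturbing a pretrained single-frame pathway; how DAGE actually achieves this is not specified in the text above.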