Any Resolution Any Geometry: From Multi-View To Multi-Patch

Wenqing Cui, Zhenyu Li, Mykola Lavreniuk, Jian Shi, Ramzi Idoughi, Xiangjun Tang, Peter Wonka

TL;DR

This paper proposes the Ultra Resolution Geometry Transformer (URGT), which adapts the Visual Geometry Grounded Transformer (VGGT) into a unified multi-patch transformer for monocular high-resolution depth-normal estimation. It offers an efficient and extensible solution for high-quality geometry refinement.

Abstract

Joint estimation of surface normals and depth is essential for holistic 3D scene understanding, yet high-resolution prediction remains difficult due to the trade-off between preserving fine local detail and maintaining global consistency. To address this challenge, we propose the Ultra Resolution Geometry Transformer (URGT), which adapts the Visual Geometry Grounded Transformer (VGGT) into a unified multi-patch transformer for monocular high-resolution depth-normal estimation. A single high-resolution image is partitioned into patches that are augmented with coarse depth and normal priors from pre-trained models and jointly processed in a single forward pass to predict refined geometric outputs. Global coherence is enforced through cross-patch attention, which enables long-range geometric reasoning and seamless propagation of information across patches within a shared backbone. To further enhance spatial robustness, we introduce a GridMix patch sampling strategy that probabilistically samples grid configurations during training, improving inter-patch consistency and generalization. Our method achieves state-of-the-art results on UnrealStereo4K, jointly improving depth and normal estimation: it reduces AbsRel from 0.0582 to 0.0291, RMSE from 2.17 to 1.31, and mean angular error from 23.36 degrees to 18.51 degrees, while producing sharper and more stable geometry. The proposed multi-patch framework also demonstrates strong zero-shot and cross-domain generalization and scales effectively to very high resolutions, offering an efficient and extensible solution for high-quality geometry refinement.
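
For readers unfamiliar with the three metrics quoted in the abstract, the following is a minimal sketch of their commonly used definitions in NumPy. The function names, the `gt > 0` validity mask, and the absence of any scale alignment or depth-range clipping are assumptions made for illustration; the paper's exact evaluation protocol is not described in this section.

```python
import numpy as np


def abs_rel(pred, gt, mask=None):
    """Mean absolute relative depth error: mean(|pred - gt| / gt)."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if mask is None:
        mask = gt > 0  # assumption: non-positive ground truth is invalid
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))


def rmse(pred, gt, mask=None):
    """Root mean squared depth error over valid pixels."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if mask is None:
        mask = gt > 0  # assumption: same validity rule as abs_rel
    return float(np.sqrt(np.mean((pred[mask] - gt[mask]) ** 2)))


def mean_angular_error_deg(pred_n, gt_n, eps=1e-8):
    """Mean angle (degrees) between normals stored as arrays of shape (..., 3)."""
    pred_n = np.asarray(pred_n, dtype=np.float64)
    gt_n = np.asarray(gt_n, dtype=np.float64)
    # Normalize both fields to unit length, guarding against zero vectors.
    pred_n = pred_n / np.maximum(np.linalg.norm(pred_n, axis=-1, keepdims=True), eps)
    gt_n = gt_n / np.maximum(np.linalg.norm(gt_n, axis=-1, keepdims=True), eps)
    # Clip the dot product so rounding error cannot push arccos out of domain.
    cos = np.clip(np.sum(pred_n * gt_n, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())
```

All three are error measures, so lower is better; the reported AbsRel change from 0.0582 to 0.0291 corresponds to halving the mean relative depth error.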

Paper Structure (33 sections, 12 equations, 13 figures, 6 tables)

This paper contains 33 sections, 12 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Multi-patch transformer pipeline for high-resolution geometry refinement. A coarse depth/normal prediction is first obtained and upsampled, after which the image and coarse outputs are patchified and encoded with DINOv2 and global positional embeddings. All patch tokens are jointly processed through transformer blocks with intra- and cross-patch attention, enabling global geometric reasoning. A DPT head predicts offsets that are fused with the coarse input to produce a globally consistent, detail-preserving high-resolution output.
  • Figure 2: The four GridMix patch sampling configurations used during training (from left to right): a single randomly sampled patch, a $2 \times 2$ grid of patches, a $3 \times 3$ grid of patches, and a $4 \times 4$ grid covering the entire image. The red grids show the patch arrangements, while the green regions mark the valid positions for the randomly selected top-left corner of the grid, which keep the entire grid within the image boundaries.
  • Figure 3: Qualitative comparison on the UnrealStereo4K dataset. Red rectangles indicate the bounding boxes of zoomed-in regions. The second and fourth rows show close-up (zoom-in) views corresponding to the first and third rows, respectively. The rightmost column presents the RGB input and ground-truth depth maps. Compared to previous methods, our predictions exhibit smoother depth continuity and sharper geometry boundaries.
  • Figure 4: Qualitative comparison on the Zero-Shot depth estimation task. Each row shows samples from different scenes. The rightmost column presents the RGB input. Compared to previous models, our zero-shot predictions demonstrate improved geometric consistency and sharper depth boundaries across diverse domains.
  • Figure 5: Qualitative comparison of surface normal estimation on the UnrealStereo4K dataset. Our method produces more accurate and spatially consistent surface normals compared to Metric3D V2.
  • ...and 8 more figures
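
The GridMix sampling described in the Figure 2 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' implementation: the function name `sample_gridmix`, the uniform default probabilities over grid sizes, and the choice of a patch size equal to one quarter of the image in each dimension (so that the $4 \times 4$ grid covers the whole image) are all assumptions.

```python
import random


def sample_gridmix(img_h, img_w, patch_h, patch_w,
                   grid_sizes=(1, 2, 3, 4), probs=None, rng=random):
    """Sample a k x k grid of patches that lies fully inside the image.

    Returns (k, boxes), where each box is (top, left, height, width).
    `probs` are hypothetical sampling weights; None means uniform.
    """
    # Pick the grid configuration (1 = single patch, 4 = 4 x 4 grid).
    k = rng.choices(grid_sizes, weights=probs, k=1)[0]
    grid_h, grid_w = k * patch_h, k * patch_w
    if grid_h > img_h or grid_w > img_w:
        raise ValueError("grid does not fit inside the image")

    # Valid top-left corners (the green region in the caption): any
    # offset that keeps the whole grid within the image boundaries.
    top = rng.randint(0, img_h - grid_h)    # randint is inclusive
    left = rng.randint(0, img_w - grid_w)

    boxes = [
        (top + r * patch_h, left + c * patch_w, patch_h, patch_w)
        for r in range(k)
        for c in range(k)
    ]
    return k, boxes


if __name__ == "__main__":
    # Example: a 3840 x 2160 image with patches of one quarter the size.
    k, boxes = sample_gridmix(2160, 3840, 540, 960, rng=random.Random(0))
    print(k, boxes[0])
```

Under these assumptions, the valid region for the top-left corner shrinks as the grid grows, and for the $4 \times 4$ case it collapses to the single position (0, 0), which matches the caption's description of that grid covering the entire image.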