IUP-Pose: Decoupled Iterative Uncertainty Propagation for Real-time Relative Pose Regression via Implicit Dense Alignment v1

Jun Wang, Xiaoyan Huang

Abstract

Relative pose estimation is fundamental for SLAM, visual localization, and 3D reconstruction. Existing Relative Pose Regression (RPR) methods face a key trade-off: feature-matching pipelines achieve high accuracy but block gradient flow via non-differentiable RANSAC, while ViT-based regressors are end-to-end trainable but prohibitively expensive for real-time deployment. We identify the core bottlenecks as the coupling between rotation and translation estimation and insufficient cross-view feature alignment. We propose IUP-Pose, a geometry-driven decoupled iterative framework with implicit dense alignment. A lightweight Multi-Head Bi-Cross Attention (MHBC) module aligns cross-view features without explicit matching supervision. The aligned features are processed by a decoupled rotation-translation pipeline: two shared-parameter rotation stages iteratively refine rotation with uncertainty, and feature maps are realigned via the rotational homography $H_\infty$ before translation prediction. IUP-Pose achieves 73.3% AUC@$20^\circ$ on MegaDepth1500 with full end-to-end differentiability, 70 FPS throughput, and only 37M parameters, demonstrating a favorable accuracy-efficiency trade-off for real-time edge deployment.
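The realignment step named in the abstract can be illustrated with the textbook infinite homography. The sketch below assumes the standard form $H_\infty = K_1 R K_0^{-1}$ for zero-skew pinhole cameras, which maps pixels of view 0 into view 1 under a pure rotation $R$. The warp direction, the helper names, and the use of pixel coordinates instead of feature-map coordinates are illustrative assumptions, not details confirmed by the paper.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def intrinsics(fx, fy, cx, cy):
    """Zero-skew pinhole intrinsic matrix K (an assumed camera model)."""
    return [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]


def intrinsics_inv(fx, fy, cx, cy):
    """Closed-form inverse of the zero-skew intrinsic matrix."""
    return [[1.0 / fx, 0.0, -cx / fx],
            [0.0, 1.0 / fy, -cy / fy],
            [0.0, 0.0, 1.0]]


def rotation_z(angle_rad):
    """Rotation about the optical (z) axis, used here only as a test case."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def infinite_homography(k1, rot, k0_inv):
    """H_inf = K1 * R * K0^{-1}; R takes camera-0 rays to camera-1 rays."""
    return matmul3(matmul3(k1, rot), k0_inv)


def warp_point(h, x, y):
    """Apply a homography to a pixel and dehomogenize the result."""
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return u / w, v / w


# Hypothetical camera: focal length 500 px, principal point (320, 240).
cam = (500.0, 500.0, 320.0, 240.0)
h_inf = infinite_homography(intrinsics(*cam),
                            rotation_z(math.pi / 2),
                            intrinsics_inv(*cam))
# A pixel 10 px to the right of the principal point is rotated by 90 degrees
# about the optical axis, ending up 10 px along the image y-axis instead.
u, v = warp_point(h_inf, 330.0, 240.0)
```

Because $H_\infty$ depends only on the intrinsics and the rotation, applying it removes the rotational component of the disparity without needing depth, so the remaining misalignment between the two views is attributable to translation.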

Paper Structure

This paper contains 15 sections, 15 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Speed-accuracy trade-off on MegaDepth1500. We evaluate our method against state-of-the-art relative pose estimators. Inference speeds are measured on a single NVIDIA RTX 4090 GPU and reported as latency per image pair (lower is better). Blue markers represent correspondence-based methods; red markers denote regression-based approaches. IUP-Pose achieves the lowest latency, 14.3 ms (70 FPS), demonstrating superior efficiency while maintaining a competitive AUC@$20^\circ$ of 73.3%.
  • Figure 2: Overall architecture of IUP-Pose. Our framework adopts a decoupled strategy with three main components: (1) Input & Encoder: RGB images concatenated with normalized coordinates form 5-channel inputs, which a ResNet encoder processes to extract multi-scale features. (2) Implicit Dense Alignment (IDA): SPPF [khanam2024yolov5] and multi-head bi-cross attention (MHBC) reduce the cross-view domain shift between features. (3) Decoupled Rotation-Translation Estimation: the rotation decoder (RD) iteratively refines $\mathbf{R}$ and $\sigma_R$; the homography warp (HW) eliminates rotational disparity between views; rotation fusion (RF) produces the final $\mathbf{R}_{final}$; the translation decoder (TD) estimates $\mathbf{t}_{final}$.
  • Figure 3: Visual disparities in MegaDepth dataset. Representative image pairs from MegaDepth exhibiting significant visual challenges: (a) drastic viewpoint changes, (b) inconsistent camera intrinsics, (c) illumination differences, and (d) partial occlusions.
  • Figure 4: Decoder architecture. The rotation and translation decoders share this architecture. (a) Overall decoder: multi-view features $\mathbf{F}_0$ and $\mathbf{F}_1$ pass through an MoE adapter (2 experts) and View Fusion to produce fused features ($B \times C \times H \times W$). FiLM conditioning on the camera intrinsics $K$, input pose $R_{in}$, and uncertainty $\sigma_{in}$ is followed by convolutional layers and pooling to output $\mathbf{R}_{out}$ (or $\mathbf{t}_{out}$) together with $\sigma_{out}$. A residual connection (dashed red) preserves input information. (b) View Fusion: features undergo 4 iterations of Robust Bottleneck Blocks with V-aware gates, where the kernel size $k_v$ grows from 1 to 3 for progressive cross-view mixing. After depthwise separable refinement and residual addition (dashed red), V-wise softmax attention fuses the views into a $B \times C \times H \times W$ output.
  • Figure : (a) IDA attention heatmaps
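
The 5-channel input described in the Figure 2 caption (RGB plus normalized coordinates) can be sketched as follows. The normalization used here, pixel centers mapped to $[-1, 1]$ along each axis, is an assumption made for illustration; the caption does not state the convention, and the paper may instead normalize with the camera intrinsics. The function names and the channel-first nested-list layout are likewise hypothetical.

```python
def coord_channels(height, width):
    """Build x and y coordinate planes with pixel centers scaled to [-1, 1].

    Assumed convention: column j maps to 2 * (j + 0.5) / width - 1,
    and row i maps to 2 * (i + 0.5) / height - 1.
    """
    xs = [2.0 * (j + 0.5) / width - 1.0 for j in range(width)]
    ys = [2.0 * (i + 0.5) / height - 1.0 for i in range(height)]
    x_map = [list(xs) for _ in range(height)]   # same row repeated
    y_map = [[y] * width for y in ys]           # constant along each row
    return x_map, y_map


def add_coord_channels(rgb):
    """Append the two coordinate planes to a channel-first RGB image.

    rgb is a list of 3 planes, each a height x width nested list.
    The result has 5 planes: R, G, B, x, y.
    """
    height, width = len(rgb[0]), len(rgb[0][0])
    x_map, y_map = coord_channels(height, width)
    return list(rgb) + [x_map, y_map]


# Usage on a tiny 2 x 4 black image.
image = [[[0.0] * 4 for _ in range(2)] for _ in range(3)]
five_channel = add_coord_channels(image)
```

Feeding pixel positions alongside color gives a convolutional encoder, which is otherwise translation-equivariant, direct access to where each feature lies in the image, which is relevant for a task that depends on image geometry.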