EAR-Net: Pursuing End-to-End Absolute Rotations from Multi-View Images

Yuzhen Liu; Qiulei Dong

TL;DR

Absolute rotation estimation from multi-view images is traditionally tackled by multi-stage pipelines that accumulate errors. EAR-Net introduces a fully end-to-end framework with two key modules: the Epipolar Confidence Graph Construction (ECGC) module, which learns pairwise relative rotations and their confidences, and the Confidence-Aware Rotation Averaging (CARA) module, which uses these confidences in a differentiable optimization to predict global rotations. Central to the approach are confidence-based losses, a confidence-aware initialization via a maximum spanning tree, and a Lie-algebra-based iterative optimization, which together enable robust end-to-end learning. Across ScanNet, DTU, and 7-Scenes, EAR-Net achieves state-of-the-art accuracy and speed, with strong cross-dataset generalization and resilience to outliers, making it practical for fast, reliable global camera rotation estimation without hand-crafted preprocessing.
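To make the idea of a confidence-aware initialization via a maximum spanning tree more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the convention R_j = R_ij R_i for a relative rotation on edge (i, j), uses scalar confidences as edge weights, and grows the tree with Prim's algorithm. The function name `init_absolute_rotations`, the helper `rot_z`, and the toy graph are hypothetical and chosen only for illustration.

```python
import heapq
import itertools

import numpy as np


def rot_z(theta):
    """Rotation by angle theta (radians) about the z-axis (toy helper)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def init_absolute_rotations(n, rel_rots, conf, root=0):
    """Propagate rotations along a maximum-confidence spanning tree.

    rel_rots: dict (i, j) -> 3x3 matrix R_ij, assumed to satisfy R_j = R_ij @ R_i.
    conf:     dict (i, j) -> scalar confidence of that edge.
    Returns a dict node -> 3x3 absolute rotation, with the root set to identity.
    """
    # Undirected adjacency; traversing an edge backwards uses the transpose.
    adj = {k: [] for k in range(n)}
    for (i, j), R_ij in rel_rots.items():
        c = conf[(i, j)]
        adj[i].append((c, j, R_ij))
        adj[j].append((c, i, R_ij.T))

    abs_rots = {root: np.eye(3)}
    tie = itertools.count()  # tie-breaker so the heap never compares arrays
    heap = []

    def push_edges(u):
        for c, v, R_uv in adj[u]:
            if v not in abs_rots:
                # Negate the confidence: heapq is a min-heap.
                heapq.heappush(heap, (-c, next(tie), u, v, R_uv))

    push_edges(root)
    while heap and len(abs_rots) < n:
        _, _, u, v, R_uv = heapq.heappop(heap)
        if v in abs_rots:
            continue
        abs_rots[v] = R_uv @ abs_rots[u]
        push_edges(v)
    return abs_rots


# Toy graph: four cameras rotated about z, plus one low-confidence outlier edge.
angles = [0.0, 0.3, 0.7, 1.2]
gt = {k: rot_z(a) for k, a in enumerate(angles)}
rel_rots = {
    (0, 1): gt[1] @ gt[0].T,
    (1, 2): gt[2] @ gt[1].T,
    (2, 3): gt[3] @ gt[2].T,
    (0, 2): gt[2] @ gt[0].T,
    (0, 3): rot_z(2.5),  # outlier: inconsistent with the ground truth
}
conf = {(0, 1): 0.9, (1, 2): 0.95, (2, 3): 0.9, (0, 2): 0.8, (0, 3): 0.1}
est = init_absolute_rotations(4, rel_rots, conf)
```

In this toy graph the outlier edge (0, 3) carries the lowest confidence, so the tree reaches camera 3 through edge (2, 3) instead and the outlier never influences the initialization. In a full pipeline, such an initialization would only be a starting point for the subsequent iterative refinement.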

Abstract

Absolute rotation estimation is an important topic in 3D computer vision. Existing works in the literature generally employ a multi-stage (at least two-stage) estimation strategy in which multiple independent operations (feature matching, two-view rotation estimation, and rotation averaging) are performed sequentially. However, such a multi-stage strategy inevitably accumulates the errors introduced by each operation and accordingly degrades the final estimate of the global rotations. To address this problem, we propose an End-to-end method for estimating Absolute Rotations from multi-view images based on deep neural Networks, called EAR-Net. The proposed EAR-Net consists of an epipolar confidence graph construction module and a confidence-aware rotation averaging module. The epipolar confidence graph construction module simultaneously predicts the pairwise relative rotations among the input images and their corresponding confidences, resulting in a weighted graph (called the epipolar confidence graph). Based on this graph, the confidence-aware rotation averaging module, which is differentiable, predicts the absolute rotations. Thanks to the confidences attached to the relative rotations, the proposed EAR-Net can effectively handle outlier cases. Experimental results on three public datasets demonstrate that EAR-Net outperforms state-of-the-art methods by a large margin in terms of accuracy and speed.

Paper Structure (20 sections, 9 equations, 10 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Relative Rotation Estimation
Rotation Averaging
Methodology
Architecture
Epipolar Confidence Graph Construction Module
Confidence-Aware Rotation Averaging Module
Training Strategy and Loss Function
Extension to Large-Scale Scenes
Experiments
Datasets and Evaluation Metrics
Implementation Details
Comparative Evaluation Under Basic Setup
Comparative Evaluation Under Large-Scale Setup
...and 5 more sections

Figures (10)

Figure 1: Pipeline of the traditional strategy in the literature for recovering absolute rotations from a given set of images. It generally consists of multiple stages, including (a) feature matching, (b) two-view rotation estimation, and (c) rotation averaging.
Figure 2: Architecture of the proposed EAR-Net, which consists of the epipolar confidence graph construction module for learning relative rotations and their confidences, and the confidence-aware rotation averaging module for predicting the absolute rotations.
Figure 3: Architecture of the pairwise feature aggregation (PFA) unit. This unit takes two image feature maps as input and outputs the pairwise feature vector.
Figure 4: Architecture of the dual-branch decoder, which consists of the rotation branch to predict the relative rotation, and the confidence branch to predict the corresponding confidence. 'FC' denotes the fully connected layer.
Figure 5: Comparison of inference speed on the ScanNet dataset. All the methods are tested on an RTX 2080Ti GPU with batch size 1.
...and 5 more figures
