Consensus Learning with Deep Sets for Essential Matrix Estimation

Dror Moran, Yuval Margalit, Guy Trostianetsky, Fadi Khatib, Meirav Galun, Ronen Basri

TL;DR

A simpler network architecture based on Deep Sets is proposed that identifies outlier point matches, models the displacement noise in inlier matches, and achieves recovery accuracy superior to that of existing networks with significantly more complex architectures.

Abstract

Robust estimation of the essential matrix, which encodes the relative position and orientation of two cameras, is a fundamental step in structure from motion pipelines. Recent deep-based methods achieved accurate estimation by using complex network architectures that involve graphs, attention layers, and hard pruning steps. Here, we propose a simpler network architecture based on Deep Sets. Given a collection of point matches extracted from two images, our method identifies outlier point matches and models the displacement noise in inlier matches. A weighted DLT module uses these predictions to regress the essential matrix. Our network achieves accurate recovery that is superior to existing networks with significantly more complex architectures.
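The abstract's pipeline ends in a weighted DLT module that regresses the essential matrix from per-match predictions. As a rough illustration of that final step (not the paper's implementation; function and variable names here are invented), a weighted eight-point DLT solve might look like this: each match contributes one epipolar constraint row, rows are scaled by the network's predicted weights, and the least-squares solution is projected onto the essential-matrix manifold.

```python
import numpy as np

def weighted_dlt_essential(x1, x2, w):
    """Weighted DLT sketch (illustrative, not the paper's code).
    x1, x2: (N, 3) matched points in normalized homogeneous coordinates.
    w: (N,) nonnegative per-match weights (e.g., predicted inlier scores).
    Returns a 3x3 essential matrix estimate."""
    # Each row encodes the epipolar constraint x2^T E x1 = 0 as a linear
    # equation in the 9 entries of E (row-major vectorization of E).
    A = np.stack([np.kron(p2, p1) for p1, p2 in zip(x1, x2)])  # (N, 9)
    A = A * w[:, None]                      # scale each constraint by its weight
    _, _, Vt = np.linalg.svd(A)             # least-squares null vector of A
    E = Vt[-1].reshape(3, 3)
    # Project onto the essential manifold: singular values (1, 1, 0).
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt
```

With noise-free inliers and uniform weights, the recovered matrix satisfies the epipolar constraint for every match up to numerical precision; in the paper's setting the weights come from the network's outlier/noise predictions.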

Paper Structure

This paper contains 18 sections, 7 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Network architecture. Noise Aware Consensus Network (NACNet) architecture, see text for details.
  • Figure 2: NACNet point location denoising on a line-fitting task. The set $X$ (right panel) is composed of 90% outliers (marked in grey) and (noisy) inliers (red). Our model predicts the denoised version $\hat{X}$ (purple, left panel). Evidently, the prediction of the positional noise, yielding noise-free inliers, agrees with the line model.
  • Figure 3: Distributions of outliers in the different datasets. Histograms showing for each dataset (YFCC and Sun3D) and feature descriptor (SIFT or SuperPoint) the number of image pairs (the Y-axis) with a given fraction of outlier matches (the X-axis). The means and standard deviations (from left to right) are $0.89 \pm 0.06, 0.77 \pm 0.14, 0.92 \pm 0.08$.
  • Figure 4: NACNet inlier/outlier classification. An example from the SUN3D dataset. Left to right: input image pairs, input matches, and our model's predicted inliers. Color mark ground truth labels: inlier matches are marked in green; outliers are marked in red.
  • Figure 5: Denoising evaluation. Reprojection error of inlier keypoints before and after applying our denoising scheme, computed using the ground truth pose. The box plots show the 0.25, 0.5, and 0.75 quantiles. The left two plots represent the evaluation over all the image pairs in the YFCC dataset. The right two plots focus on image pairs whose pose prediction was accurate (i.e., pose error below $5^{\circ}$, where the pose error is defined as the maximum of the translation and rotation angular errors). Evaluation was conducted on the YFCC dataset using SIFT descriptors.
  • ...and 2 more figures
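The architecture sketched in Figure 1 builds on Deep Sets, whose key property is that the network treats the input matches as an unordered set. A minimal permutation-equivariant layer of that kind (an illustrative sketch under assumed shapes, not NACNet's actual layer) combines a per-point transform with a pooled global context:

```python
import numpy as np

def deep_sets_layer(X, W_local, W_global, b):
    """Permutation-equivariant Deep Sets layer (illustrative sketch).
    X: (N, d_in) per-match features; returns (N, d_out).
    Each point is transformed independently by W_local, plus a shared
    term computed from the mean over the whole set, so permuting the
    rows of X permutes the rows of the output identically."""
    pooled = X.mean(axis=0, keepdims=True)                       # (1, d_in) set context
    return np.maximum(X @ W_local + pooled @ W_global + b, 0.0)  # ReLU activation
```

Because every per-point operation is shared and the only cross-point interaction is a symmetric pooling, stacks of such layers stay equivariant to the ordering of the matches, which is what lets a simple architecture reason about consensus across the set without graphs or attention.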