Learning Structure-from-Motion with Graph Attention Networks

Lucas Brynte, José Pedro Iglesias, Carl Olsson, Fredrik Kahl

TL;DR

The paper tackles Structure-from-Motion by replacing the conventional initialization pipeline with a learned model: a Graph Attention Network that takes sparse 2D keypoints across multiple views and outputs camera poses and 3D point coordinates. It introduces a multi-type, cross-attentional graph architecture with projection, view, scene-point, and global features, built from GATv2 layers with 4 attention heads and followed by regression heads for Euclidean or projective reconstructions. Data augmentation and artificial outlier injection are used to improve generalization and robustness. Experiments show accuracy competitive with COLMAP and lower runtime than traditional pipelines, especially after applying BA. The work demonstrates strong generalization to unseen scenes, highlights the benefits of learned priors in SfM, and points to directions for scaling, end-to-end integration, and refined pose representations.

Abstract

In this paper we tackle the problem of learning Structure-from-Motion (SfM) through the use of graph attention networks. SfM is a classic computer vision problem that is solved through iterative minimization of reprojection errors, referred to as Bundle Adjustment (BA), starting from a good initialization. In order to obtain a good enough initialization to BA, conventional methods rely on a sequence of sub-problems (such as pairwise pose estimation, pose averaging or triangulation) which provide an initial solution that can then be refined using BA. In this work we replace these sub-problems by learning a model that takes as input the 2D keypoints detected across multiple views, and outputs the corresponding camera poses and 3D keypoint coordinates. Our model takes advantage of graph neural networks to learn SfM-specific primitives, and we show that it can be used for fast inference of the reconstruction for new and unseen sequences. The experimental results show that the proposed model outperforms competing learning-based methods, and challenges COLMAP while having lower runtime. Our code is available at https://github.com/lucasbrynte/gasfm/.
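
To make the reprojection error mentioned in the abstract concrete, the following is a minimal pure-Python sketch, not code from the authors' repository. It assumes a standard pinhole camera with intrinsics `K`, rotation `R`, and translation `t`; the function names `project` and `mean_reprojection_error` and the dictionary layout of `observations` are illustrative assumptions. It reports the mean Euclidean pixel error, whereas BA typically minimizes a sum of squared residuals over the same quantities.

```python
import math

def project(K, R, t, X):
    """Project 3D point X with a pinhole camera K [R | t]; returns pixel (u, v)."""
    # Camera-frame coordinates: Xc = R X + t
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    # Homogeneous image coordinates: x = K Xc
    x = [sum(K[i][j] * Xc[j] for j in range(3)) for i in range(3)]
    # Perspective division
    return (x[0] / x[2], x[1] / x[2])

def mean_reprojection_error(cameras, points, observations):
    """Mean pixel distance between observed and reprojected keypoints.

    cameras: list of (K, R, t) tuples, one per view (hypothetical layout).
    points: list of 3D points.
    observations: dict mapping (view index, point index) -> observed (u, v).
    """
    total = 0.0
    for (i, j), (u, v) in observations.items():
        K, R, t = cameras[i]
        pu, pv = project(K, R, t, points[j])
        total += math.hypot(pu - u, pv - v)
    return total / len(observations)

# Toy example: one camera at the origin looking down the z-axis.
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
points = [[0.0, 0.0, 2.0]]
observations = {(0, 0): (53.0, 54.0)}
error = mean_reprojection_error([(K, R, t)], points, observations)
```

Only the pairs present in `observations` contribute, which mirrors the fact that each scene point is seen in a subset of the views.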

Paper Structure

This paper contains 32 sections, 5 equations, 15 figures, 13 tables, and 6 algorithms.

Figures (15)

  • Figure 1: Illustration of the different types of features in the network architecture, for $m=3$ cameras and $n=6$ scene points. The projection features $\mathcal{P}$ are represented by a sparse vector-valued matrix, with its sparsity pattern determined by the point track measurements, while view features $\mathcal{V}$ and scene point features $\mathcal{S}$ are represented by dense vector-valued column and row vectors, respectively. A single vector $\mathbf{g}$ holds global features.
  • Figure 2: Illustration of the update of a single view feature. All $\mathcal{V}$ features are updated based on their previous value and the corresponding rows of $\mathcal{P}$.
  • Figure 3: Illustration of the update of a single scene point feature. All $\mathcal{S}$ features are updated based on their previous value and the corresponding columns of $\mathcal{P}$.
  • Figure 4: Illustration of the global feature update, where $\mathbf{g}$ is updated based on its previous value, together with an aggregation of all $\mathcal{V}$ and $\mathcal{S}$ features.
  • Figure 5: Illustration of the update of a single projection feature. All $\mathcal{P}$ features are updated based on their previous value, as well as initial projection features $\mathcal{P}_0$, the global feature $\mathbf{g}$, and the corresponding $\mathcal{V}$ and $\mathcal{S}$ features.
  • ...and 10 more figures
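
The feature layout and the view and scene-point updates described in Figures 1 to 3 can be sketched as follows. This is an illustrative simplification under stated assumptions, not the authors' implementation: the sparse matrix $\mathcal{P}$ is stored as a dictionary keyed by (view index, point index), and the GATv2 attention aggregation is swapped for a plain mean followed by a residual sum. The name `update_from_projections` and the toy sizes ($m=2$, $n=3$, feature dimension 2) are hypothetical.

```python
def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def update_from_projections(features, P, axis):
    """Update per-view (axis=0) or per-point (axis=1) features from P.

    features: dense list of feature vectors (V or S).
    P: sparse projection features, dict (view, point) -> feature vector.
    Assumption: mean aggregation stands in for the attention mechanism.
    """
    updated = []
    for idx, feat in enumerate(features):
        # Row idx of P for views, column idx of P for scene points.
        neighbours = [p for key, p in P.items() if key[axis] == idx]
        agg = mean(neighbours) if neighbours else [0.0] * len(feat)
        # Combine the previous value with the aggregated projection features.
        updated.append([f + a for f, a in zip(feat, agg)])
    return updated

# Sparsity pattern: view 0 observes points 0 and 1, view 1 observes point 2.
P = {(0, 0): [1.0, 1.0], (0, 1): [3.0, 5.0], (1, 2): [2.0, 2.0]}
V = [[0.0, 0.0], [1.0, 1.0]]              # one feature vector per view
S = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # one feature vector per scene point

V_new = update_from_projections(V, P, axis=0)
S_new = update_from_projections(S, P, axis=1)
```

Storing only the observed (view, point) pairs keeps memory proportional to the number of point-track measurements rather than to $m \times n$.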