Fusing Structure from Motion and Simulation-Augmented Pose Regression from Optical Flow for Challenging Indoor Environments

Felix Ott, Lucas Heublein, David Rügamer, Bernd Bischl, Christopher Mutschler

TL;DR

This work tackles robust indoor localization from monocular imagery by fusing absolute poses, derived from Structure from Motion (SfM) or Absolute Pose Regression (APR), with relative poses from optical-flow-based Relative Pose Regression (RPR). The authors introduce recurrent fusion networks that align and smooth the combined pose stream, compare them against Pose Graph Optimization (PGO), and report substantial improvements on a large, challenging warehouse-like dataset. A key contribution is simulation-augmented pre-training, which uses synthetic data to initialize the APR and RPR networks and improves generalization to unseen configurations. The results show that recurrent fusion consistently outperforms PGO and non-fusion baselines, especially with a strongly typed TRNN cell and two stacked layers, while a public Industry dataset and synthetic pre-training facilitate broader applicability. Overall, the approach makes localization more robust to environmental changes, motion dynamics, and feature-poor scenes, with practical implications for robotics and warehouse automation.

Abstract

The localization of objects is a crucial task in various applications such as robotics, virtual and augmented reality, and the transportation of goods in warehouses. Recent advances in deep learning have enabled localization using monocular cameras. While structure from motion (SfM) predicts the absolute pose from a point cloud, absolute pose regression (APR) methods learn a semantic understanding of the environment through neural networks. However, both approaches face challenges caused by the environment, such as motion blur, lighting changes, repetitive patterns, and featureless structures. This study aims to address these challenges by incorporating additional information and regularizing the absolute pose using relative pose regression (RPR) methods. RPR methods suffer from challenges of their own, e.g., motion blur. The optical flow between consecutive images is computed using the Lucas-Kanade algorithm, and the relative pose is predicted using a small auxiliary recurrent convolutional network. Fusing absolute and relative poses is a complex task due to the mismatch between the global and local coordinate systems. State-of-the-art methods that fuse absolute and relative poses use pose graph optimization (PGO) to regularize the absolute pose predictions using relative poses. In this work, we propose recurrent fusion networks to optimally align absolute and relative pose predictions and thereby improve the absolute pose prediction. We evaluate eight different recurrent units and construct a simulation environment to pre-train the APR and RPR networks for better generalization. Additionally, we record a large database of different scenarios in a challenging large-scale indoor environment that mimics a warehouse with transportation robots. We conduct hyperparameter searches and experiments to show the effectiveness of our recurrent fusion method compared to PGO.
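
To make the idea of regularizing absolute poses with relative poses more concrete, the following is a minimal sketch of a PGO-style least-squares fusion. It is not the authors' implementation and not the recurrent fusion network. It assumes translation only, a single coordinate axis, and relative displacements that are already expressed in the global frame, which sidesteps the coordinate-system mismatch the abstract mentions. The function name `fuse_positions` and the weights `w_abs` and `w_rel` are hypothetical names chosen for illustration.

```python
def fuse_positions(absolute, relative, w_abs=1.0, w_rel=10.0):
    """Translation-only pose-graph smoothing along one axis (illustrative).

    Minimizes
        sum_i w_abs * (p[i] - absolute[i])**2
      + sum_i w_rel * (p[i+1] - p[i] - relative[i])**2
    by solving the tridiagonal normal equations with the Thomas algorithm.
    """
    n = len(absolute)
    if len(relative) != n - 1:
        raise ValueError("need exactly one relative step per consecutive pair")

    # Assemble the symmetric tridiagonal system A p = b.
    diag = [w_abs] * n            # main diagonal of A
    off = [-w_rel] * (n - 1)      # sub- and super-diagonal of A
    rhs = [w_abs * a for a in absolute]
    for i, d in enumerate(relative):
        diag[i] += w_rel
        diag[i + 1] += w_rel
        rhs[i] -= w_rel * d
        rhs[i + 1] += w_rel * d

    # Forward elimination.
    for i in range(1, n):
        m = off[i - 1] / diag[i - 1]
        diag[i] -= m * off[i - 1]
        rhs[i] -= m * rhs[i - 1]

    # Back substitution.
    p = [0.0] * n
    p[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        p[i] = (rhs[i] - off[i] * p[i + 1]) / diag[i]
    return p


if __name__ == "__main__":
    # Hypothetical data: the third absolute estimate is an outlier,
    # while the relative steps consistently report +1 per frame.
    noisy_absolute = [0.0, 1.0, 5.0, 3.0]
    relative_steps = [1.0, 1.0, 1.0]
    print(fuse_positions(noisy_absolute, relative_steps))
```

For 2D or 3D positions the same function can be applied to each coordinate independently. A larger `w_rel` trusts the relative motion more and pulls outlying absolute estimates toward a trajectory consistent with it. Handling rotations and local-frame displacements requires a nonlinear formulation that this sketch deliberately leaves out.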

Paper Structure (36 sections, 4 equations, 217 figures, 8 tables)

This paper contains 36 sections, 4 equations, 217 figures, 8 tables.

Figures (217)

  • Figure 1: Method overview. First, a point cloud is constructed from an image dataset using SfM to extract features, spatial consistency, overlap criterion, cluster sampling, and bundle adjustment (BA). Second, we train a small convolutional recurrent neural network to predict the relative pose $\Delta \mathbf{x}$ between two consecutive images. For the training of the fusion model and the evaluation step, the absolute pose $\mathbf{x}$ from the point cloud and the relative pose $\Delta \mathbf{x}^{tr}$ from the RPR model are retrieved for a query image. Last, the absolute pose is optimized with either PGO or a recurrent network. To compare with state-of-the-art methods, we replace the reconstruction and the point cloud in the prediction steps with the absolute pose prediction from the APR model.
  • Figure 2: SfM pipeline using bundle adjustment (BA) to reconstruct a point cloud from input images.
  • Figure 3: Images of timestep $t_{n-1}$.
  • Figure 4: Images of timestep $t_{n}$.
  • Figure 5: Optical flow visualization.
  • ...and 212 more figures