VMLoc: Variational Fusion For Learning-Based Multimodal Camera Localization

Kaichen Zhou, Changhao Chen, Bing Wang, Muhamad Risqi U. Saputra, Niki Trigoni, Andrew Markham

TL;DR

VMLoc tackles robust camera localization by fusing RGB and depth in a variational latent space learned with a Product-of-Experts. It introduces an unbiased, importance-weighted ELBO and a geometric loss to enforce meaningful pose geometry, and its architecture combines two modality encoders, a fusion module, and a pose regressor with self-attention that predicts the 6-DoF pose. Results on the indoor 7-Scenes and outdoor Oxford RobotCar datasets show higher accuracy and robustness than RGB-only baselines and other multimodal variants, including under input degradation. The source code is available at https://github.com/kaichen-z/VMLoc.

Abstract

Recent learning-based approaches have achieved impressive results in the field of single-shot camera localization. However, how best to fuse multiple modalities (e.g., image and depth) and how to deal with degraded or missing input are less well studied. In particular, we note that previous approaches to deep fusion do not perform significantly better than models employing a single modality. We conjecture that this is because naive feature-space fusion through summation or concatenation does not take into account the different strengths of each modality. To address this, we propose an end-to-end framework, termed VMLoc, to fuse different sensor inputs into a common latent space through a variational Product-of-Experts (PoE) followed by attention-based fusion. Unlike previous multimodal variational works, which directly adapt the objective function of the vanilla variational auto-encoder, we show how camera localization can be accurately estimated through an unbiased objective function based on importance weighting. Our model is extensively evaluated on RGB-D datasets, and the results demonstrate its efficacy. The source code is available at https://github.com/kaichen-z/VMLoc.
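The abstract names two ingredients without giving formulas: a Product-of-Experts over per-modality posteriors and an importance-weighted objective. The sketch below illustrates both in their standard textbook form, assuming diagonal Gaussian experts and an optional standard normal prior expert. The function names `poe_fuse` and `importance_weighted_bound` are hypothetical and are not taken from the VMLoc repository; this is not the authors' implementation.

```python
import numpy as np


def poe_fuse(mus, logvars, include_prior=True):
    """Fuse diagonal Gaussian experts N(mu_i, exp(logvar_i)) via their product.

    mus, logvars: array-likes of shape (num_experts, latent_dim).
    Returns the mean and log-variance of the renormalized product Gaussian.
    """
    mus = np.asarray(mus, dtype=float)
    logvars = np.asarray(logvars, dtype=float)
    if include_prior:
        # Assumption: a standard normal prior N(0, I) acts as one extra expert.
        mus = np.vstack([np.zeros((1, mus.shape[1])), mus])
        logvars = np.vstack([np.zeros((1, logvars.shape[1])), logvars])
    precisions = np.exp(-logvars)  # 1 / sigma_i^2 for each expert
    fused_var = 1.0 / precisions.sum(axis=0)
    # Precision-weighted mean: confident experts dominate the fused estimate.
    fused_mu = fused_var * (precisions * mus).sum(axis=0)
    return fused_mu, np.log(fused_var)


def importance_weighted_bound(log_weights):
    """Stable log((1/K) * sum_k w_k) from K log importance weights, shape (K,)."""
    log_weights = np.asarray(log_weights, dtype=float)
    m = log_weights.max()
    return m + np.log(np.mean(np.exp(log_weights - m)))


# Hypothetical latents: a confident image expert and an uncertain depth expert.
rgb_mu, rgb_logvar = [1.0, 0.0], [np.log(0.1), np.log(0.1)]
depth_mu, depth_logvar = [0.0, 2.0], [np.log(10.0), np.log(10.0)]

fused_mu, fused_logvar = poe_fuse(
    [rgb_mu, depth_mu], [rgb_logvar, depth_logvar], include_prior=False
)

# A missing modality is handled by simply leaving its expert out of the product.
rgb_only_mu, rgb_only_logvar = poe_fuse([rgb_mu], [rgb_logvar], include_prior=False)
```

Because the fused mean is weighted by precision, the fused estimate stays close to the confident expert, unlike plain summation or concatenation, which treat both inputs alike. With a single sample (K = 1), `importance_weighted_bound` returns that sample's log weight unchanged; larger K averages the weights before taking the logarithm.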

Paper Structure

This paper contains 28 sections, 18 equations, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: (a) Training process of MVAE. (b) Training process of VMLoc. (c) Inference process of VMLoc.
  • Figure 2: Our framework consists of feature encoders, a fusion module, an attention mechanism module and a pose regressor.
  • Figure 3: Trajectories generated on LOOP1 (top) and FULL1 (bottom) by the baseline MapNet (a) and the proposed VMLoc (b). The yellow star denotes the starting point. Ground-truth trajectories are shown in black and predicted trajectories in red.
  • Figure 4: Input images and lidar projections under different levels of corruption.