CVAM-Pose: Conditional Variational Autoencoder for Multi-Object Monocular Pose Estimation

Jianyu Zhao, Wei Quan, Bogdan J. Matuszewski

TL;DR

CVAM-Pose uses a label-embedded conditional variational autoencoder to implicitly learn regularised representations of multiple objects in a single low-dimensional latent space, and it outperforms competing latent-space approaches.

Abstract

Estimating the poses of rigid objects is one of the fundamental problems in computer vision, with a range of applications across automation and augmented reality. Most existing approaches adopt a one-network-per-object-class strategy, depend heavily on objects' 3D models and depth data, and employ time-consuming iterative refinement, which can be impractical for some applications. This paper presents a novel approach, CVAM-Pose, for multi-object monocular pose estimation that addresses these limitations. The CVAM-Pose method employs a label-embedded conditional variational autoencoder network to implicitly abstract regularised representations of multiple objects in a single low-dimensional latent space. This autoencoding process uses only images captured by a projective camera and is robust to object occlusion and scene clutter. The object classes are one-hot encoded and embedded throughout the network. The proposed label-embedded pose regression strategy interprets the learnt latent space representations using continuous pose representations. Ablation tests and systematic evaluations demonstrate the scalability and efficiency of the CVAM-Pose method in multi-object scenarios. CVAM-Pose outperforms competing latent space approaches. For example, it is 25% and 20% better than the AAE and Multi-Path methods, respectively, when evaluated using the $\mathrm{AR_{VSD}}$ metric on the Linemod-Occluded dataset. It also achieves results somewhat comparable to those of methods reliant on 3D models reported in the BOP challenges. Code available: https://github.com/JZhao12/CVAM-Pose
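The label embedding and latent regularisation described in the abstract can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the actual network is described only at a high level here, so the function names (`one_hot`, `embed_label`, `reparameterise`, `kl_to_standard_normal`), the concatenation scheme, and the toy dimensions are all assumptions. The sketch shows the standard ingredients of a conditional VAE: a one-hot class label appended to a feature vector, the reparameterisation trick, and the KL term that pulls the latent distribution towards a standard normal.

```python
import math
import random


def one_hot(label: int, num_classes: int) -> list[float]:
    """Return a one-hot vector for an object class label."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def embed_label(features: list[float], label: int, num_classes: int) -> list[float]:
    """Condition a feature vector on the class by appending the one-hot label.

    Assumption: plain concatenation; the paper may embed labels differently.
    """
    return features + one_hot(label, num_classes)


def reparameterise(mu: list[float], log_var: list[float], rng: random.Random) -> list[float]:
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), elementwise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu: list[float], log_var: list[float]) -> float:
    """KL divergence from a diagonal Gaussian N(mu, sigma^2) to N(0, I)."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


# Toy usage: 3 hypothetical image features, object class 2 out of 4 classes.
rng = random.Random(0)
conditioned = embed_label([0.2, -1.3, 0.7], label=2, num_classes=4)

# A 2-D latent whose posterior already equals the prior, so the KL term is zero.
mu, log_var = [0.0, 0.0], [0.0, 0.0]
z = reparameterise(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

In a full model the conditioned vector would feed further encoder or decoder layers, and the KL term would be added to a reconstruction loss; both are left out to keep the sketch short.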

Paper Structure

This paper contains 15 sections, 2 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: During training, the label-embedded CVAE network abstracts information from both images of objects and the corresponding categorical labels in the latent space, which are then interpolated to multi-object 6-DoF poses using MLPs. The images of objects are taken from the Linemod PBR dataset (hinterstoisser2013model; hodavn2019photorealistic; hodavn2020bop; pbrdata).
  • Figure 2: The proposed label-embedded conditional variational autoencoder network. The images of objects are taken from the Linemod PBR dataset (hinterstoisser2013model; hodavn2019photorealistic; hodavn2020bop; pbrdata).
  • Figure 3: The output images from the decoder show objects' representations with occlusion and clutter removed. The test input images, also shown, are taken from the Linemod-Occluded dataset (brachmann2014learning; lmodata).
  • Figure 4: The proposed label-embedded pose regression approach interpolates multi-object representations to continuous pose representations using multiple MLP heads.
  • Figure 5: Box plots of the MSPD metric as a function of the objects' visibility rates. The number of data instances for each rate is shown above each box. Please note that for better visualisation, the MSPD metric is calculated using thresholds ranging from $1$ to $50$ with a step of $1$, instead of using the thresholds (from $5$ to $50$ with a step of $5$) defined in the BOP challenges.
  • ...and 2 more figures
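
The two threshold grids mentioned in the Figure 5 caption can be compared with a short sketch. It assumes that the metric is an average recall, where a pose counts as correct at a given threshold when its MSPD error is strictly below that threshold, and it treats thresholds as plain pixel values without any image-size scaling an evaluation toolkit might apply. The helper names and the error values are hypothetical and not taken from the paper.

```python
def recall_at(errors: list[float], threshold: float) -> float:
    """Fraction of pose estimates whose error is strictly below the threshold."""
    if not errors:
        return 0.0
    return sum(e < threshold for e in errors) / len(errors)


def average_recall(errors: list[float], thresholds: list[int]) -> float:
    """Mean of the recall values over all thresholds."""
    return sum(recall_at(errors, t) for t in thresholds) / len(thresholds)


# Coarse grid named in the caption as the BOP definition: 5, 10, ..., 50.
bop_thresholds = list(range(5, 51, 5))
# Finer grid used for the box plots: 1, 2, ..., 50.
fine_thresholds = list(range(1, 51))

# Hypothetical MSPD errors in pixels for four pose estimates.
errors_px = [3.0, 12.0, 48.0, 80.0]

ar_bop = average_recall(errors_px, bop_thresholds)    # 0.475
ar_fine = average_recall(errors_px, fine_thresholds)  # 0.435
```

The finer grid samples the same recall curve more densely, so scores for small groups of instances change in smaller steps, which is consistent with the caption's stated aim of better visualisation.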