A Novel Convolution and Attention Mechanism-based Model for 6D Object Pose Estimation

Alexander Du, Xiujin Liu

TL;DR

PoseLecTr is a graph-based encoder-decoder for 6D object pose estimation from monocular RGB images. It addresses the limitations of grid-based CNNs by modeling spatial relations explicitly with a graph and by introducing a Legendre convolution for numerical stability. The method integrates spatial-attention and self-attention distillation to refine feature selection, and uses a graph Laplacian framework to capture both local and global dependencies. On LINEMOD, Occluded LINEMOD, and YCB-Video, PoseLecTr achieves competitive performance, with ablations showing the Legendre kernel and the attention components to be key contributors to robustness in cluttered and occluded scenes. The work combines a graph-based representation with numerically stable convolution and targeted attention mechanisms to improve accuracy on texture-less, symmetric, and occluded objects in realistic settings.

Abstract

This paper proposes PoseLecTr, a graph-based encoder-decoder framework that integrates a novel Legendre convolution with attention mechanisms for six-degree-of-freedom (6-DOF) object pose estimation from monocular RGB images. Conventional learning-based approaches predominantly rely on grid-structured convolutions, which can limit their ability to model higher-order and long-range dependencies among image features, especially in cluttered or occluded scenes. PoseLecTr addresses this limitation by constructing a graph representation from image features, where spatial relationships are explicitly modeled through graph connectivity. The proposed framework incorporates a Legendre convolution layer to improve numerical stability in graph convolution, together with spatial-attention and self-attention distillation to enhance feature selection. Experiments conducted on the LINEMOD, Occluded LINEMOD, and YCB-Video datasets demonstrate that our method achieves competitive performance and shows consistent improvements across a wide range of objects and scene complexities.
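
The abstract does not spell out the layer's equations, so the following is only a minimal sketch of what a Legendre-polynomial graph convolution over image features could look like. It assumes a 4-neighbour grid graph with one node per feature-map cell, a symmetric normalized Laplacian rescaled so its spectrum lies in [-1, 1], and a filter built with the standard Bonnet recurrence for Legendre polynomials. The function names (`grid_adjacency`, `scaled_laplacian`, `legendre_graph_conv`) and the graph construction are illustrative assumptions, not PoseLecTr's actual implementation.

```python
import numpy as np

def grid_adjacency(h, w):
    """4-neighbour adjacency for an h x w feature map (assumed graph construction)."""
    n = h * w
    A = np.zeros((n, n))
    for r in range(h):
        for c in range(w):
            i = r * w + c
            if c + 1 < w:            # edge to the right-hand neighbour
                A[i, i + 1] = A[i + 1, i] = 1.0
            if r + 1 < h:            # edge to the neighbour below
                A[i, i + w] = A[i + w, i] = 1.0
    return A

def scaled_laplacian(A):
    """Rescaled normalized Laplacian L - I, whose eigenvalues lie in [-1, 1].

    Uses the bound lambda_max <= 2 of the normalized Laplacian
    L = I - D^{-1/2} A D^{-1/2}, so that 2 L / lambda_max - I = L - I.
    """
    deg = A.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    return L - np.eye(len(A))

def legendre_graph_conv(X, L_tilde, weights):
    """Compute sum_k P_k(L_tilde) @ X @ weights[k] with Legendre polynomials P_k.

    X: (num_nodes, f_in) node features.
    weights: list of (f_in, f_out) matrices, one per polynomial order.
    Bonnet recurrence: (k + 1) P_{k+1}(x) = (2k + 1) x P_k(x) - k P_{k-1}(x).
    """
    T_prev = X                               # P_0(L) X = X
    out = T_prev @ weights[0]
    if len(weights) == 1:
        return out
    T_curr = L_tilde @ X                     # P_1(L) X = L X
    out = out + T_curr @ weights[1]
    for k in range(1, len(weights) - 1):
        T_next = ((2 * k + 1) * (L_tilde @ T_curr) - k * T_prev) / (k + 1)
        out = out + T_next @ weights[k + 1]
        T_prev, T_curr = T_curr, T_next
    return out

# Usage: an 8 x 8 feature map with 16 channels, order-3 filter, 32 output channels.
rng = np.random.default_rng(0)
features = rng.normal(size=(8 * 8, 16))
L_tilde = scaled_laplacian(grid_adjacency(8, 8))
W = [rng.normal(scale=0.1, size=(16, 32)) for _ in range(4)]
out_features = legendre_graph_conv(features, L_tilde, W)
```

Rescaling the Laplacian matters here because Legendre polynomials are bounded by 1 in magnitude on [-1, 1], which keeps each basis term from growing with the polynomial order. Whether this matches the stability argument made in the paper is an assumption.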

Paper Structure

This paper contains 29 sections, 11 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: The architecture of PoseLecTr
  • Figure 2: Visualization of Ablation Studies