A Novel Convolution and Attention Mechanism-based Model for 6D Object Pose Estimation
Alexander Du, Xiujin Liu
TL;DR
PoseLecTr is a graph-based encoder-decoder for 6D object pose estimation from monocular RGB images. It addresses a limitation of grid-based CNNs by explicitly modeling spatial relations with a graph, and it introduces a Legendre convolution to improve numerical stability. The method integrates spatial-attention and self-attention distillation to refine feature selection, and uses a graph Laplacian framework to capture both local and global dependencies. On LINEMOD, Occluded LINEMOD, and YCB-Video, PoseLecTr achieves competitive performance, with ablations identifying the Legendre kernel and the attention components as key contributors to robustness in cluttered and occluded scenes. The work advances practical 6D pose estimation by combining a graph-based representation with numerically stable convolution and targeted attention mechanisms, improving accuracy for texture-less, symmetric, and occluded objects in realistic settings.
Abstract
This paper proposes PoseLecTr, a graph-based encoder-decoder framework that integrates a novel Legendre convolution with attention mechanisms for six-degree-of-freedom (6-DoF) object pose estimation from monocular RGB images. Conventional learning-based approaches rely predominantly on grid-structured convolutions, which can limit their ability to model higher-order and long-range dependencies among image features, especially in cluttered or occluded scenes. PoseLecTr addresses this limitation by constructing a graph representation from image features, in which spatial relationships are modeled explicitly through graph connectivity. The framework incorporates a Legendre convolution layer to improve the numerical stability of graph convolution, together with spatial-attention and self-attention distillation to enhance feature selection. Experiments on the LINEMOD, Occluded LINEMOD, and YCB-Video datasets show that the method achieves competitive performance, with consistent improvements across a wide range of objects and scene complexities.
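The abstract does not spell out how the Legendre convolution is defined, so the following is only a minimal sketch of one plausible reading: a polynomial graph filter whose basis is built from Legendre polynomials of a rescaled graph Laplacian, evaluated with Bonnet's recurrence $(k+1)P_{k+1}(x) = (2k+1)\,x\,P_k(x) - k\,P_{k-1}(x)$. The 4-neighbour grid graph, the function names (`grid_adjacency`, `scaled_laplacian`, `legendre_basis`, `legendre_conv`), and the weight layout are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def grid_adjacency(h, w):
    """4-neighbour adjacency for an h x w feature map (one node per cell).
    Assumption: the paper's graph construction may differ."""
    n = h * w
    A = np.zeros((n, n))
    for r in range(h):
        for c in range(w):
            i = r * w + c
            if c + 1 < w:                      # right neighbour
                A[i, i + 1] = A[i + 1, i] = 1.0
            if r + 1 < h:                      # bottom neighbour
                A[i, i + w] = A[i + w, i] = 1.0
    return A


def scaled_laplacian(A):
    """Symmetric normalized Laplacian shifted so its spectrum lies in [-1, 1]."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    eye = np.eye(len(A))
    L = eye - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    # Eigenvalues of L lie in [0, 2]; subtracting I maps them to [-1, 1],
    # the interval on which Legendre polynomials are orthogonal.
    return L - eye


def legendre_basis(L_hat, X, K):
    """Return [P_0(L_hat) X, ..., P_K(L_hat) X] via Bonnet's recurrence."""
    terms = [X]
    if K >= 1:
        terms.append(L_hat @ X)
    for k in range(1, K):
        nxt = ((2 * k + 1) * (L_hat @ terms[k]) - k * terms[k - 1]) / (k + 1)
        terms.append(nxt)
    return terms


def legendre_conv(L_hat, X, weights):
    """Polynomial graph filter: sum_k P_k(L_hat) X W_k.
    weights has shape (K + 1, F_in, F_out) (hypothetical layout)."""
    K = weights.shape[0] - 1
    terms = legendre_basis(L_hat, X, K)
    return sum(T @ W for T, W in zip(terms, weights))


rng = np.random.default_rng(0)
A = grid_adjacency(4, 4)                       # 16 nodes
L_hat = scaled_laplacian(A)
X = rng.standard_normal((16, 8))               # 8 input features per node
W = rng.standard_normal((4, 8, 5)) * 0.1       # order K = 3, 5 output features
Y = legendre_conv(L_hat, X, W)
print(Y.shape)
```

Because the recurrence only needs repeated products with `L_hat`, an order-K filter mixes information from nodes up to K hops apart without an eigendecomposition. Whether PoseLecTr uses this exact normalization or recurrence cannot be confirmed from the abstract alone.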
