Sparse Color-Code Net: Real-Time RGB-Based 6D Object Pose Estimation on Edge Devices

Xingjian Yang; Zhitao Yu; Ashis G. Banerjee

TL;DR

This work addresses real-time, RGB-based 6D object pose estimation on edge devices by introducing SCCN, a three-stage pipeline that combines Sobel-contour features, sparse color-code regression, and a symmetry-aware representation to handle occlusion and object symmetry. The key contributions are an anisotropic color-code and a novel per-pixel symmetry mask, which together yield efficient and accurate 2D–3D correspondences; a PnP solver then recovers the pose from a sparsified point set. On an NVIDIA Jetson AGX Xavier, the approach runs at approximately 19 FPS for a single object (LINEMOD) and 6 FPS for multiple objects (Occlusion LINEMOD) with competitive accuracy. Ablation studies show that the anisotropic and symmetry components improve accuracy while preserving speed. The results demonstrate practical feasibility for mobile manipulation and AR, and the authors outline future work toward multi-instance pose estimation, improved generalization, and integration with recognition and probabilistic mapping systems.

Abstract

As robotics and augmented reality applications increasingly rely on precise and efficient 6D object pose estimation, real-time performance on edge devices is required for more interactive and responsive systems. Our proposed Sparse Color-Code Net (SCCN) embodies a clear and concise pipeline design to address this requirement. SCCN performs pixel-level predictions on the target object in the RGB image and exploits the sparsity of essential object geometry features to speed up the Perspective-n-Point (PnP) computation. Additionally, it introduces a novel pixel-level, geometry-based object symmetry representation that integrates with the initial pose predictions to resolve symmetric object ambiguities. On an NVIDIA Jetson AGX Xavier, SCCN achieves estimation rates of 19 frames per second (FPS) on the benchmark LINEMOD dataset and 6 FPS on the Occlusion LINEMOD dataset, while maintaining high estimation accuracy at these rates.
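The idea of feeding PnP only a sparse set of pixel-level predictions can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the contour threshold, and the assumption that color channels are stored in [0, 1] and decoded with a per-axis bounding-box minimum and extent are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int]                  # (u, v) image coordinates
Point3D = Tuple[float, float, float]     # model-frame coordinates


def decode_color(rgb: Sequence[float],
                 bbox_min: Sequence[float],
                 extent: Sequence[float]) -> Point3D:
    """Map a color-code value in [0, 1]^3 back to a 3D model coordinate.

    Assumption: channel c on an axis encodes bbox_min + c * extent.
    """
    return tuple(lo + c * e for c, lo, e in zip(rgb, bbox_min, extent))


def sparse_correspondences(color_code, contour, mask, contour_thresh,
                           bbox_min, extent) -> List[Tuple[Pixel, Point3D]]:
    """Keep 2D-3D pairs only at masked pixels with a strong contour response."""
    pairs = []
    for v, row in enumerate(color_code):
        for u, rgb in enumerate(row):
            if mask[v][u] and contour[v][u] >= contour_thresh:
                pairs.append(((u, v), decode_color(rgb, bbox_min, extent)))
    return pairs


# Tiny 2x2 example: only two pixels pass both the mask and contour tests.
color_code = [[(0.0, 0.0, 0.0), (1.0, 0.5, 0.0)],
              [(0.5, 0.5, 0.5), (0.0, 1.0, 1.0)]]
contour = [[0.9, 0.1],
           [0.8, 0.7]]
mask = [[1, 1],
        [0, 1]]
pairs = sparse_correspondences(color_code, contour, mask, 0.5,
                               bbox_min=(-1.0, -1.0, -1.0),
                               extent=(2.0, 2.0, 2.0))
```

The resulting pairs could then be passed to any PnP solver (for example, OpenCV's `solvePnPRansac`) together with the camera intrinsics. Fewer correspondences mean less work per frame, which is the speed-up the abstract refers to.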

Paper Structure (15 sections, 5 equations, 8 figures, 4 tables)

Introduction
RELATED WORK
METHODS
Object Segmentation
Optimal Mask Generation
Sparse Color Code Estimation
Symmetry Representation
Pose Estimation
EXPERIMENTS
Datasets
Evaluation Metrics
Implementation Details
Evaluation Results
Proof-of-concept Demonstration
CONCLUSIONS

Figures (8)

Figure 1: Left: Visualization of the color-code of an object. The object is normalized to a $1 \times 1 \times 1$ cube, with its longest dimension spanning the full range. The X, Y, and Z axes map to the R, G, and B color channels, respectively, giving each surface point a unique color based on its position. Right: Illustration of the input RGB image, image contour captured by the Sobel filter, the color-code estimation of the target object, and the sparse color-code representation, which is the final output of the pipeline.
Figure 2: An overview of the Sparse Color-Code Net pipeline. It takes an input RGB image and applies Sobel filters to extract contours. The contours and RGB image are fed into a UNet to generate a coarse object mask. The mask determines a bounding box, which is used to crop, pad, and resize the RGB image, contour, and mask. This combination is the input to another UNet that estimates the color-code and symmetry mask. Finally, with the contour and sampling mask (optional), PnP estimation is used to determine the object's pose.
Figure 3: The Sobel filters used to extract the contour. Two sets of kernels are used: the $3\times3$ kernels (padded to $5\times5$) capture finer details, and the $5\times5$ kernels capture broader, more general boundary information.
Figure 4: Visualization of the mask selection process. (a) The input scene RGB image. (b) The segmentation result after softmax, shown as a probability map. (c) The final masked area with its bounding box. (d), (e), (f) Masks obtained by thresholding the probability map at 0.9, 0.7, and 0.5. (g), (h), (i) Max-pooling maps (pooling factor 8): yellow marks the original max-pooling area, light green the overlapped area, and dark green the expanded selected area.
Figure 5: Different color-code visualization. The standard color-code (b) normalizes the object to fit within a $1 \times 1 \times 1$ cube, with its maximum dimension spanning the full color channel range. The anisotropic color-code (c) allows each dimension to occupy the full range of its corresponding color channel. For objects with reflective symmetry, the symmetric anisotropic color-code (d) enables each symmetric part to span the full range of the respective color channel.
...and 3 more figures
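The captions of Figures 1 and 5 describe two ways to turn a surface point into a color. The sketch below contrasts them under stated assumptions: the object is given as a list of (x, y, z) vertices, colors are expressed in [0, 1], and in the standard variant the shorter axes start at 0 rather than being centered (the captions do not specify this). The function names are hypothetical.

```python
def bounding_box(vertices):
    """Per-axis minimum and maximum of a list of (x, y, z) vertices."""
    mins = tuple(min(v[i] for v in vertices) for i in range(3))
    maxs = tuple(max(v[i] for v in vertices) for i in range(3))
    return mins, maxs


def standard_color_code(vertices):
    """Scale all axes by the longest dimension, so only that axis spans [0, 1]."""
    mins, maxs = bounding_box(vertices)
    longest = max(hi - lo for lo, hi in zip(mins, maxs))
    if longest == 0:
        return [(0.0, 0.0, 0.0) for _ in vertices]
    return [tuple((v[i] - mins[i]) / longest for i in range(3))
            for v in vertices]


def anisotropic_color_code(vertices):
    """Scale each axis by its own extent, so every channel spans [0, 1]."""
    mins, maxs = bounding_box(vertices)
    extents = [hi - lo for lo, hi in zip(mins, maxs)]
    return [tuple((v[i] - mins[i]) / extents[i] if extents[i] > 0 else 0.0
                  for i in range(3))
            for v in vertices]


# A box that is 4 long, 2 wide, and 1 tall (two opposite corners shown).
corners = [(0.0, 0.0, 0.0), (4.0, 2.0, 1.0)]
standard = standard_color_code(corners)
anisotropic = anisotropic_color_code(corners)
```

For this elongated box, the standard encoding uses only half of the green channel and a quarter of the blue channel, while the anisotropic encoding uses the full range of all three. That wider range per axis is the property Figure 5 attributes to the anisotropic color-code.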
