Advanced Object Detection and Pose Estimation with Hybrid Task Cascade and High-Resolution Networks

Yuhui Jin, Yaqiong Zhang, Zheyuan Xu, Wenqing Zhang, Jingyu Xu

TL;DR

This work addresses 6D object detection and pose estimation under occlusion and clutter by combining Hybrid Task Cascade (HTC) with a High-Resolution Network (HRNet) backbone. The HTC+HRNet pipeline refines predictions over multiple stages while preserving high-resolution spatial detail, and is complemented by targeted post-processing and model ensembling. On public and private benchmarks, the reported results improve on prior state-of-the-art models in both detection (mAP, IoU) and pose accuracy (translation and rotation errors), with applications in robotics, augmented reality, and autonomous driving.

Abstract

In computer vision, 6D object detection and pose estimation are critical for applications such as robotics, augmented reality, and autonomous driving. Traditional methods often struggle to achieve accurate object detection and precise pose estimation at the same time. This study proposes an improved 6D object detection and pose estimation pipeline that builds on the existing 6D-VNet framework and integrates a Hybrid Task Cascade (HTC) with a High-Resolution Network (HRNet) backbone. By combining HTC's multi-stage refinement with HRNet's ability to maintain high-resolution representations, our approach improves both detection accuracy and pose estimation precision. We also introduce post-processing techniques and a model integration strategy that together improve performance on public and private benchmarks. Our method shows substantial improvements over state-of-the-art models on 6D object detection and pose estimation.
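
As an illustration of the pose accuracy measures named in the TL;DR (translation and rotation errors), the following is a minimal sketch of how such errors are commonly computed. It assumes the standard definitions (Euclidean distance between translation vectors, and the geodesic angle between rotation matrices); the paper's exact metric definitions are not given in this summary, and the function names and the `rot_z` helper are hypothetical.

```python
import math

def translation_error(t_pred, t_true):
    """Euclidean distance between predicted and ground-truth translations."""
    return math.dist(t_pred, t_true)

def rotation_error_deg(R_pred, R_true):
    """Geodesic angle in degrees between two 3x3 rotation matrices (nested lists)."""
    # trace(R_pred^T @ R_true) equals the sum of element-wise products.
    trace = sum(R_pred[i][j] * R_true[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def rot_z(deg):
    """Hypothetical helper: rotation about the z-axis by `deg` degrees."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

if __name__ == "__main__":
    # Example: a 30-degree yaw offset and a translation offset between two poses.
    print(rotation_error_deg(rot_z(0), rot_z(30)))
    print(translation_error((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))
```

Lower values indicate a better pose estimate for both measures; in practice they would be averaged over all matched detections in a benchmark.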

Paper Structure

This paper contains 20 sections, 14 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Enter Caption
  • Figure 2: Training metrics as a function of epoch.