Distilling 3D distinctive local descriptors for 6D pose estimation

Amir Hamza, Andrea Caraffa, Davide Boscaini, Fabio Poiesi

TL;DR

Zero-shot 6D pose estimation relies on powerful 3D local descriptors such as GeDi, but GeDi's slow inference limits its practical use. The authors introduce dGeDi, an object-oriented distillation framework in which a fast PointTransformerV3-based student regresses the descriptors of a frozen GeDi teacher, with supervision on target objects obtained through learning via correspondences. A novel loss that downweights unreliable supervision and a scalable training strategy make it possible to train on large-scale synthetic data while preserving descriptor distinctiveness. The resulting model is over 170 times faster than GeDi at inference while remaining competitive in accuracy. This work moves zero-shot 6D pose estimation closer to real-time feasibility and opens avenues for efficient geometric reasoning in robotics.

Abstract

Three-dimensional local descriptors are crucial for encoding geometric surface properties, making them essential for various point cloud understanding tasks. Among these descriptors, GeDi has demonstrated strong zero-shot 6D pose estimation capabilities but remains computationally impractical for real-world applications due to its expensive inference process. Can we retain GeDi's effectiveness while significantly improving its efficiency? In this paper, we explore this question by introducing a knowledge distillation framework that trains an efficient student model to regress local descriptors from a GeDi teacher. Our key contributions include: an efficient large-scale training procedure that ensures robustness to occlusions and partial observations while operating under compute and storage constraints, and a novel loss formulation that handles weak supervision from non-distinctive teacher descriptors. We validate our approach on five BOP Benchmark datasets and demonstrate a significant reduction in inference time while maintaining competitive performance with existing methods, bringing zero-shot 6D pose estimation closer to real-time feasibility. Project Website: https://tev-fbk.github.io/dGeDi/
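
The abstract mentions a loss that handles weak supervision from non-distinctive teacher descriptors, but this section does not give its formula. The following is only a minimal sketch of the general idea, assuming a per-point cosine-distance regression term and per-point reliability weights supplied by the caller. The function names and the weighting scheme are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two descriptors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero descriptor as uninformative
    return dot / (norm_a * norm_b)


def weighted_distillation_loss(student, teacher, weights):
    """Weighted mean of (1 - cosine similarity) over points.

    `weights` is an assumed per-point reliability score in [0, 1]:
    a low weight reduces the influence of an unreliable teacher descriptor.
    How such weights would be derived is NOT specified here.
    """
    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0
    weighted_sum = sum(
        w * (1.0 - cosine_similarity(s, t))
        for s, t, w in zip(student, teacher, weights)
    )
    return weighted_sum / total_weight


# Toy usage: the second point's teacher descriptor is treated as unreliable.
student = [[1.0, 0.0], [0.0, 1.0]]
teacher = [[1.0, 0.0], [1.0, 0.0]]
loss_uniform = weighted_distillation_loss(student, teacher, [1.0, 1.0])
loss_downweighted = weighted_distillation_loss(student, teacher, [1.0, 0.0])
```

With uniform weights the mismatched second point dominates the loss. Setting its weight to zero removes its contribution entirely, which is the effect downweighting is meant to approximate in a softer form.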

Paper Structure

This paper contains 14 sections, 2 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: We introduce dGeDi, a 3D point cloud encoder trained by distilling GeDi [poiesi2023gedi] features (Teacher, Top) into a PointTransformerV3 [wu2024ptv3] (PTV3) backbone (Student, Bottom). GeDi suffers from slow inference as it processes points sequentially (multiple input arrows), first extracting local reference frames (LRF) and then computing descriptors with PointNet++ [qi2017pn2] (PN++). Instead, dGeDi retains GeDi's generalization and distinctiveness while being over 170 times faster, making it ideal for real-time robotics applications.
  • Figure 2: Overview of dGeDi. Top: A Query point cloud is fed to both the teacher and the student networks. The (frozen) teacher produces distinctive 3D local descriptors, which guide the student descriptors via the proposed distillation loss. Bottom: Rather than computing teacher features for Target objects, we leverage learning via correspondences: we transfer the features from the query to the target point cloud using the ground-truth transformation available at training time. This significantly reduces the memory and compute overhead.
  • Figure 3: Comparison between teacher (center) and student (right) features on a query object from T-LESS (left). Colors represent PCA-reduced features. Student features $\mathcal{G}^Q$ are smoother and less noisy than teacher features $\mathcal{F}^Q$.
  • Figure 4: Qualitative results on the BOP Benchmark (left) comparing dGeDi (center) with GeDi (right). We overlay the object's 3D model transformed according to the predicted pose on the greyscale input image for better contrast. Key challenges: occlusions (a, d, e), object symmetry (b), partial view (c), multiple instances (d).
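
The caption of Figure 2 describes transferring teacher features from the query to the target point cloud using the ground-truth transformation, instead of running the teacher on the target. One way this could look is sketched below, assuming a brute-force nearest-neighbor match and a distance threshold `max_dist`. The matching rule, the threshold, and the function names are illustrative assumptions, not the authors' implementation.

```python
def transform(points, rotation, translation):
    """Apply a rigid transform (3x3 rotation, 3-vector translation) to 3D points."""
    return [
        tuple(
            sum(rotation[r][c] * p[c] for c in range(3)) + translation[r]
            for r in range(3)
        )
        for p in points
    ]


def transfer_features(query_points, query_features, target_points,
                      rotation, translation, max_dist=0.01):
    """Copy each target point's feature from its nearest transformed query point.

    Target points with no query point within `max_dist` get None, since they
    have no reliable correspondence (for example, points from an occluder).
    """
    moved = transform(query_points, rotation, translation)
    transferred = []
    for tp in target_points:
        best_idx, best_d2 = None, max_dist ** 2
        for i, qp in enumerate(moved):
            d2 = sum((a - b) ** 2 for a, b in zip(tp, qp))
            if d2 <= best_d2:
                best_idx, best_d2 = i, d2
        transferred.append(query_features[best_idx] if best_idx is not None else None)
    return transferred


# Toy usage: identity rotation, translation of one unit along x.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
query_points = [(0, 0, 0), (0, 1, 0)]
query_features = [[1.0], [2.0]]
target_points = [(1, 1, 0), (1, 0, 0), (5, 5, 5)]
target_features = transfer_features(
    query_points, query_features, target_points, identity, (1, 0, 0)
)
```

The brute-force search is quadratic in the number of points. For realistic point clouds a spatial index such as a k-d tree would be the usual replacement.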