CLIPose: Category-Level Object Pose Estimation with Pre-trained Vision-Language Knowledge

Xiao Lin, Minghao Zhu, Ronghao Dang, Guangliang Zhou, Shaolong Shu, Feng Lin, Chengju Liu, Qijun Chen

TL;DR

CLIPose addresses data scarcity in category-level 6D object pose estimation by drawing on pre-trained vision-language knowledge from CLIP. It learns category-specific features by aligning point cloud, image, and pose-informed text representations through multi-modal contrastive learning, and it fine-tunes the CLIP image encoder with additional prompt tokens so that the image features become more sensitive to pose. The method uses a GPV-Pose-style pose estimator with a combined loss that includes rotation, translation, scale, and symmetry terms. It achieves state-of-the-art results on REAL275 and CAMERA25 and runs in real time (40 FPS). By relying on cross-modal semantic knowledge and lightweight prompt tuning, CLIPose reduces dependence on extensive 3D shape priors and shows potential for open-world pose estimation.

Abstract

Most existing category-level object pose estimation methods are devoted to learning object category information from the point cloud modality. However, the scale of 3D datasets is limited by the high cost of 3D data collection and annotation. Consequently, the category features extracted from these limited point cloud samples may not be comprehensive. This motivates us to investigate whether we can draw on knowledge from other modalities to obtain category information. We therefore propose CLIPose, a novel 6D pose framework that employs a pre-trained vision-language model to learn object category information more effectively by leveraging the abundant semantic knowledge in the image and text modalities. To make the 3D encoder learn category-specific features more efficiently, we align the representations of the three modalities in feature space via multi-modal contrastive learning. In addition to exploiting the pre-trained knowledge of the CLIP model, we also want it to be more sensitive to pose parameters. Therefore, we introduce a prompt tuning approach to fine-tune the image encoder, and we incorporate rotation and translation information into the text descriptions. CLIPose achieves state-of-the-art performance on two mainstream benchmark datasets, REAL275 and CAMERA25, and runs in real time during inference (40 FPS).
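
The three-way alignment described in the abstract can be illustrated with a minimal sketch. It assumes a symmetric InfoNCE-style loss in which the point cloud feature of each object is pulled toward the image and text features of the same object and pushed away from those of the other objects in the batch. The function names, the temperature value, and the equal weighting of the two terms are illustrative assumptions, not details taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, targets, temperature=0.07):
    """Mean cross-entropy over rows of the similarity matrix.

    Row i's positive is column i; every other column is a negative.
    The temperature is an assumed value, not one reported by the source.
    """
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(a)


def symmetric_contrastive(x, y, temperature=0.07):
    """Average of the x-to-y and y-to-x InfoNCE terms."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))


def alignment_loss(point_feats, image_feats, text_feats, temperature=0.07):
    """Align point cloud features to both image and text features (assumed equal weights)."""
    return (symmetric_contrastive(point_feats, image_feats, temperature)
            + symmetric_contrastive(point_feats, text_feats, temperature))


# Toy batch of two objects with 2-D features.
points = [[1.0, 0.0], [0.0, 1.0]]
aligned = alignment_loss(points, points, points)
shuffled = alignment_loss(points, [[0.0, 1.0], [1.0, 0.0]], points)
```

In this toy batch the loss is close to zero when each point cloud feature matches its own image and text features, and it grows when the image features are swapped between the two objects.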

Paper Structure (13 sections, 16 equations, 5 figures, 9 tables)

This paper contains 13 sections, 16 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Illustration of CLIPose. The proposed CLIPose takes as input a triplet of objects (point cloud, image and text), and aligns the representations of three modalities in feature space via contrastive learning. This enables the network to obtain more robust category-specific information from vision-language pre-trained knowledge.
  • Figure 2: Overview of CLIPose. The inputs of our framework are a batch of objects (e.g., laptop) represented as triplets (point cloud, image, text), where we append rotation and translation information to the text description. Image and text features are extracted from a pre-trained (Frozen) vision and language model, and point cloud features are obtained by a 3D encoder (Tuning). Contrastive losses are applied to align the 3D representation of an object with its image and text representations during pre-training. We use CLIP's inherent classification ability to predict the category of the input object ($m$ is the number of categories) to form the classification loss. The image encoder is fine-tuned with additional prompt tokens $P_0$ (bottom right).
  • Figure 3: Comparison of the loss functions in CE and NCE. Darker colors indicate higher similarity values. $m$ denotes the number of predefined categories and $n$ the number of input samples. The ground truth list [···] for each input batch can be converted into a similarity ground truth matrix using one-hot encoding.
  • Figure 4: Qualitative results of IST-Net (red lines) and our method (green lines). The ground truth is shown with white lines. The regions compared in detail are annotated with yellow lines.
  • Figure 5: Visualization of ablation studies on prompt length. The model exhibits improved performance when the length is kept at 50 or below. As the length increases, the outcomes tend to decline.
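
The one-hot construction mentioned in the Figure 3 caption can be sketched as follows. This is one reading of the caption, not the authors' code: for CE, each of the $n$ samples is assumed to carry a category index in $[0, m)$, giving an $n \times m$ target matrix; for NCE, sample $i$ is assumed to be matched only with its own counterpart $i$ in the batch, giving an $n \times n$ identity matrix. The helper names are hypothetical.

```python
def one_hot_matrix(indices, width):
    """Turn a list of target indices into a row-wise one-hot matrix."""
    return [[1 if j == idx else 0 for j in range(width)] for idx in indices]


def ce_targets(category_ids, num_categories):
    """n x m targets: row i marks the category of sample i (assumed CE setup)."""
    return one_hot_matrix(category_ids, num_categories)


def nce_targets(batch_size):
    """n x n targets: row i marks sample i itself as the only positive (assumed NCE setup)."""
    return one_hot_matrix(range(batch_size), batch_size)


# Two samples over three categories, then a batch of three samples.
ce = ce_targets([2, 0], 3)
nce = nce_targets(3)
```

Under these assumptions the CE targets depend on the category labels, while the NCE targets depend only on the batch size.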