VFM-VLM: Vision Foundation Model and Vision Language Model based Visual Comparison for 3D Pose Estimation

Md Selim Sarowar, Sungho Kim

TL;DR

This work systematically compares CLIP-based (semantic grounding) and DINOv2-based (dense geometric) architectures for 6D pose estimation in hand-object grasping. CLIP excels in semantic consistency and task understanding, while DINOv2 delivers superior geometric precision via dense feature correspondences and refinement. Quantitative results show DINOv2 achieving better geometric metrics (up to ~20% improvement in translation and rotation accuracy), with CLIP offering robust semantic guidance. The study highlights complementary strengths and motivates hybrid pipelines that fuse semantic grounding with geometric refinement for practical robotic manipulation.

Abstract

Vision Foundation Models (VFMs) and Vision Language Models (VLMs) have revolutionized computer vision by providing rich semantic and geometric representations. This paper presents a comprehensive visual comparison between CLIP-based and DINOv2-based approaches for 3D pose estimation in hand-object grasping scenarios. We evaluate both models on the task of 6D object pose estimation and demonstrate their complementary strengths: CLIP excels in semantic understanding through language grounding, while DINOv2 provides superior dense geometric features. Through extensive experiments on benchmark datasets, we show that CLIP-based methods achieve better semantic consistency, while DINOv2-based approaches demonstrate competitive performance with enhanced geometric precision. Our analysis provides insights for selecting appropriate vision models for robotic manipulation, grasping, and picking applications.

Paper Structure

This paper contains 21 sections, 2 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: CLIP Architecture: Dual-encoder framework with a Vision Transformer (ViT-B/32) for image encoding and a Text Transformer for language encoding. Both encoders project to a shared 512-dimensional embedding space where contrastive learning aligns matched image-text pairs.
  • Figure 2: DINOv2 Architecture: Self-supervised Vision Transformer (ViT-B/14) using a student-teacher framework with self-distillation. The model produces dense patch-level features (768-dim) with strong spatial correspondence, ideal for geometric reasoning tasks.
  • Figure 3: CLIP-based 6D pose estimation results on the driller object. Top row shows the RGB image, ground truth pose (green), predicted pose (red), and overlay comparison. Bottom row displays the RGB point cloud, GT 3D box, predicted 3D box, and complete scene. Evaluation metrics: ADD Distance: 32.17 mm, Rotation Error: 11.68°, Translation Error: 20.00 mm.
  • Figure 4: DINOv2-based 3D pose estimation results showing detection and localization of multiple objects in a cluttered scene. Blue and green bounding boxes represent predicted poses for different objects, demonstrating DINOv2's capability for dense geometric feature extraction and simultaneous multi-object pose estimation.
  • Figure 5: 3D scene reconstruction with a DINOv2 backbone, shown as a point cloud visualization. The figure shows the complete scene representation with the RGB point cloud and estimated 3D bounding boxes (red and green) overlaid on the detected objects. The spatial coordinates demonstrate accurate depth estimation and object localization in 3D space, with measurements in millimeters along the X, Y, and Z axes.
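
The three evaluation metrics quoted in the Figure 3 caption (ADD distance, rotation error, translation error) can be illustrated with a minimal sketch. This section does not give the paper's exact definitions, so the code below assumes the commonly used ones: ADD as the mean distance between model points transformed by the ground-truth and predicted poses, rotation error as the geodesic angle between the two rotation matrices, and translation error as the Euclidean distance between the two translation vectors. The function names and the list-based matrix representation are illustrative choices, not taken from the paper.

```python
import math

def transform(points, R, t):
    """Apply the rigid transform x -> R x + t to each 3D point."""
    return [
        tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        for p in points
    ]

def add_distance(points, R_gt, t_gt, R_pred, t_pred):
    """Assumed ADD definition: mean distance between corresponding model
    points under the ground-truth pose and the predicted pose."""
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    return sum(math.dist(a, b) for a, b in zip(gt, pred)) / len(points)

def rotation_error_deg(R_gt, R_pred):
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    # trace(R_gt^T R_pred) equals the sum of elementwise products.
    trace = sum(R_gt[i][j] * R_pred[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))

def translation_error(t_gt, t_pred):
    """Euclidean distance between the two translation vectors."""
    return math.dist(t_gt, t_pred)
```

If the model points and translations are given in millimeters, the ADD and translation errors come out in millimeters as well, matching the units in the caption. For symmetric objects a symmetric variant of ADD is often used instead; whether the paper does so is not stated here.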