Multi-modal Relation Distillation for Unified 3D Representation Learning

Huiqun Wang, Yiping Bao, Panwang Pan, Zeming Li, Xiao Liu, Ruijie Yang, Di Huang

TL;DR

This work addresses the scarcity of 3D data by distilling relational priors from large Vision-Language Models into a 3D backbone. It introduces Multi-modal Relation Distillation (MRD), a framework that models relations within and across the image, text, and 3D modalities and transfers them from the CLIP embedding space to 3D representations through dynamic distillation with learnable weights. MRD uses two main relational representations (favoring normalized similarity with Jeffrey divergence) and three distillation losses to align the 3D space with the pre-aligned image-text embeddings, and it reports state-of-the-art zero-shot classification and cross-modal retrieval results across several benchmarks. The approach scales with larger backbones, and its design choices are supported by ablation studies.

Abstract

Recent advancements in multi-modal pre-training for 3D point clouds have demonstrated promising results by aligning heterogeneous features across 3D shapes and their corresponding 2D images and language descriptions. However, current straightforward solutions often overlook intricate structural relations among samples, potentially limiting the full capabilities of multi-modal learning. To address this issue, we introduce Multi-modal Relation Distillation (MRD), a tri-modal pre-training framework, which is designed to effectively distill reputable large Vision-Language Models (VLM) into 3D backbones. MRD aims to capture both intra-relations within each modality as well as cross-relations between different modalities and produce more discriminative 3D shape representations. Notably, MRD achieves significant improvements in downstream zero-shot classification tasks and cross-modality retrieval tasks, delivering new state-of-the-art performance.

Paper Structure

This paper contains 17 sections, 13 equations, 8 figures, and 8 tables.

Figures (8)

  • Figure 1: Illustration of Multi-modal Relation Distillation (MRD). (a) Conventional contrastive learning focuses on instance-level alignment but disrupts the intra-modality and cross-modality relations established in previous image-text alignment. For example, the nearby and distant relations between the three samples are disturbed in the 3D modality due to naive alignment. (b) MRD distills structural knowledge from both intra- and cross-modality mutual relations, aiming to preserve the semantic relations in the pre-aligned embedding spaces, thereby delivering more discriminative and coherent distributions. Zoom in for better view.
  • Figure 2: Comparison of various relation representation forms as well as corresponding distillation strategies. (a) Different embedding spaces. (b) Euclidean distance-based; (c) normalized similarity-based; and (d) partial order-based.
  • Figure 3: Overall framework of MRD. With the triplet input, image-text pairs are processed by the pre-trained CLIP model, while the accompanying point clouds are encoded by the 3D encoder. MRD captures the intra-modal mutual relations $\psi(\cdot)$ within each modality and the cross-modal mutual relations $\phi(\cdot)$ across each modality pair. It then dynamically distills and transfers structural information from the pre-aligned image-text space of CLIP into the 3D representations.
  • Figure 4: Visualization of the value changes of dynamic weights along with the progress of iterations on ShapeNet and Objaverse.
  • Figure 5: Comparison of model parameters and zero-shot accuracy (%) on Objaverse, where MRD achieves the highest parameter efficiency.
  • ...and 3 more figures
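
The normalized similarity-based relations in Figure 2(c) and the intra-/cross-modal relations $\psi(\cdot)$ and $\phi(\cdot)$ in Figure 3 can be illustrated with a small sketch. It is not the authors' implementation: the function names, the temperature `tau`, the choice of the image branch as teacher for both terms, and the random toy embeddings are all assumptions made here for illustration, and the dynamic learnable weights are omitted. Each relation is taken to be a row-wise softmax over cosine similarities, and student and teacher relations are compared with the Jeffrey divergence, KL(p||q) + KL(q||p).

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def relation(a, b, tau=0.1):
    """Row-wise softmax over cosine similarities between rows of a and rows of b.

    relation(x, x) plays the role of an intra-modal relation psi;
    relation(x, y) plays the role of a cross-modal relation phi.
    tau is a hypothetical temperature, not a value from the paper.
    """
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    rows = []
    for u in a:
        logits = [sum(x * y for x, y in zip(u, v)) / tau for v in b]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        s = sum(exps)
        rows.append([e / s for e in exps])
    return rows


def jeffreys(p, q, eps=1e-12):
    """Jeffrey divergence KL(p||q) + KL(q||p), averaged over rows."""
    total = 0.0
    for p_row, q_row in zip(p, q):
        for pi, qi in zip(p_row, q_row):
            pi, qi = max(pi, eps), max(qi, eps)
            # KL(p||q) + KL(q||p) simplifies to sum (p - q) * (log p - log q)
            total += (pi - qi) * (math.log(pi) - math.log(qi))
    return total / len(p)


def relation_distillation_loss(pc, img, txt, tau=0.1):
    """Assumed pairing: 3D relations are pulled toward the CLIP image-side relations."""
    intra = jeffreys(relation(pc, pc, tau), relation(img, img, tau))
    cross = jeffreys(relation(pc, txt, tau), relation(img, txt, tau))
    return intra, cross


def rand_batch(n, d, rng):
    """Toy stand-in for a batch of n embeddings of dimension d."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
img = rand_batch(4, 8, rng)  # stand-in for frozen CLIP image embeddings
txt = rand_batch(4, 8, rng)  # stand-in for frozen CLIP text embeddings
pc = rand_batch(4, 8, rng)   # stand-in for 3D encoder outputs

intra, cross = relation_distillation_loss(pc, img, txt)
print(f"intra-relation loss: {intra:.4f}, cross-relation loss: {cross:.4f}")
```

Both terms vanish when the 3D embeddings reproduce the teacher's relations exactly, which is the sense in which such a loss preserves the structure of the pre-aligned space rather than only matching individual instances.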