Multi-modal Relation Distillation for Unified 3D Representation Learning
Huiqun Wang, Yiping Bao, Panwang Pan, Zeming Li, Xiao Liu, Ruijie Yang, Di Huang
TL;DR
This work addresses the scarcity of 3D data by distilling the relational priors of large Vision-Language Models into a 3D backbone. It introduces Multi-modal Relation Distillation (MRD), a framework that models relational structure within and across the image, text, and 3D modalities and transfers this knowledge from the CLIP embedding space to 3D representations through dynamic distillation with learnable weights. MRD considers two relational representations (favoring normalized similarity paired with the Jeffreys divergence) and uses three distillation losses to align the 3D space with the pre-aligned image-text embeddings, achieving state-of-the-art zero-shot classification and cross-modal retrieval results on diverse benchmarks. The approach scales with larger backbones, is supported by ablation studies, and points toward unified multi-modal 3D representation learning for 3D understanding tasks.
Abstract
Recent advances in multi-modal pre-training for 3D point clouds have demonstrated promising results by aligning heterogeneous features across 3D shapes and their corresponding 2D images and language descriptions. However, existing straightforward alignment approaches often overlook the intricate structural relations among samples, which may limit the full capabilities of multi-modal learning. To address this issue, we introduce Multi-modal Relation Distillation (MRD), a tri-modal pre-training framework designed to effectively distill well-established large Vision-Language Models (VLMs) into 3D backbones. MRD aims to capture both intra-relations within each modality and cross-relations between different modalities, and thereby to produce more discriminative 3D shape representations. Notably, MRD achieves significant improvements in downstream zero-shot classification and cross-modality retrieval tasks, delivering new state-of-the-art performance.
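
To make the idea of relation distillation concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that a "relation" is a row-wise softmax over cosine similarities within a batch and that student (3D) relations are matched to teacher (image) relations with the Jeffreys divergence, i.e. the symmetric KL divergence. The function names, the temperature of 0.1, the batch size, the feature dimension, and the fixed loss weights are all hypothetical; random vectors stand in for real CLIP and 3D backbone features, and the learnable weighting mentioned in the TL;DR is replaced by constants.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def relation_matrix(a, b, temperature=0.1):
    """Normalized similarity: row-wise softmax over cosine similarities.

    Row i is a probability distribution describing how sample i of `a`
    relates to every sample of `b`. The temperature is an assumed value.
    """
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)


def jeffreys_divergence(p, q, eps=1e-12):
    """Jeffreys divergence KL(p||q) + KL(q||p), averaged over rows."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.mean(np.sum((p - q) * (np.log(p) - np.log(q)), axis=-1)))


# Random stand-ins for a batch of paired features (hypothetical shapes).
rng = np.random.default_rng(0)
batch, dim = 8, 32
img_feat = rng.normal(size=(batch, dim))  # teacher: frozen image embeddings
txt_feat = rng.normal(size=(batch, dim))  # teacher: frozen text embeddings
pc_feat = rng.normal(size=(batch, dim))   # student: 3D backbone output

# Intra-modal relation: 3D-to-3D structure should mimic image-to-image structure.
intra_loss = jeffreys_divergence(
    relation_matrix(pc_feat, pc_feat), relation_matrix(img_feat, img_feat)
)

# Cross-modal relation: 3D-to-text structure should mimic image-to-text structure.
cross_loss = jeffreys_divergence(
    relation_matrix(pc_feat, txt_feat), relation_matrix(img_feat, txt_feat)
)

# Fixed placeholder weights; the paper describes these as learnable.
w_intra, w_cross = 1.0, 1.0
total_loss = w_intra * intra_loss + w_cross * cross_loss
print(f"intra={intra_loss:.4f} cross={cross_loss:.4f} total={total_loss:.4f}")
```

In an actual training loop only the 3D features would receive gradients, and a term of this kind would typically be combined with a standard contrastive alignment objective; both of those details are outside what this sketch shows.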
