
Contrastive masked auto-encoders based self-supervised hashing for 2D image and 3D point cloud cross-modal retrieval

Rukai Wei, Heng Cui, Yu Liu, Yufeng Hou, Yanzhao Xie, Ke Zhou

TL;DR

This work addresses cross-modal retrieval between 2D images and 3D point-cloud data by introducing CMAH, a self-supervised hashing framework. CMAH uses contrastive learning to align the two modalities in a joint Hamming space, and masked auto-encoders with a multi-modal fusion block to capture both global semantics and local 2D–3D cues. It is trained with the combined loss $L_{overall}=L_c+L_r$, where the contrastive term $L_c$ covers full and masked cross-modal pairs and the reconstruction term $L_r$ recovers masked content via $L_{3D}$ and $L_{2D}$. The method achieves state-of-the-art performance on ShapeNetRender, ModelNet, and ShapeNet-55 across multiple code lengths, demonstrating strong modality-gap reduction and robust local perception. The resulting compact binary representations $b^P,b^I\in\{-1,1\}^K$ make the approach attractive for scalable multimedia retrieval systems that need accurate cross-modal results.
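
To make the role of the binary codes $b^P,b^I\in\{-1,1\}^K$ concrete, the following is a minimal sketch of retrieval in Hamming space. It relies only on the standard identity $d_H(b_1,b_2)=(K-\langle b_1,b_2\rangle)/2$ for codes in $\{-1,1\}^K$. The function names (`binarize`, `hamming_distance`, `retrieve`) and the sign-based binarization are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def binarize(features):
    """Map real-valued features to codes in {-1, +1}^K (assumption: zeros map to +1)."""
    return np.where(features >= 0, 1, -1).astype(np.int8)

def hamming_distance(query_code, database_codes):
    """Hamming distance via the inner product: d_H = (K - <b_q, b_d>) / 2."""
    k = query_code.shape[-1]
    inner = database_codes.astype(np.int32) @ query_code.astype(np.int32)
    return (k - inner) // 2

def retrieve(query_code, database_codes, top_k=5):
    """Return indices of the top_k database codes closest to the query."""
    distances = hamming_distance(query_code, database_codes)
    # A stable sort keeps database order among equally distant codes.
    return np.argsort(distances, kind="stable")[:top_k]
```

In a cross-modal setting, `query_code` would come from one modality (for example an image) and `database_codes` from the other (point clouds); because both live in the same Hamming space, the same distance applies in either direction.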

Abstract

Implementing cross-modal hashing between 2D images and 3D point-cloud data is a growing concern in real-world retrieval systems. Simply applying existing cross-modal approaches to this new task neither adequately captures latent multi-modal semantics nor effectively bridges the modality gap between 2D and 3D. To address these issues without relying on hand-crafted labels, we propose contrastive masked auto-encoders based self-supervised hashing (CMAH) for retrieval between images and point-cloud data. We start by contrasting 2D-3D pairs and explicitly constraining them into a joint Hamming space. This contrastive learning process ensures robust discriminability for the generated hash codes and effectively reduces the modality gap. Moreover, we utilize multi-modal auto-encoders to enhance the model's understanding of multi-modal semantics. By completing the masked image/point-cloud data modeling task, the model is encouraged to capture more localized cues. In addition, the proposed multi-modal fusion block facilitates fine-grained interactions among different modalities. Extensive experiments on three public datasets demonstrate that the proposed CMAH significantly outperforms all baseline methods.
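
The abstract does not spell out the form of the contrastive objective, so the following is only a sketch of how contrasting 2D-3D pairs is commonly done: a symmetric InfoNCE loss in which row $i$ of the image batch and row $i$ of the point-cloud batch form the positive pair and all other rows act as negatives. The choice of InfoNCE, the cosine similarity, and the `temperature` value are assumptions and may differ from the $L_c$ actually used in CMAH.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def symmetric_info_nce(image_emb, point_emb, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired image / point-cloud embeddings."""
    img = l2_normalize(image_emb)
    pts = l2_normalize(point_emb)
    logits = img @ pts.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])             # positives lie on the diagonal
    loss_i2p = -log_softmax(logits)[idx, idx].mean()     # image -> point cloud
    loss_p2i = -log_softmax(logits.T)[idx, idx].mean()   # point cloud -> image
    return 0.5 * (loss_i2p + loss_p2i)
```

Minimizing such a loss pulls paired embeddings together and pushes unpaired ones apart, which is the mechanism by which a contrastive objective can reduce the modality gap before the embeddings are binarized.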

Paper Structure

This paper contains 14 sections, 9 equations, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: The overall framework of CMAH. On the one hand, both the visible tokens of the masked input and the full tokens are encoded and projected to hash codes, and contrastive learning is conducted on both full-full and mask-full pairs. On the other hand, the encoded tokens are fed into a multi-modal fusion block for fine-grained multi-modal interaction and then passed to their respective decoders for 2D-pixel and 3D-coordinate reconstruction.
  • Figure 2: The Precision@Top-K curves on different datasets at 64 bits.
  • Figure 3: The t-SNE visualization of 64-bit hash codes before and after training on the ShapeNetRender dataset.
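
The masked-modeling branch described in the Figure 1 caption can be sketched as follows: a random subset of tokens is hidden, only the visible tokens are encoded, and the reconstruction error is measured on the hidden positions. This is a generic masked auto-encoder sketch under stated assumptions; the mask ratio, the helper names (`random_mask`, `masked_mse`), and the use of mean squared error are illustrative, and the paper's $L_{2D}$ and $L_{3D}$ may be defined differently.

```python
import numpy as np

def random_mask(tokens, mask_ratio, rng):
    """Hide a random fraction of tokens; return the visible tokens and a boolean mask."""
    num_tokens = tokens.shape[0]
    num_masked = int(round(num_tokens * mask_ratio))
    perm = rng.permutation(num_tokens)
    mask = np.zeros(num_tokens, dtype=bool)
    mask[perm[:num_masked]] = True               # True marks a hidden token
    return tokens[~mask], mask

def masked_mse(prediction, target, mask):
    """Mean squared error restricted to the hidden tokens (assumed loss form)."""
    if not mask.any():
        return 0.0
    diff = prediction[mask] - target[mask]
    return float(np.mean(diff ** 2))
```

Here `tokens` would be image patches (rows of pixel values) or point-cloud groups (rows of coordinates), and `prediction` would be the output of the corresponding decoder. The visible tokens returned by `random_mask` are also what a mask-full contrastive pair would encode on the masked side.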