Enhanced Cross-modal 3D Retrieval via Tri-modal Reconstruction

Junlong Ren; Hao Wang

TL;DR

This work tackles bi-directional cross-modal 3D retrieval by bridging 2D, 3D, and text modalities. It proposes a tri-modal framework with dedicated encoders for point clouds, multi-view images, and text, coupled with tri-modal reconstruction to enhance encoder generalization. The approach uses fine-grained 2D-3D fusion via context-query attention and hard negative contrastive training to handle dataset noise, achieving state-of-the-art results on Text2Shape for both shape-to-text and text-to-shape tasks. Overall, the method demonstrates the value of exploiting tri-modal data and 2D-3D consistency to improve geometric and semantic understanding in cross-modal retrieval.

Abstract

Cross-modal 3D retrieval is a critical yet challenging task that aims at bi-directional retrieval between 3D and text modalities. Current methods predominantly rely on a single 3D representation (e.g., point cloud), and few exploit 2D-3D consistency and complementarity, which constrains their performance. To bridge this gap, we propose to adopt multi-view images and point clouds to jointly represent 3D shapes, facilitating tri-modal alignment (i.e., image, point, text) for enhanced cross-modal 3D retrieval. Notably, we introduce tri-modal reconstruction to improve the generalization ability of encoders. Given point features, we reconstruct image features under the guidance of text features, and vice versa. With well-aligned point cloud and multi-view image features, we aggregate them into multimodal embeddings through fine-grained 2D-3D fusion to enhance geometric and semantic understanding. Recognizing the significant noise in current datasets, where many 3D shapes and texts share similar semantics, we employ hard negative contrastive training that assigns greater weight to harder negatives, yielding robust, discriminative embeddings. Extensive experiments on the Text2Shape dataset demonstrate that our method outperforms previous state-of-the-art methods by a substantial margin in both shape-to-text and text-to-shape retrieval tasks.
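
To make the idea of hard negative contrastive training concrete, the following is a minimal sketch of one common way to reweight negatives in an InfoNCE-style loss. It is not taken from the paper: the function names, the temperature `tau`, the hardness exponent `beta`, and the specific weighting scheme (importance weights proportional to `exp(beta * similarity)`, normalised to average 1) are all assumptions chosen for illustration, and the authors' actual loss may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hard_negative_contrastive_loss(shape_emb, text_emb, tau=0.1, beta=1.0):
    """Shape-to-text contrastive loss with reweighted negatives.

    Hypothetical helper, not the paper's loss. shape_emb[i] and
    text_emb[i] are assumed to be a matched pair. beta = 0 recovers the
    standard InfoNCE loss; a larger beta puts more weight on negatives
    that are similar to the anchor (hard negatives).
    """
    n = len(shape_emb)
    total = 0.0
    for i in range(n):
        sims = [cosine(shape_emb[i], text_emb[j]) for j in range(n)]
        neg = [j for j in range(n) if j != i]
        # Importance weights over the negatives, normalised to average 1.
        raw = [math.exp(beta * sims[j]) for j in neg]
        z = sum(raw)
        weights = [(n - 1) * r / z for r in raw]
        pos_term = math.exp(sims[i] / tau)
        neg_term = sum(w * math.exp(sims[j] / tau) for w, j in zip(weights, neg))
        total += -math.log(pos_term / (pos_term + neg_term))
    return total / n
```

Under these assumptions, a bi-directional objective could average the two directions, for example `0.5 * (hard_negative_contrastive_loss(shapes, texts) + hard_negative_contrastive_loss(texts, shapes))`, so that both shape-to-text and text-to-shape retrieval are trained.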
