Enhanced Cross-modal 3D Retrieval via Tri-modal Reconstruction

Junlong Ren, Hao Wang

TL;DR

This work tackles bi-directional cross-modal 3D retrieval by bridging 2D, 3D, and text modalities. It proposes a tri-modal framework with dedicated encoders for point clouds, multi-view images, and text, coupled with tri-modal reconstruction to enhance encoder generalization. The approach uses fine-grained 2D-3D fusion via context-query attention and hard negative contrastive training to handle dataset noise, achieving state-of-the-art results on Text2Shape for both shape-to-text and text-to-shape tasks. Overall, the method demonstrates the value of exploiting tri-modal data and 2D-3D consistency to improve geometric and semantic understanding in cross-modal retrieval.

Abstract

Cross-modal 3D retrieval is a critical yet challenging task that aims at bi-directional retrieval between 3D and text modalities. Current methods predominantly rely on a single 3D representation (e.g., point cloud), and few exploit 2D-3D consistency and complementary relationships, which constrains their performance. To bridge this gap, we propose to adopt multi-view images and point clouds to jointly represent 3D shapes, facilitating tri-modal alignment (i.e., image, point, text) for enhanced cross-modal 3D retrieval. Notably, we introduce tri-modal reconstruction to improve the generalization ability of encoders. Given point features, we reconstruct image features under the guidance of text features, and vice versa. With well-aligned point cloud and multi-view image features, we aggregate them into multimodal embeddings through fine-grained 2D-3D fusion to enhance geometric and semantic understanding. Recognizing the significant noise in current datasets, where many 3D shapes and texts share similar semantics, we employ hard negative contrastive training that assigns greater weight to harder negatives, leading to robust discriminative embeddings. Extensive experiments on the Text2Shape dataset demonstrate that our method outperforms previous state-of-the-art methods by a substantial margin in both shape-to-text and text-to-shape retrieval.
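
The abstract does not give the loss formula, so the following is only a minimal sketch of one common way to re-weight hard negatives in a contrastive (InfoNCE-style) loss. The function name, the temperature `tau`, the hardness exponent `beta`, and the weighting scheme itself are assumptions for illustration, not the paper's actual formulation. Each negative receives a weight that grows with its similarity to the query; the weights are rescaled to average 1, so `beta=0` recovers the standard loss.

```python
import math

def hard_negative_contrastive_loss(sim, tau=0.1, beta=1.0):
    """Shape-to-text direction of a re-weighted InfoNCE loss (illustrative only).

    sim[i][j] is the similarity between shape i and text j; matched pairs
    sit on the diagonal. Requires at least two pairs in the batch.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = math.exp(sim[i][i] / tau)
        # Temperature-scaled similarities of all non-matching texts.
        negs = [sim[i][j] / tau for j in range(n) if j != i]
        # Assumed weighting: softmax over beta * similarity, rescaled so
        # the weights average to 1 (beta = 0 gives uniform weights).
        raw = [math.exp(beta * s) for s in negs]
        z = sum(raw)
        weights = [(n - 1) * r / z for r in raw]
        neg_sum = sum(w * math.exp(s) for w, s in zip(weights, negs))
        total += -math.log(pos / (pos + neg_sum))
    return total / n

def symmetric_loss(sim, tau=0.1, beta=1.0):
    """Average of the shape-to-text and text-to-shape directions."""
    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (hard_negative_contrastive_loss(sim, tau, beta)
                  + hard_negative_contrastive_loss(transposed, tau, beta))
```

With `beta > 0`, negatives that are close to the query dominate the denominator, so the loss is at least as large as the unweighted one and the gradient concentrates on the confusable pairs.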

Paper Structure

This paper contains 30 sections, 9 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Comparison with previous methods on shape-to-text and text-to-shape retrieval. We outperform these works by a large margin over all metrics on the Text2Shape dataset (Chen et al., 2019).
  • Figure 2: The overview of our proposed method. It consists of three components: frozen encoders with trainable adapters for three modalities, tri-modal reconstruction, and fine-grained 2D-3D fusion. Each 3D shape is represented as a point cloud and multi-view images to utilize 2D-3D consistency and complementary relationships. Tri-modal reconstruction aims to reconstruct image features with point features under the guidance of text features, and vice versa. Fine-grained 2D-3D fusion aggregates point and image features to holistically represent 3D shapes. Hard negative contrastive training re-weights harder negatives with higher importance to learn and align discriminative embeddings.
  • Figure 3: The pipeline of tri-modal reconstruction. We reconstruct point embeddings using image and text embeddings and simultaneously reconstruct image embeddings with point and text embeddings.
  • Figure 4: Ablation study on view numbers with RR@1.
  • Figure 5: Shape-to-text retrieval results. Each query shape is displayed with the top-5-ranked texts. Ground truths are highlighted in red.
  • ...and 2 more figures
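
The captions for Figures 4 and 5 refer to RR@1 and to top-5-ranked results. As a rough illustration, assuming RR@k denotes the recall rate at k (a query counts as a hit if at least one ground-truth item appears among its k highest-scoring candidates), the metric could be computed as follows. The function name and the data layout are hypothetical.

```python
def recall_rate_at_k(sim, ground_truth, k=1):
    """Fraction of queries with at least one correct item in the top k.

    sim[q][c] is the score of candidate c for query q;
    ground_truth[q] is the set of candidate indices that are correct for q.
    """
    hits = 0
    for q, scores in enumerate(sim):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if any(c in ground_truth[q] for c in ranked[:k]):
            hits += 1
    return hits / len(sim)
```

For shape-to-text retrieval, `ground_truth[q]` would hold every caption paired with shape `q`, which is why a single shape can have several correct texts in its ranked list.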