COM3D: Leveraging Cross-View Correspondence and Cross-Modal Mining for 3D Retrieval

Hao Wu, Ruochong LI, Hao Wang, Hui Xiong

TL;DR

This work tackles text-to-3D retrieval by enriching 3D shape representations with cross-view information from a Scene Representation Transformer (SRT) and aligning them with text in a shared embedding space. A cross-modal joint encoder combines PointNet++-based shape features, SRT-derived multi-view features, and a Bi-GRU text encoder. The retrieval objective blends Earth Mover’s Distance with cosine similarity and uses semi-hard negative mining to improve learning efficiency and discriminative power. Experiments on the Text2Shape dataset show state-of-the-art performance, supporting the value of cross-view correspondence and cross-modal mining for 3D retrieval; mesh-based representations are noted as a possible future extension.

Abstract

In this paper, we investigate an open research task of cross-modal retrieval between 3D shapes and textual descriptions. Previous approaches mainly rely on point cloud encoders for feature extraction, which may ignore key inherent features of 3D shapes, including depth, spatial hierarchy, geometric continuity, etc. To address this issue, we propose COM3D, making the first attempt to exploit the cross-view correspondence and cross-modal mining to enhance the retrieval performance. Notably, we augment the 3D features through a scene representation transformer, to generate cross-view correspondence features of 3D shapes, which enrich the inherent features and enhance their compatibility with text matching. Furthermore, we propose to optimize the cross-modal matching process based on the semi-hard negative example mining method, in an attempt to improve the learning efficiency. Extensive quantitative and qualitative experiments demonstrate the superiority of our proposed COM3D, achieving state-of-the-art results on the Text2Shape dataset.
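
The semi-hard negative mining mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common triplet-style definition (a negative is semi-hard when it scores below the matching pair but within a margin of it) and uses cosine similarity alone; the paper's actual objective, which also involves Earth Mover's Distance, is not reproduced here. The function names, the `margin` value, and the toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semi_hard_triplet_loss(text_embs, shape_embs, margin=0.2):
    """Mean hinge loss over semi-hard negatives (illustrative only).

    Assumes text_embs[i] is the description of shape_embs[i]; every other
    shape in the batch is treated as a negative for that text.
    """
    losses = []
    for i, text in enumerate(text_embs):
        pos = cosine(text, shape_embs[i])
        neg_scores = [
            cosine(text, shape)
            for j, shape in enumerate(shape_embs)
            if j != i
        ]
        # Semi-hard: less similar than the positive, but inside the margin.
        semi_hard = [s for s in neg_scores if pos - margin < s < pos]
        if not semi_hard:
            continue  # only easy or too-hard negatives: skip this anchor
        hardest = max(semi_hard)
        losses.append(margin - pos + hardest)
    return sum(losses) / len(losses) if losses else 0.0


# Toy batch: two text embeddings and their paired shape embeddings.
texts = [[1.0, 0.0], [0.0, 1.0]]
shapes = [[1.0, 0.1], [0.9, 0.5]]
loss = semi_hard_triplet_loss(texts, shapes, margin=0.2)
```

In this toy batch only the first text has a semi-hard negative; the second text's only negative is already far outside the margin, so it contributes nothing to the loss.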

Paper Structure

This paper contains 27 sections, 7 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: An overview of our proposed COM3D. It is composed of three main embeddings: multi-view embedding, shape embedding, and text embedding. The multi-view embedding is derived from multi-view images using a Scene Representation Transformer (SRT) encoder, taking the corresponding camera poses and rays into account; the shape embedding is extracted by PointNet++. Semi-hard negative mining improves the matching of text and 3D shapes by focusing on moderately challenging samples within the semi-hard range, marked by orange lines, with the anchor depicted by red borders.
  • Figure 2: Text-to-3D retrieval results by our COM3D. For each query sentence, we show the top-5 ranked shapes.
  • Figure 3: Text-to-3D retrieval results by our COM3D. For each query sentence, we show the top-5 ranked shapes.
  • Figure 4: 3D-to-text retrieval results by our COM3D. For each query shape, we show the top-5 ranked sentences.
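
The top-5 rankings described in the retrieval figures amount to sorting gallery items by their similarity to a query embedding. A minimal sketch follows, assuming plain cosine similarity in a shared embedding space; the paper's full similarity measure is not reproduced, and `top_k` and the toy vectors are hypothetical names and data used only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query, gallery, k=5):
    """Return indices of the k gallery embeddings most similar to the query."""
    scored = [(cosine(query, item), idx) for idx, item in enumerate(gallery)]
    # Highest similarity first; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy example: one text query embedding against three shape embeddings.
gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.1]
ranked = top_k(query, gallery, k=2)
```

The same function covers both directions: for text-to-3D the query is a text embedding and the gallery holds shapes, and for 3D-to-text the roles are swapped.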