JM3D & JM3D-LLM: Elevating 3D Understanding with Joint Multi-modal Cues

Jiayi Ji; Haowei Wang; Changli Wu; Yiwei Ma; Xiaoshuai Sun; Rongrong Ji

TL;DR

The paper tackles two problems in 3D understanding: the scarcity of comprehensive 3D data and the information lost when 2D alignment strategies are transferred to 3D. It introduces the Structured Multimodal Organizer (SMO), which enriches visual and textual cues via a Continuous Image Sequence (CIS) and a Hierarchical Text Tree (HTT), and Joint Multi-modal Alignment (JMA), which fuses vision and language representations into a unified 3D-aware space. The authors further extend the framework with JM3D-LLM, which integrates 3D semantics into large language models through fine-tuning to support detailed captions and reasoning. Empirically, JM3D achieves state-of-the-art zero-shot 3D classification on ModelNet40 and ScanObjectNN, while JM3D-LLM demonstrates enhanced descriptive and cross-modal reasoning capabilities, supporting the practical value of joint multimodal transfer for 3D understanding.

Abstract

3D understanding is increasingly important in computer vision, autonomous driving, and robotics. However, the prevailing approach of directly transferring 2D alignment strategies to the 3D domain encounters three distinct challenges: (1) Information Degradation: 3D data is aligned with only single-view 2D images and generic texts, neglecting the need for multi-view images and detailed subcategory texts. (2) Insufficient Synergy: these strategies align 3D representations to image and text features individually, hampering the overall optimization of 3D models. (3) Underutilization: the fine-grained information inherent in the learned representations is often not fully exploited, indicating a potential loss of detail. To address these issues, we introduce JM3D, a comprehensive approach integrating point cloud, text, and image. Key contributions include the Structured Multimodal Organizer (SMO), which enriches the vision-language representation with multiple views and hierarchical text, and the Joint Multi-modal Alignment (JMA), which combines language understanding with visual representation. Our advanced model, JM3D-LLM, marries 3D representation with large language models via efficient fine-tuning. Evaluations on ModelNet40 and ScanObjectNN establish JM3D's superiority. The superior performance of JM3D-LLM further underscores the effectiveness of our representation transfer approach. Our code and models are available at https://github.com/Mr-Neko/JM3D.
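The zero-shot 3D classification evaluation mentioned above can be illustrated with a minimal sketch. It assumes that a trained 3D encoder and a frozen text encoder have already produced embeddings in a shared space; the function names, the toy vectors, and the use of cosine similarity are illustrative assumptions, not taken from the JM3D code base.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def zero_shot_classify(point_features, class_text_features):
    """Assign each point-cloud embedding to the class whose text embedding
    is most similar (hypothetical helper, not the authors' implementation)."""
    p = l2_normalize(point_features)        # (num_objects, dim)
    t = l2_normalize(class_text_features)   # (num_classes, dim)
    similarity = p @ t.T                    # (num_objects, num_classes)
    return similarity.argmax(axis=1)

# Toy embeddings standing in for encoder outputs (assumed values).
class_text_features = np.array([[1.0, 0.0, 0.0],
                                [0.0, 1.0, 0.0],
                                [0.0, 0.0, 1.0]])
point_features = np.array([[0.9, 0.1, 0.0],
                           [0.1, 0.2, 2.0]])
predictions = zero_shot_classify(point_features, class_text_features)
```

No class-specific training is involved in this step: the class set is defined purely by the text embeddings, which is what makes the classification zero-shot.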

Paper Structure (38 sections, 14 equations, 4 figures, 10 tables)

Introduction
Related work
Representation Learning in 3D Space
Representation Learning in Multi-modal Space
Enhancing 3D Representation through Multi-modality
Large Language Model
Joint Multi-modal 3D Representation Learning
Preliminary
Structured Multi-modal Organizer
Continuous Image Sequence
Hierarchical Text Tree
Joint Multi-modal Alignment
Training Objective
Integrating JM3D with Large Language Model
Instruct Conversations for Point Querying
...and 23 more sections

Figures (4)

Figure 1: The visualization of JM3D. JM3D coherently aligns the 3D modality with previously aligned vision and language modalities, forming a consolidated tri-modal representation. The derived representation is then applied to tasks such as image-3D retrieval and zero-shot 3D classification, and interfaces with an LLM to discern more granular information.
Figure 2: The framework of JM3D. On the left, Continuous Image Sequence (CIS) and Hierarchical Text Tree (HTT) organize continuous multi-view images and hierarchical texts, respectively, which are fed into a frozen pre-trained model to extract features. Joint Multi-modal Alignment (JMA) then incorporates the features from the two modalities to generate the joint modeling features. Finally, contrastive learning is applied to align the 3D features (being trained) with the joint features and subcategory texts, while the 3D features are aggregated with the assistance of the parent category.
Figure 3: The framework of JM3D-LLM. We take the LLM as the cornerstone to support further semantic understanding tasks, such as fine-grained 3D model captioning.
Figure 4: Qualitative results of real-image-to-point-cloud retrieval. Given an image, we show the top-3 point cloud retrieval results from ModelNet40. All models perform well on the simple samples (the 1st and 3rd rows). However, on the challenging samples (the 2nd and 4th rows), JM3D retrieves more accurately than the previous state of the art (ULIP). JM3D trained with 4-view images performs better than the variant trained with 2-view images, benefiting from the more solid bias of the vision modality.
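The contrastive alignment step described in the Figure 2 caption can be sketched as a symmetric InfoNCE-style loss between 3D features and the joint vision-language features. This is a generic stand-in under stated assumptions: the exact objective, the temperature value, and how JMA builds the joint features are not given on this page, so `symmetric_contrastive_loss` and its `temperature` default are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(point_features, target_features, temperature=0.07):
    """Generic symmetric contrastive loss (assumed form, not the paper's
    exact objective). Row i of each input is treated as a matching pair;
    every other row in the batch acts as a negative."""
    p = l2_normalize(point_features)
    t = l2_normalize(target_features)
    logits = (p @ t.T) / temperature            # (batch, batch)
    idx = np.arange(len(p))
    loss_p2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # 3D -> target
    loss_t2p = -log_softmax(logits, axis=0)[idx, idx].mean()  # target -> 3D
    return 0.5 * (loss_p2t + loss_t2p)

# Toy check with random stand-in features (assumed data).
rng = np.random.default_rng(0)
joint = rng.normal(size=(4, 8))
aligned_loss = symmetric_contrastive_loss(joint, joint)          # correct pairing
shuffled_loss = symmetric_contrastive_loss(joint, joint[::-1])   # wrong pairing
```

In this toy check the correctly paired batch yields a lower loss than the mispaired one, which is the behavior the alignment objective relies on. In the framework described by the caption, the target side would hold the joint features or subcategory text features from the frozen encoders, and only the 3D encoder would receive gradients.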
