MM-Mixing: Multi-Modal Mixing Alignment for 3D Understanding

Jiaze Wang; Yi Wang; Ziyu Guo; Renrui Zhang; Donghao Zhou; Guangyong Chen; Anfeng Liu; Pheng-Ann Heng

TL;DR

MM-Mixing proposes a two-stage, multi-modal mixing framework for 3D understanding that aligns 3D point clouds with text and images via feature-level and input-level mixing. Stage 1 trains a 3D Feature Mixing Encoder while keeping the image and text encoders frozen, establishing cross-modal consistency through contrastive learning. Stage 2 introduces mixed point clouds and trains a new 3D encoder to further refine representations, guided by cross-modal losses. The approach yields substantial gains in zero-shot classification, linear probing, and cross-modal retrieval across multiple datasets and backbones (e.g., zero-shot accuracy improves from 51.3% to 61.9% on ScanObjectNN and from 46.8% to 51.4% on Objaverse-LVIS), demonstrating strong generalization and compatibility with existing 3D frameworks. Overall, MM-Mixing offers a straightforward, scalable, and effective path to improving multi-modal alignment and 3D understanding.

Abstract

We introduce MM-Mixing, a multi-modal mixing alignment framework for 3D understanding. MM-Mixing applies mixing-based methods to multi-modal data, preserving and optimizing cross-modal connections while enhancing diversity and improving alignment across modalities. Our proposed two-stage training pipeline combines feature-level and input-level mixing to optimize the 3D encoder. The first stage employs feature-level mixing with contrastive learning to align 3D features with their corresponding modalities. The second stage incorporates both feature-level and input-level mixing, introducing mixed point cloud inputs to further refine 3D feature representations. MM-Mixing enhances intermodality relationships, promotes generalization, and ensures feature consistency while providing diverse and realistic training samples. We demonstrate that MM-Mixing significantly improves baseline performance across various learning scenarios, including zero-shot 3D classification, linear probing 3D classification, and cross-modal 3D shape retrieval. Notably, we improved the zero-shot classification accuracy on ScanObjectNN from 51.3% to 61.9%, and on Objaverse-LVIS from 46.8% to 51.4%. Our findings highlight the potential of multi-modal mixing-based alignment to significantly advance 3D object recognition and understanding while remaining straightforward to implement and integrate into existing frameworks.
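The first-stage idea described above (feature-level mixing combined with contrastive alignment) can be illustrated with a minimal sketch. It assumes a mixup-style convex combination of paired features followed by L2 normalization, and an InfoNCE-style contrastive loss; the abstract does not spell out either the mixing rule or the loss form, so both are assumptions. All function names, the temperature value, and the pairing permutation `perm` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def mix_features(f_a, f_b, lam):
    """Assumed mixing rule: convex combination of two features, re-normalized."""
    return l2_normalize([lam * a + (1.0 - lam) * b for a, b in zip(f_a, f_b)])


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] against all keys."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(queries)


def stage1_loss(feats_3d, feats_img, feats_txt, lam, perm):
    """Mix sample i with sample perm[i] in every modality, then align the
    mixed 3D features with the mixed image and mixed text features."""
    def mix_all(feats):
        return [mix_features(feats[i], feats[perm[i]], lam)
                for i in range(len(feats))]

    mixed_3d = mix_all(feats_3d)
    return (info_nce(mixed_3d, mix_all(feats_img))
            + info_nce(mixed_3d, mix_all(feats_txt)))
```

Using the same `lam` and the same pairing in all three modalities is what keeps the mixed samples in correspondence across modalities in this sketch; in a real setup only the 3D features would carry gradients, since the image and text encoders are frozen.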

Paper Structure (19 sections, 5 equations, 8 figures, 7 tables)

Introduction
Related Works
Method
Problem Definition
Multi-Modal Mixing
MM-Mixing Framework
Experiments
Experimental Setup
Zero-shot 3D Classification
Linear Probing 3D Classification
Ablation Study
Qualitative Analysis
Conclusion
Appendix / supplemental material
Training Details
...and 4 more sections

Figures (8)

Figure 1: Performance comparison with previous methods. MM-Mixing achieves better performance than previous pre-training methods across various datasets with the same Point-BERT backbone. "ModelNet40-ShapeNet" indicates that the model is pre-trained on ShapeNet and evaluated on ModelNet40; the other dataset combinations follow the same convention.
Figure 2: The overall scheme of MM-Mixing. MM-Mixing consists of two stages. In the first stage, the point cloud FM-Encoder is trainable, while the image and text FM-Encoders are pre-trained and frozen. Feature embeddings are extracted for contrastive learning with the 3D features. In the second stage, we initialize a new trainable 3D encoder. All FM-Encoders remain frozen. Two input point clouds are mixed using FPS and point-level mixing, and then fed into the 3D encoder. Then we adopt contrastive learning to align the features of mixed point clouds with mixed feature representations of all three modalities.
Figure 3: Hard sample recognition on ModelNet40. Compared to OpenShape, MM-Mixing enables the model to better capture typical features across different categories and to distinguish hard samples.
Figure 4: Cross-modal 3D shape retrieval on Objaverse. Compared to OpenShape, MM-Mixing enhances the model's understanding of point cloud shapes, image colors, and textual descriptions, effectively improving cross-modal 3D shape retrieval capabilities. PC represents Point Cloud.
Figure 5: Zero-shot 3D classification qualitative results on ModelNet40. Compared to OpenShape, TripletMix not only predicts the correct top category, but also obtains higher similarity scores.
...and 3 more figures
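
The Figure 2 caption mentions that two input point clouds are mixed using FPS (farthest point sampling) and point-level mixing. A minimal sketch of one way this could work is given here, assuming that a fraction `lam` of the output points is FPS-sampled from the first cloud and the remainder from the second, and that the two subsets are concatenated. The exact mixing rule is not given in this summary, so the function names, the `lam` parameter, and the equal-size requirement are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_sampling(points, k, start=0):
    """Greedily pick k indices, each farthest from the already-picked set."""
    n = len(points)
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= len(points)")
    chosen = [start]
    # min_d2[i] = squared distance from point i to its nearest chosen point
    min_d2 = [sq_dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(n), key=min_d2.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            d = sq_dist(p, points[nxt])
            if d < min_d2[i]:
                min_d2[i] = d
    return chosen


def mix_point_clouds(cloud_a, cloud_b, lam):
    """Assumed point-level mixing: keep round(lam * N) FPS points from
    cloud_a and fill the remaining slots with FPS points from cloud_b."""
    if len(cloud_a) != len(cloud_b):
        raise ValueError("this sketch assumes clouds of equal size")
    n = len(cloud_a)
    k_a = min(max(round(lam * n), 0), n)
    k_b = n - k_a
    part_a = [cloud_a[i] for i in farthest_point_sampling(cloud_a, k_a)] if k_a else []
    part_b = [cloud_b[i] for i in farthest_point_sampling(cloud_b, k_b)] if k_b else []
    return part_a + part_b
```

Sampling each subset with FPS rather than uniformly at random keeps both partial shapes spatially well covered, so the mixed cloud still reflects the overall geometry of each source object.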
