MM-Mixing: Multi-Modal Mixing Alignment for 3D Understanding

Jiaze Wang, Yi Wang, Ziyu Guo, Renrui Zhang, Donghao Zhou, Guangyong Chen, Anfeng Liu, Pheng-Ann Heng

TL;DR

MM-Mixing proposes a two-stage, multi-modal mixing framework for 3D understanding that aligns 3D point clouds with text and images via feature-level and input-level mixing. Stage 1 trains a 3D Feature Mixing Encoder while keeping image/text encoders frozen to establish cross-modal consistency through contrastive learning; Stage 2 introduces mixed point clouds and trains a new 3D encoder to further refine representations, guided by cross-modal losses. The approach yields substantial gains in zero-shot classification, linear probing, and cross-modal retrieval across multiple datasets and backbones (e.g., ScanObjectNN improves from 51.3% to 61.9%, Objaverse-LVIS from 46.8% to 51.4%), demonstrating strong generalization and compatibility with existing 3D frameworks. Overall, MM-Mixing offers a straightforward, scalable, and effective path to improving multi-modal alignment and 3D understanding.

Abstract

We introduce MM-Mixing, a multi-modal mixing alignment framework for 3D understanding. MM-Mixing applies mixing-based methods to multi-modal data, preserving and optimizing cross-modal connections while enhancing diversity and improving alignment across modalities. Our proposed two-stage training pipeline combines feature-level and input-level mixing to optimize the 3D encoder. The first stage employs feature-level mixing with contrastive learning to align 3D features with their corresponding modalities. The second stage incorporates both feature-level and input-level mixing, introducing mixed point cloud inputs to further refine 3D feature representations. MM-Mixing enhances intermodality relationships, promotes generalization, and ensures feature consistency while providing diverse and realistic training samples. We demonstrate that MM-Mixing significantly improves baseline performance across various learning scenarios, including zero-shot 3D classification, linear probing 3D classification, and cross-modal 3D shape retrieval. Notably, we improved the zero-shot classification accuracy on ScanObjectNN from 51.3% to 61.9%, and on Objaverse-LVIS from 46.8% to 51.4%. Our findings highlight the potential of multi-modal mixing-based alignment to significantly advance 3D object recognition and understanding while remaining straightforward to implement and integrate into existing frameworks.
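To make the first-stage idea concrete, the following is a minimal NumPy sketch of feature-level mixing followed by a contrastive objective. It is not the authors' implementation: the convex-combination mixing rule, the coefficient `lam`, the symmetric InfoNCE loss, the temperature, and all function names are assumptions chosen for illustration, and random arrays stand in for encoder outputs.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def mix_features(f_a, f_b, lam):
    """Convex combination of two feature batches (mixup-style; an assumption)."""
    return lam * f_a + (1.0 - lam) * f_b

def info_nce(query, key, temperature=0.07):
    """Symmetric InfoNCE loss: row i of `query` is the positive for row i of `key`."""
    logits = l2_normalize(query) @ l2_normalize(key).T / temperature

    def cross_entropy_on_diagonal(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

# Random stand-ins for 3D, image, and text embeddings (hypothetical shapes).
rng = np.random.default_rng(0)
batch, dim = 8, 32
pc_feat = rng.normal(size=(batch, dim))
img_feat = rng.normal(size=(batch, dim))
txt_feat = rng.normal(size=(batch, dim))

# Pair every sample with a partner from the same batch and mix all three
# modalities with the same partner and coefficient, so the mixed triplets
# still correspond to one another.
perm = rng.permutation(batch)
lam = 0.6
mixed_pc = mix_features(pc_feat, pc_feat[perm], lam)
mixed_img = mix_features(img_feat, img_feat[perm], lam)
mixed_txt = mix_features(txt_feat, txt_feat[perm], lam)

# Align mixed 3D features with the mixed image and text features.
loss = info_nce(mixed_pc, mixed_img) + info_nce(mixed_pc, mixed_txt)
print(f"contrastive loss on mixed features: {loss:.4f}")
```

Using the same permutation and coefficient for every modality is the design choice that keeps the cross-modal correspondence intact after mixing; in an actual training loop only the 3D branch would receive gradients, since the image and text encoders are frozen.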

Paper Structure

This paper contains 19 sections, 5 equations, 8 figures, and 7 tables.

Figures (8)

  • Figure 1: Performance comparison with previous methods. MM-Mixing achieves better performance than previous pre-training methods across various datasets with the same backbone, Point-BERT. "ModelNet40-ShapeNet" indicates that the model is pretrained on ShapeNet and evaluated on ModelNet40; the other dataset combinations follow the same convention.
  • Figure 2: The overall scheme of MM-Mixing. MM-Mixing consists of two stages. In the first stage, the point cloud FM-Encoder is trainable, while the image and text FM-Encoders are pre-trained and frozen. Feature embeddings are extracted for contrastive learning with the 3D features. In the second stage, we initialize a new trainable 3D encoder, and all FM-Encoders remain frozen. Two input point clouds are mixed using farthest point sampling (FPS) and point-level mixing, and then fed into the 3D encoder. Contrastive learning then aligns the features of the mixed point clouds with the mixed feature representations of all three modalities.
  • Figure 3: Hard sample recognition on ModelNet40. Compared to OpenShape, MM-Mixing enables the model to better capture typical features across different categories and improves its ability to distinguish hard samples.
  • Figure 4: Cross-modal 3D shape retrieval on Objaverse. Compared to OpenShape, MM-Mixing enhances the model's understanding of point cloud shapes, image colors, and textual descriptions, effectively improving cross-modal 3D shape retrieval capabilities. PC represents Point Cloud.
  • Figure 5: Zero-shot 3D classification qualitative results on ModelNet40. Compared to OpenShape, TripletMix not only provides the correct top category, but also obtains higher similarity scores.
  • ...and 3 more figures
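
The second-stage input mixing described in the Figure 2 caption (two point clouds mixed using FPS and point-level mixing) can be sketched as follows. This is an illustrative reading, not the paper's exact procedure: the ratio `lam`, the choice to subsample each cloud with farthest point sampling before concatenating, and the function names are all assumptions.

```python
import numpy as np

def farthest_point_sampling(points, n_samples, start=0):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen."""
    n = points.shape[0]
    selected = np.empty(n_samples, dtype=np.int64)
    min_dist = np.full(n, np.inf)  # squared distance to the nearest chosen point
    current = start
    for i in range(n_samples):
        selected[i] = current
        dist = np.sum((points - points[current]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, dist)
        current = int(np.argmax(min_dist))
    return selected

def mix_point_clouds(pc_a, pc_b, lam):
    """Keep about lam * N points of pc_a and the remainder from pc_b (assumed rule)."""
    n = pc_a.shape[0]
    n_a = int(round(lam * n))
    n_b = n - n_a
    idx_a = farthest_point_sampling(pc_a, n_a)
    idx_b = farthest_point_sampling(pc_b, n_b)
    return np.concatenate([pc_a[idx_a], pc_b[idx_b]], axis=0)

# Two synthetic clouds; the second is shifted far away so its points are
# easy to tell apart from the first cloud's points.
rng = np.random.default_rng(0)
cloud_a = rng.normal(size=(256, 3))
cloud_b = rng.normal(size=(256, 3)) + 100.0

mixed_cloud = mix_point_clouds(cloud_a, cloud_b, lam=0.5)
print(mixed_cloud.shape)
```

FPS keeps each subsampled cloud spread over the whole shape, so the mixed input retains the coarse geometry of both objects while the total point count stays fixed for the 3D encoder.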