3D-MoE: A Mixture-of-Experts Multi-modal LLM for 3D Vision and Pose Diffusion via Rectified Flow

Yueen Ma, Yuzheng Zhuang, Jianye Hao, Irwin King

TL;DR

This work tackles 3D vision and spatial reasoning by converting pretrained dense LLMs into mixture-of-experts (MoE) models to form a 3D multimodal LLM, 3D-MoE. It couples a 3D vision encoder with an MoE LLM via a two-stage training scheme and transfers FFN weights into MoE experts to retain pretrained knowledge, followed by LoRA fine-tuning. A Pose-DiT diffusion head with a rectified flow scheduler enables efficient 6D pose prediction for embodied tasks, achieving faster inference. Experiments on 3D question answering and embodied task planning show improved performance with fewer activated parameters compared to larger 7B-scale baselines, highlighting the method’s efficiency and effectiveness in 3D reasoning and planning.
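The weight-transfer step mentioned above can be illustrated with a minimal sketch. It assumes the simplest form of the idea: every expert starts as an independent copy of the dense FFN, and a top-k router mixes the selected experts with softmax gates. The function names (`ffn`, `upcycle`, `moe_forward`), the ReLU activation, the toy dimensions, and the random router are all hypothetical choices for illustration, not details taken from the paper.

```python
import copy
import math
import random


def ffn(x, w1, w2):
    """Dense two-layer feed-forward block: w2 @ relu(w1 @ x)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def upcycle(w1, w2, num_experts):
    """Initialise each expert as an independent copy of the dense FFN weights."""
    return [(copy.deepcopy(w1), copy.deepcopy(w2)) for _ in range(num_experts)]


def moe_forward(x, experts, router, top_k):
    """Send x to the top_k experts and mix their outputs with softmax gates."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    peak = max(logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    out = [0.0] * len(experts[0][1])
    for i, e in zip(chosen, exps):
        y = ffn(x, *experts[i])
        out = [o + (e / total) * yi for o, yi in zip(out, y)]
    return out


# Toy sizes and random weights (assumptions for the demo only).
random.seed(0)
d_model, d_hidden, num_experts = 4, 8, 4
w1 = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_hidden)]
w2 = [[random.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_model)]
router = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(num_experts)]
x = [random.gauss(0, 1) for _ in range(d_model)]

experts = upcycle(w1, w2, num_experts)
dense_out = ffn(x, w1, w2)
moe_out = moe_forward(x, experts, router, top_k=2)

# Largest gap between the dense output and the freshly upcycled MoE output.
print(max(abs(a - b) for a, b in zip(dense_out, moe_out)))
```

Because the experts are identical at initialisation and the gates sum to one, the converted layer reproduces the dense FFN output up to floating-point error, which is one way to see why such a conversion can retain pretrained knowledge. Only `top_k` of the experts run per token, so the activated parameter count stays below the total.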

Abstract

3D vision and spatial reasoning have long been recognized as preferable for accurately perceiving our three-dimensional world, especially when compared with traditional visual reasoning based on 2D images. Due to the difficulties in collecting high-quality 3D data, research in this area has only recently gained momentum. With the advent of powerful large language models (LLMs), multi-modal LLMs for 3D vision have been developed over the past few years. However, most of these models focus primarily on the vision encoder for 3D data. In this paper, we propose converting existing densely activated LLMs into mixture-of-experts (MoE) models, which have proven effective for multi-modal data processing. In addition to leveraging these models' instruction-following capabilities, we further enable embodied task planning by attaching a diffusion head, Pose-DiT, that employs a novel rectified flow diffusion scheduler. Experimental results on 3D question answering and task-planning tasks demonstrate that our 3D-MoE framework achieves improved performance with fewer activated parameters.
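As a rough illustration of rectified flow in general (not the paper's specific scheduler or the Pose-DiT network), the sketch below uses the standard formulation: training points lie on the straight line between a noise sample and a data sample, the regression target is the constant velocity along that line, and sampling integrates the learned velocity field with a few Euler steps. The 6-vector pose layout, `target_pose`, and `oracle_velocity` (a hand-written stand-in for a trained model) are assumptions made for the demo.

```python
import random


def interpolate(noise, pose, t):
    """Point on the straight path from noise (t=0) to the data sample (t=1)."""
    return [(1.0 - t) * n + t * p for n, p in zip(noise, pose)]


def velocity_target(noise, pose):
    """Regression target for the velocity model: constant along the path."""
    return [p - n for n, p in zip(noise, pose)]


def euler_sample(velocity_fn, noise, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical 6D pose: position (x, y, z) followed by three rotation angles.
target_pose = [0.3, -0.1, 0.5, 0.0, 1.57, 0.0]


def oracle_velocity(x, t):
    """Stand-in for a trained network: exact velocity toward one known pose."""
    return [(p - xi) / (1.0 - t) for p, xi in zip(target_pose, x)]


random.seed(0)
noise = [random.gauss(0, 1) for _ in target_pose]
pose_4 = euler_sample(oracle_velocity, noise, num_steps=4)
pose_1 = euler_sample(oracle_velocity, noise, num_steps=1)

# Error of the 4-step and 1-step samples against the known pose.
print(max(abs(a - b) for a, b in zip(pose_4, target_pose)))
print(max(abs(a - b) for a, b in zip(pose_1, target_pose)))
```

With a perfectly straight path, even a single Euler step lands on the target in this toy setting. A learned velocity field is only approximately straight, so real samplers typically use more than one step, but the step count can still be small, which is the usual argument for faster inference.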

Paper Structure

This paper contains 14 sections, 4 equations, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: The architecture of 3D-MoE. In Stage I (left), we align the 3D vision encoder with the LLM by pretraining the linear projection layer. In Stage II (right), we derive our 3D-MoE model from the pretrained multi-modal LLM and fine-tune it with LoRA on downstream 3D tasks.
  • Figure 2: The architecture of Pose-DiT.