Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning

Weitai Kang, Haifeng Huang, Yuzhang Shang, Mubarak Shah, Yan Yan

TL;DR

Robin3D tackles the lack of robust instruction data for 3D LLMs by introducing Robust Instruction Generation, which builds 1 million instruction-following samples combining Adversarial and Diverse data streams. The approach integrates a Chat-Scene backbone with two architectural innovations, the Relation-Augmented Projector and a post-vision sequence organization, and trains with LoRA on Vicuna-7B-v1.5 using a two-stage regimen, achieving state-of-the-art performance across 12 3D vision–language benchmarks without task-specific fine-tuning. Key contributions include four adversarial tasks (HOPE, HROC, PF-3DVG, 3DFQA), a diverse data pipeline via ChatGPT rephrasing, and architectural gains that improve spatial grounding and object referencing. The results demonstrate strong generalization and discriminative ability for 3D grounding, captioning, and QA, with implications for robust, open-ended 3D agents and future work exploring outdoor data and open-vocabulary capabilities.

Abstract

Recent advancements in 3D Large Language Models (3DLLMs) have highlighted their potential for building general-purpose agents in the 3D real world, yet challenges remain due to the lack of high-quality, robust instruction-following data, which limits the discriminative power and generalization of 3DLLMs. In this paper, we introduce Robin3D, a powerful 3DLLM trained on large-scale instruction-following data generated by our novel data engine, the Robust Instruction Generation (RIG) engine. RIG generates two key types of instruction data: 1) Adversarial Instruction-following data, which mixes negative and positive samples to enhance the model's discriminative understanding; and 2) Diverse Instruction-following data, which contains various instruction styles to enhance the model's generalization. As a result, we construct 1 million instruction-following samples, consisting of 344K Adversarial samples, 508K Diverse samples, and 165K benchmark training set samples. To better handle these complex instructions, Robin3D first incorporates a Relation-Augmented Projector to enhance spatial understanding, and then strengthens object referring and grounding ability through ID-Feature Bonding. Robin3D consistently outperforms previous methods across five widely used 3D multimodal learning benchmarks, without the need for task-specific fine-tuning. Notably, we achieve a 7.8% improvement in the grounding task (Multi3DRefer) and a 6.9% improvement in the captioning task (Scan2Cap).
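
As a quick sanity check on the data composition stated above, the following minimal sketch adds up the three subsets and computes each one's share of the total. The counts come from the abstract; the variable and function names (`counts_k`, `shares`) are illustrative assumptions and are not taken from the paper or its code.

```python
# Sample counts in thousands, as stated in the abstract.
# The dictionary keys are illustrative labels, not identifiers from the paper.
counts_k = {
    "Adversarial": 344,
    "Diverse": 508,
    "Benchmark training sets": 165,
}

# Sum of the three subsets, in thousands.
total_k = sum(counts_k.values())


def shares(counts):
    """Return each subset's percentage of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


if __name__ == "__main__":
    for name, pct in shares(counts_k).items():
        print(f"{name}: {counts_k[name]}K ({pct}%)")
    print(f"Total: {total_k}K")
```

The three subsets sum to 1,017K, which the abstract rounds to 1 million; the Diverse subset makes up roughly half of the total.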

Paper Structure (27 sections, 1 equation, 13 figures, 2 tables)

Figures (13)

  • Figure 1: Robin3D surpasses previous SOTA on all 12 benchmarks by training on our RIG-generated 1 million data.
  • Figure 2: The visualization of examples of adversarial / negative data. For better visualization, we associate each object ID with the same color as its bounding box. The black solid circles with numbers are solely for visualization purposes and are not included in the actual data.
  • Figure 3: Pipeline to generate our Diverse Instruction data by the in-context learning of ChatGPT.
  • Figure 4: The one-shot examples for ChatGPT to rephrase the instruction-following data.
  • Figure 5: Overview of Robin3D model structure. Bottom: Relation-Augmented Projector fuses the features and position embedding from Mask3D and Uni3D for final 3D features. 2D features from DINO v2 are projected into the LLM space. Middle: Sequence Organization enhances the connection between object IDs and object features by wrapping the features with identical IDs and the Post-Vision order. Top: We use LoRA to fine-tune the LLM on our constructed 1 million instruction data.
  • ...and 8 more figures