Disc3D: Automatic Curation of High-Quality 3D Dialog Data via Discriminative Object Referring

Siyuan Wei, Chunjie Wang, Xiao Liu, Xiaosheng Yan, Zhishan Zhou, Rui Huang

TL;DR

Disc3D automates the creation of high-quality 3D scene dialogue data, resolving viewpoint and object-referring ambiguities through a discriminative object referring framework. The four-stage pipeline (meta-annotation collection, scene graph construction, discriminative object referring, and multi-task data generation) produces a multi-task dataset of over 2 million samples across 25K scenes for training 3D MLLMs. Experiments show that training with Disc3D improves visual grounding and QA performance on both Disc3D and public benchmarks, with a two-stage training paradigm and a task-mixing strategy yielding the best gains. The work reduces reliance on manual annotation, offers a scalable and controllable data generation process, and provides a benchmark suite for evaluating 3D MLLMs. It also points to architecture-data co-evolution as a direction for future work; remaining challenges include missing annotations and reflection effects.

Abstract

3D Multi-modal Large Language Models (MLLMs) still lag behind their 2D peers, largely because large-scale, high-quality 3D scene-dialogue datasets remain scarce. Prior efforts hinge on expensive human annotation and leave two key ambiguities unresolved: viewpoint ambiguity, where spatial language presumes unknown camera poses, and object referring ambiguity, where non-exclusive descriptions blur the line between targets and distractors. We therefore present a fully automated pipeline that converts raw 3D scans into unambiguous, high-quality dialogue data at a fraction of the previous cost. By synergizing rule-based constraints with 2D MLLMs and LLMs, the pipeline enables controllable, scalable generation without human intervention. The pipeline comprises four stages: (1) meta-annotation collection harvesting object-, frame-, and scene-level captions, (2) scene graph construction with relation correction to capture proximal object relations, (3) discriminative object referring that generates exclusive and compact descriptions, and (4) multi-task data generation synthesizing diverse dialogues. Our pipeline systematically mitigates inherent flaws in source datasets and produces the final Disc3D dataset, over 2 million samples in 25K hybrid 3D scenes, spanning scene, view, and object captioning, visual grounding, and five object-centric QA tasks. Extensive experiments demonstrate that training with Disc3D yields consistent, significant improvements on both public benchmarks and our multifaceted Disc3D-QA tasks. Code, data, and models will be publicly available.
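
To make the notion of viewpoint ambiguity concrete, the following is a minimal sketch, not code from the paper. It assumes a simplified 2D floor-plane setup, and the function name `side_of`, the coordinates, and the yaw convention are all hypothetical. It shows that the same pair of objects yields opposite "left"/"right" answers for two cameras facing each other, which is why spatial language without a known camera pose is ambiguous.

```python
import math


def side_of(target, reference, cam_pos, cam_yaw):
    """Say whether `target` appears left or right of `reference`.

    Assumption (illustrative only): objects and camera live in a 2D floor
    plane, and `cam_yaw` is the viewing direction in radians measured
    counter-clockwise from the +x axis.
    """
    # Forward direction of the camera in the floor plane.
    fx, fy = math.cos(cam_yaw), math.sin(cam_yaw)
    # Right-hand direction: the forward vector rotated by -90 degrees.
    rx, ry = fy, -fx

    def lateral(point):
        # Signed offset of a point along the camera's right-hand axis.
        return (point[0] - cam_pos[0]) * rx + (point[1] - cam_pos[1]) * ry

    return "right" if lateral(target) > lateral(reference) else "left"


chair, table = (1.0, 0.0), (-1.0, 0.0)

# Camera south of the objects, looking north (+y).
print(side_of(chair, table, cam_pos=(0.0, -5.0), cam_yaw=math.pi / 2))
# Camera north of the objects, looking south (-y): the answer flips.
print(side_of(chair, table, cam_pos=(0.0, 5.0), cam_yaw=-math.pi / 2))
```

A question such as "what is to the right of the table?" therefore has no single answer unless the dialogue fixes a viewpoint or anchors the relation to scene content.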

Paper Structure

This paper contains 41 sections, 15 figures, 6 tables, 3 algorithms.

Figures (15)

  • Figure 1: Examples of ambiguity in ScanQA, including (a) viewpoint ambiguity: the relative positions of objects (e.g., right or left) implied in a dialogue depend on viewpoint information not present in the context; and (b) object referring ambiguity: object descriptions (e.g., chair) lack discriminative detail, resulting in confusion between the target object and distractors.
  • Figure 2: Common defects in 3D scene datasets that impede dialogue data generation: a) Semantic label subsumption (e.g., office chair vs. chair) induces referential ambiguity; b) Non-canonical camera views degrade 2D MLLM annotation accuracy; c) Blurry images reduce visual confidence; d) Noisy point clouds produce inaccurate 9-DOF boxes, impairing downstream learning.
  • Figure 3: Overview of our Disc3D dataset. (a) Disc3D comprises millions of dialogue samples across 25K hybrid (real and synthetic) scenes, covering five object-centric QA tasks and three caption tasks for training & benchmarking 3D MLLMs. Object descriptions are color-coded to match the corresponding highlighted objects. (b) Task and scene distributions in the training set. The object counting task is excluded due to category-specific answer bias.
  • Figure 4: A schematic overview of two key stages in our data curation pipeline. a) Scene Graph Construction proceeds in two steps: an initial graph is built automatically, after which an LLM- and MLLM-guided module corrects the misjudged relations (highlighted in red). b) Discriminative Object Referring produces exclusive, unambiguous descriptions for each object in the distractor group across five orthogonal axes. The first sub-stage, Comparative Disambiguation, contrasts objects of the same category along appearance, size, and relational cues; the second sub-stage, Spatially Anchoring, injects 3D context by explicitly conditioning every description on the designated anchor object or sight.
  • Figure 5: Discriminative object referral examples in ScanNet++ (Yeshwanth et al., 2023) scans.
  • ...and 10 more figures
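
The idea of an exclusive and compact description (Figure 4b) can be illustrated with a small rule-based sketch. This is not the authors' implementation, which relies on 2D MLLMs and LLMs; it is a brute-force search under simplifying assumptions, and the attribute axes (`color`, `size`, `near`), the dictionary layout, and the function name `discriminative_attributes` are hypothetical. Given a target and same-category distractors, it looks for the smallest attribute subset that no distractor matches in full.

```python
from itertools import combinations


def discriminative_attributes(target, distractors, axes):
    """Return the smallest attribute subset that singles out `target`.

    Assumption (illustrative only): every object is a dict mapping an
    axis name to a symbolic value. A subset is exclusive when each
    distractor differs from the target on at least one axis in it.
    """
    for size in range(1, len(axes) + 1):  # prefer compact descriptions
        for subset in combinations(axes, size):
            exclusive = all(
                any(d.get(axis) != target.get(axis) for axis in subset)
                for d in distractors
            )
            if exclusive:
                return {axis: target.get(axis) for axis in subset}
    return None  # the target cannot be told apart on these axes


target = {"color": "black", "size": "large", "near": "desk"}
distractors = [
    {"color": "black", "size": "small", "near": "desk"},
    {"color": "red", "size": "large", "near": "window"},
]

# No single axis is exclusive here, so two attributes are needed.
print(discriminative_attributes(target, distractors, ("color", "size", "near")))
```

In this toy case a description like "the chair" or "the black chair" would still match a distractor, whereas "the large black chair" does not. When no subset is exclusive the function returns `None`, which is where additional cues such as spatial anchoring would be needed.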