Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis

Jiangyong Huang, Baoxiong Jia, Yan Wang, Ziyu Zhu, Xiongkun Linghu, Qing Li, Song-Chun Zhu, Siyuan Huang

TL;DR

This paper proposes Beacon3D, a benchmark for 3D-VL grounding and QA tasks that delivers a perspective shift in the evaluation of 3D-VL understanding. Evaluating state-of-the-art 3D-VL models on Beacon3D shows that object-centric evaluation elicits true model performance and exposes particularly weak generalization in QA.

Abstract

Existing 3D vision-language (3D-VL) benchmarks fall short in evaluating 3D-VL models, creating a "mist" that obscures rigorous insights into model capabilities and 3D-VL tasks. This mist persists due to three key limitations. First, flawed test data, like ambiguous referential text in the grounding task, can yield incorrect and unreliable test results. Second, oversimplified metrics, such as simply averaging accuracy per question answering (QA) pair, cannot reveal true model capability because they are vulnerable to language variations. Third, existing benchmarks isolate the grounding and QA tasks, disregarding the underlying coherence that QA should be based on solid grounding capabilities. To unveil the "mist", we propose Beacon3D, a benchmark for 3D-VL grounding and QA tasks, delivering a perspective shift in the evaluation of 3D-VL understanding. Beacon3D features (i) high-quality test data with precise and natural language, (ii) object-centric evaluation with multiple tests per object to ensure robustness, and (iii) a novel chain-of-analysis paradigm to address language robustness and model performance coherence across grounding and QA. Our evaluation of state-of-the-art 3D-VL models on Beacon3D reveals that (i) object-centric evaluation elicits true model performance and particularly weak generalization in QA; (ii) grounding-QA coherence remains fragile in current 3D-VL models; and (iii) incorporating large language models (LLMs) into 3D-VL models, though a prevalent practice, hinders grounding capabilities and has yet to elevate QA capabilities. We hope Beacon3D and our comprehensive analysis can help the 3D-VL community move towards faithful developments.
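
To make the contrast between per-case averaging and object-centric evaluation concrete, here is a minimal sketch. It assumes that an object counts as correct only when every test case attached to it is passed; the abstract says only that each object has multiple tests, so this all-or-nothing rule, the function names, and the toy data are illustrative assumptions rather than the benchmark's actual implementation.

```python
from collections import defaultdict


def per_case_accuracy(results):
    """Average correctness over individual test cases (the conventional metric)."""
    return sum(ok for _, ok in results) / len(results)


def per_object_accuracy(results):
    """Fraction of objects whose test cases are ALL correct (assumed rule)."""
    by_object = defaultdict(list)
    for object_id, ok in results:
        by_object[object_id].append(ok)
    return sum(all(oks) for oks in by_object.values()) / len(by_object)


# Hypothetical results: (object id, whether the model passed that test case).
results = [
    ("chair_1", True), ("chair_1", True), ("chair_1", False),
    ("table_2", True), ("table_2", True),
    ("lamp_3", True), ("lamp_3", False),
]

case_acc = per_case_accuracy(results)      # 5 of 7 cases pass
object_acc = per_object_accuracy(results)  # only table_2 passes every case
print(f"per-case: {case_acc:.3f}, per-object: {object_acc:.3f}")
```

In this toy example a single rephrased test that fails is enough to mark an object as wrong, which is why the per-object score falls well below the per-case average.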

Paper Structure

This paper contains 54 sections, 13 figures, 9 tables.

Figures (13)

  • Figure 1: An overview of Beacon3D, a novel benchmark for 3D grounding and question answering (QA) tasks. Beacon3D features an object-centric evaluation framework, with Grounding-Chains (G-Chains) and Grounding-QA-Chains (GQA-Chains) for each object. The evaluation adopts object-centric metrics to ensure robustness and utilizes chain-of-analysis for studies in task coherence. We also study various knowledge types such as class, appearance ("App."), spatial ("Spa."), and geometry ("Geo.").
  • Figure 2: Various types of test data flaws in ScanRefer, Nr3D, and ScanQA. Underlined text indicates explicit flaws. (1) The top row shows grounding data with the target object highlighted. Ambiguous text includes viewpoint-dependent expressions like "left" and "right", or lacks the information needed to uniquely specify the target object. Unnatural descriptions are hard for humans to understand because they are too tedious or grammatically invalid. Incorrect annotation refers to a mismatch between the text and the target object. (2) The bottom row shows QA data with ground truth (GT) shown in square brackets. An ambiguous question lacks the context to clarify the queried object, potentially leading to contradictory answers. Incomplete answers may rule out alternative correct answers.
  • Figure 3: Illustrative examples of visual ignorance. The model predicts answers directly from questions, ignoring scene information (e.g., chair color).
  • Figure 4: Illustrative examples of language robustness. Rephrased and more detailed questions about the same concept can easily lead to wrong model predictions.
  • Figure 5: (a) Illustration of GQA-Chains. The questions derive from the grounding text and query a specific feature of the target object. We define two broken types for grounding-QA coherence: (Type 1) correct grounding and incorrect QA, indicating a lack of QA skills; (Type 2) incorrect grounding and correct QA, suggesting shortcuts in QA. (b) The effect of rephrasing ScanRefer texts on the performance of PQ3D. (c) The effect of rephrasing SQA3D questions on the performance of PQ3D. (d) Results of PQ3D on GQA-Chains. We observe that over half of QA failures (24% out of 46%) stem from insufficient QA skills, while nearly a quarter of correct QA predictions (14% out of 54%) are achieved via shortcuts.
  • ...and 8 more figures
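
The two broken-coherence types defined in the Figure 5 caption can be tallied with a short script. The sketch below assumes each record pairs a grounding outcome with a QA outcome for the same object; the record layout, the function name `coherence_breakdown`, and the toy data are hypothetical and not taken from the benchmark's code.

```python
from collections import Counter


def coherence_breakdown(records):
    """Return the share of each grounding/QA outcome combination.

    records: iterable of (grounding_ok, qa_ok) boolean pairs.
    """
    counts = Counter()
    for grounding_ok, qa_ok in records:
        if grounding_ok and qa_ok:
            counts["coherent_correct"] += 1
        elif grounding_ok and not qa_ok:
            counts["type_1"] += 1   # correct grounding, incorrect QA
        elif not grounding_ok and qa_ok:
            counts["type_2"] += 1   # incorrect grounding, correct QA
        else:
            counts["both_wrong"] += 1
    total = sum(counts.values())
    keys = ("coherent_correct", "type_1", "type_2", "both_wrong")
    return {key: counts[key] / total for key in keys}


# Hypothetical toy data: ten (grounding_ok, qa_ok) pairs.
records = (
    [(True, True)] * 4
    + [(True, False)] * 3
    + [(False, True)] * 2
    + [(False, False)] * 1
)

for outcome, share in coherence_breakdown(records).items():
    print(f"{outcome}: {share:.0%}")
```

Dividing the Type 1 share by the overall QA failure rate, or the Type 2 share by the overall QA success rate, gives ratios of the kind quoted in the caption (for example, 24% out of 46%).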