Can 3D Vision-Language Models Truly Understand Natural Language?

Weipeng Deng; Jihan Yang; Runyu Ding; Jiahui Liu; Yijiang Li; Xiaojuan Qi; Edith Ngai

TL;DR

This work shows that 3D vision-language models falter on natural language variations that preserve meaning. Using the 3D Language Robustness (3D-LR) benchmark and dataset, the authors systematically evaluate models across five language variant styles and three 3D-VL tasks, and find significant degradation even for state-of-the-art 3D-LLMs. They identify the fusion module as a major source of brittleness and show that a training-free, LLM-based pre-alignment module substantially improves robustness. The findings highlight the need for greater linguistic diversity in datasets, with broad implications for embodied agents and robotics deployments.

Abstract

Rapid advancements in 3D vision-language (3D-VL) tasks have opened up new avenues for human interaction with embodied agents or robots using natural language. Despite this progress, we find a notable limitation: existing 3D-VL models are sensitive to the style of the language input and struggle to understand sentences that share the same semantic meaning but are phrased differently. This observation raises a critical question: Can 3D vision-language models truly understand natural language? To test the language understanding of 3D-VL models, we first propose a language robustness task that systematically assesses 3D-VL models across various tasks, benchmarking their performance when presented with different language style variants. Importantly, these variants are commonly encountered in applications requiring direct interaction with humans, such as embodied robotics, given the diversity and unpredictability of human language. We propose a 3D Language Robustness Dataset, designed based on the characteristics of human language, to facilitate the systematic study of robustness. Our comprehensive evaluation uncovers a significant drop in the performance of all existing models across various 3D-VL tasks. Even the state-of-the-art 3D-LLM fails to understand some variants of the same sentences. Further in-depth analysis suggests that existing models have a fragile and biased fusion module, which stems from the low diversity of existing datasets. Finally, we propose a training-free module driven by an LLM, which improves language robustness. Datasets and code will be available on GitHub.
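To make the idea of a training-free, LLM-driven module concrete, here is a minimal sketch. It assumes (this is not stated in the abstract) that the module rewrites each incoming sentence toward the phrasing the model saw during training and then passes it to the unchanged, frozen model. All names (`with_pre_alignment`, `toy_model`, `toy_rephrase`, `CANONICAL`) are hypothetical, and the lookup table stands in for an LLM call.

```python
from typing import Any, Callable

Model = Callable[[Any, str], str]   # (scene, query) -> answer; hypothetical interface
Rephraser = Callable[[str], str]    # query -> rewritten query


def with_pre_alignment(model: Model, rephrase: Rephraser) -> Model:
    """Wrap a frozen model so every query is rephrased before inference.

    No weights are updated, which is what makes the wrapper training-free.
    """
    def wrapped(scene: Any, query: str) -> str:
        return model(scene, rephrase(query))
    return wrapped


def toy_model(scene: Any, query: str) -> str:
    # Toy stand-in for a brittle 3D-VL model: it only handles one phrasing.
    return "chair_3" if query == "the chair next to the table" else "unknown"


# Toy stand-in for the LLM rephraser (assumption: a real system would
# prompt an LLM here instead of using a lookup table).
CANONICAL = {
    "the chair which the table is next to": "the chair next to the table",
}


def toy_rephrase(query: str) -> str:
    return CANONICAL.get(query, query)


robust_model = with_pre_alignment(toy_model, toy_rephrase)
```

In this toy setup, `toy_model` fails on the reworded query while `robust_model` resolves it, which mirrors the kind of gap the benchmark is designed to expose.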

Paper Structure (26 sections, 16 figures, 16 tables)

This paper contains 26 sections, 16 figures, 16 tables.

Introduction
Related Works
3D Language Robustness (3D-LR) Benchmark
3D Language Robustness Task
3D Language Robustness (3D-LR) dataset
Experiments
Main Results
Analysis and Improved Model
Why do the 3D-VL models fail?
Plug and Play Pre-Alignment Module
Discussion on Data Augmentation
Conclusion
More Examples
Datasets
Basic Statistics
...and 11 more sections

Figures (16)

Figure 1: Fragility of 3D-VL models in natural language understanding. This figure illustrates the significant performance degradation of 3D vision-language models when faced with natural language variations common in human communication. The variations tested include: (a) the original sentence from the training set; (b) shifting from active voice to passive voice; (c) saying the same thing in a different accent; (d) saying it in a new conversational tone. These variations are common in human language, but the model fails on them.
Figure 2: Density map of the principal features of vectorized syntax structures for four datasets. Darker areas indicate a higher concentration of similar sentence patterns. A concentrated dark region suggests that the dataset consists of simple, less diverse sentence structures. See the supplementary material for more details.
Figure 3: Review of popular model architectures in 3D-VL. (a) illustrates traditional dual-stream models, while (b) presents the recently proposed LLM-based architecture.
Figure 4: The overall rephrasing process used to build our proposed evaluation suite, along with the abstract prompt design. We prompt GPT to paraphrase the original sentence into five different styles derived from the characteristics of human natural language. One example from ScanRefer [chen2020scanrefer] is shown in the different styles.
Figure 5: Performance summary of existing models on our 3D Language Robustness benchmark. We report listening accuracy (Acc) for NR3D [achlioptas2020referit3d] and ScanRefer-GT, Acc@kIoU for ScanRefer [chen2020scanrefer], and EM@1 for ScanQA [azuma2022scanqa]. Average robustness is computed by averaging a model's performance across all five language variant splits. Both 3D-QA and 3D-VG models show performance drops, indicating a lack of robustness.
...and 11 more figures
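
The average robustness described in the Figure 5 caption is a plain mean over the five language variant splits. A minimal sketch follows; the split names and all scores are made-up placeholders rather than results from the paper, and `average_robustness` is a hypothetical helper name.

```python
from statistics import mean


def average_robustness(variant_scores: dict[str, float]) -> float:
    """Mean of one model's metric over its language-variant splits."""
    if not variant_scores:
        raise ValueError("need at least one variant split")
    return mean(variant_scores.values())


# Placeholder scores for one model on one task (NOT results from the paper).
original_score = 50.0
variant_scores = {
    "variant_1": 40.0,
    "variant_2": 35.0,
    "variant_3": 30.0,
    "variant_4": 25.0,
    "variant_5": 20.0,
}

avg = average_robustness(variant_scores)
drop = original_score - avg  # absolute drop relative to the original split
print(f"average robustness: {avg:.1f}, drop: {drop:.1f}")
```

The same function applies to any of the metrics named in the caption (Acc, Acc@kIoU, EM@1), since it only averages per-split scores.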
