LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences

Hongyan Zhi, Peihao Chen, Junyan Li, Shuailei Ma, Xinyu Sun, Tianhang Xiang, Yinjie Lei, Mingkui Tan, Chuang Gan

TL;DR

This work tackles large-scale 3D scene understanding by enabling models to focus on task-relevant regions rather than processing all scene details. It introduces LSceneLLM, which combines a coarse scene encoder with a scene magnifier that selectively injects dense, fine-grained visual tokens guided by the LLM’s attention, via an adaptive self-attention fusion. The approach, validated on the XR-Scene cross-room benchmark and indoor/outdoor datasets, achieves state-of-the-art results across QA, embodied planning, and scene caption tasks, and demonstrates strong plug-and-play transfer to existing 3D-VLMs. The XR-Scene benchmark further provides a challenging evaluation suite for large-scale, cross-room understanding, highlighting practical implications for embodied AI in real-world environments.

Abstract

Research on 3D Vision-Language Models (3D-VLMs) is gaining increasing attention, as it is crucial for developing embodied AI within 3D scenes, such as visual navigation and embodied question answering. Due to the high density of visual features, especially in large 3D scenes, accurately locating task-relevant visual information is challenging. Existing works attempt to segment all objects and consider their features as scene representations. However, these task-agnostic object features contain much redundant information and miss details of the task-relevant area. To tackle these problems, we propose LSceneLLM, an adaptive framework that automatically identifies task-relevant areas by leveraging the LLM's visual preference for different tasks, followed by a plug-and-play scene magnifier module to capture fine-grained details in focused areas. Specifically, a dense token selector examines the attention map of the LLM to identify visual preferences for the instruction input. It then magnifies fine-grained details of the focused area. An adaptive self-attention module is leveraged to fuse the coarse-grained and selected fine-grained visual information. To comprehensively evaluate the large scene understanding ability of 3D-VLMs, we further introduce a cross-room understanding benchmark, XR-Scene, which contains a series of large scene understanding tasks including XR-QA, XR-EmbodiedPlanning, and XR-SceneCaption. Experiments show that our method surpasses existing methods on both large scene understanding and existing scene understanding benchmarks. Plugging our scene magnifier module into existing 3D-VLMs also brings significant improvement.
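
The attention-guided selection described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the scoring rule (mean attention over instruction tokens), the top-k cutoff, and the names select_focus_regions, gather_dense_tokens, and dense_by_region are all assumptions made for illustration, and the adaptive self-attention fusion step is omitted.

```python
from typing import Dict, List, Sequence


def select_focus_regions(attention: Sequence[Sequence[float]], top_k: int) -> List[int]:
    """Return indices of the top_k coarse vision tokens by mean attention.

    attention[i][j] is a hypothetical attention weight from instruction
    token i to coarse vision token j (an assumed input layout).
    """
    if not attention or top_k <= 0:
        return []
    num_vision = len(attention[0])
    # Average the attention each vision token receives across instruction tokens.
    scores = [sum(row[j] for row in attention) / len(attention) for j in range(num_vision)]
    ranked = sorted(range(num_vision), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:top_k])


def gather_dense_tokens(focus: Sequence[int], dense_by_region: Dict[int, List[str]]) -> List[str]:
    """Collect the dense tokens that belong to each focused region, in order."""
    tokens: List[str] = []
    for region in focus:
        tokens.extend(dense_by_region.get(region, []))
    return tokens


# Toy example: 2 instruction tokens attending to 4 coarse vision tokens.
attention = [
    [0.1, 0.6, 0.1, 0.2],
    [0.2, 0.5, 0.0, 0.3],
]
# Hypothetical mapping from each coarse region to its fine-grained tokens.
dense_by_region = {0: ["a0"], 1: ["b0", "b1"], 2: ["c0"], 3: ["d0", "d1"]}

focus = select_focus_regions(attention, top_k=2)
dense = gather_dense_tokens(focus, dense_by_region)
```

In this toy setting, only the regions that receive the most attention from the instruction contribute dense tokens, so the number of fine-grained tokens grows with top_k rather than with scene size.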

Paper Structure

This paper contains 29 sections, 4 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: We propose LSceneLLM, a novel framework for adaptive large 3D scene understanding. (a) Existing methods struggle to locate task-relevant visual information when facing large scenes. (b) We are committed to precisely identifying fine-grained task-related visual features through adaptive scene modeling. (c) Our method outperforms existing approaches across various benchmarks.
  • Figure 2: An Overview of LSceneLLM. LSceneLLM first perceives the scene through sparse vision tokens at the coarse level and then enhances regions of interest using dense vision tokens. Our method can effectively handle various visual language tasks in large scenes.
  • Figure 3: Illustration of Adaptive Self-attention Module and Dense Vision Token Selector. We first obtain the focused regions by analyzing the attention map of LLM. Then we extract dense point cloud features from the region of interest and parse dense vision tokens through sampling and grouping operations.
  • Figure 4: Examples of dataset XR-Scene. XR-Scene contains three cross-room scene benchmarks that comprehensively evaluate different understanding abilities.
  • Figure 5: Visualization of attention map of LLM. Red represents high activation values, while blue represents low activation values.
  • ...and 2 more figures