PerLA: Perceptive 3D Language Assistant

Guofeng Mei, Wei Lin, Luigi Riz, Yujiao Wu, Fabio Poiesi, Yiming Wang

TL;DR

PerLA tackles the challenge of enabling 3D language models to perceive fine-grained geometry without exploding token counts. It introduces a perceptive 3D scene encoder that partitions a point cloud with Hilbert-curve serialization into local parts while preserving a global context, and fuses these representations via localized cross-attention and a Graph Convolutional Network. A local representation consensus loss stabilizes training and encourages object-consistent features during local-to-global aggregation. Demonstrated on ScanQA, ScanRefer, and Nr3D benchmarks, PerLA achieves state-of-the-art results for 3D question answering and dense captioning, suggesting substantial impact for robust, detail-rich 3D language understanding in real-world scenes.

Abstract

Enabling Large Language Models (LLMs) to understand the 3D physical world is an emerging yet challenging research direction. Current strategies for processing point clouds typically downsample the scene or divide it into smaller parts for separate analysis. However, both approaches risk losing key local details or global contextual information. In this paper, we introduce PerLA, a 3D language assistant designed to be more perceptive to both details and context, making visual representations more informative for the LLM. PerLA captures high-resolution (local) details in parallel from different point cloud areas and integrates them with (global) context obtained from a lower-resolution whole point cloud. We present a novel algorithm that preserves point cloud locality through the Hilbert curve and effectively aggregates local-to-global information via cross-attention and a graph neural network. Lastly, we introduce a novel loss for local representation consensus to promote training stability. PerLA outperforms state-of-the-art 3D language assistants, with gains of up to +1.34 CIDEr on ScanQA for question answering, and +4.22 on ScanRefer and +3.88 on Nr3D for dense captioning. Project page: https://gfmei.github.io/PerLA/
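To make the locality-preserving partitioning mentioned in the abstract concrete, the following is a minimal sketch of Hilbert-curve serialization of a point cloud, not the authors' implementation. It assumes the scene is quantized onto a cubic grid, uses Skilling's transpose method to compute Hilbert indices, and splits the sorted points into equal-length chunks. The names `hilbert_index` and `hilbert_partition`, the `bits` grid resolution, and the equal-length split are all illustrative assumptions.

```python
import numpy as np


def hilbert_index(cells: np.ndarray, bits: int) -> np.ndarray:
    """Hilbert index of integer grid cells in [0, 2**bits) per axis (Skilling's method)."""
    x = np.array(cells, dtype=np.int64)  # copy, so the caller's array is untouched
    n_pts, dim = x.shape
    top = 1 << (bits - 1)

    # "Inverse undo" pass: walk from the top bit down to bit 1.
    q = top
    while q > 1:
        p = q - 1
        for i in range(dim):
            has_bit = (x[:, i] & q) != 0
            x[has_bit, 0] ^= p               # invert low bits of axis 0
            t = (x[:, 0] ^ x[:, i]) & p      # otherwise exchange low bits of axes 0 and i
            t[has_bit] = 0
            x[:, 0] ^= t
            x[:, i] ^= t
        q >>= 1

    # Gray-encode step of Skilling's axes-to-transpose conversion.
    for i in range(1, dim):
        x[:, i] ^= x[:, i - 1]
    t = np.zeros(n_pts, dtype=np.int64)
    q = top
    while q > 1:
        t ^= np.where((x[:, dim - 1] & q) != 0, q - 1, 0)
        q >>= 1
    x ^= t[:, None]

    # Interleave bits (most significant first, axis 0 first) into one integer.
    idx = np.zeros(n_pts, dtype=np.int64)
    for b in range(bits - 1, -1, -1):
        for i in range(dim):
            idx = (idx << 1) | ((x[:, i] >> b) & 1)
    return idx


def hilbert_partition(points: np.ndarray, num_parts: int, bits: int = 10):
    """Return `num_parts` index arrays; each holds points contiguous along the curve.

    Assumption: equal-length chunks of the sorted order stand in for the
    paper's partitioning rule, which is not specified in this section.
    """
    lo = points.min(axis=0)
    span = max(float((points.max(axis=0) - lo).max()), 1e-9)  # cubic bounding box
    cells = np.floor((points - lo) / span * (2**bits - 1)).astype(np.int64)
    order = np.argsort(hilbert_index(cells, bits), kind="stable")
    return np.array_split(order, num_parts)
```

For example, `hilbert_partition(xyz, num_parts=8)` on an `(N, 3)` array returns eight index arrays that together cover every point once. Because neighbouring positions on the curve are also neighbours in space, each chunk tends to form a spatially compact region, which is the property the abstract relies on.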

Paper Structure

This paper contains 28 sections, 11 equations, 6 figures, 12 tables, 2 algorithms.

Figures (6)

  • Figure 1: PerLA is a 3D language assistant that integrates local details with global context to learn informative representations of 3D scenes, whereas state-of-the-art (SOTA) 3DLAs focus solely on global context information. PerLA can provide more accurate responses, correctly distinguishing between objects such as a "black computer monitor" and a "black suitcase," where SOTA models instead fail with hallucinated responses. The examples shown are cases where capturing details from the point cloud matters for accurate output captions.
  • Figure 2: Overview of PerLA. (Left): The overall pipeline of PerLA, which begins by extracting interaction-aware 3D scene representations. These representations are then projected onto the prefix of textual instructions via MMA, serving as input to a frozen large language model (LLM). (Right): The detailed design of PerLA. First, the 3D scene is divided into spatially compact regions using Hilbert-based scene serialization [sagan1994hilbert]. Next, an efficient $k$-NN algorithm associates each point-level global representation with its detail-enriched local representations, creating a comprehensive scene representation through a Graph Convolutional Network (GCN). Finally, smoothness and regularization losses are applied to promote stable learning for the proposed perceptive scene encoder.
  • Figure 3: Qualitative comparison between our method, PerLA, and LL3DA [chen2024ll3da] on the ScanQA [azuma2022scanqa] dataset, showing that our approach responds more accurately to "what"-related questions.
  • Figure 4: Qualitative comparisons on the dense captioning task across the Nr3D [achlioptas2020referit3d] and ScanRefer [chen2020scanrefer] datasets. We compare the results of our PerLA with LL3DA [chen2024ll3da]. PerLA generates accurate descriptions, effectively capturing fine-grained object attributes and spatial relationships.
  • Figure E: Visualization of two qualitative examples demonstrating scene partitioning using Hilbert-based serialization. The images illustrate the stepwise refinement of point cloud partitions, with each row corresponding to a different scene example. From left to right, the partitions (highlighted with brownish color) evolve as the serialization method groups spatially adjacent points.
  • ...and 1 more figure
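The $k$-NN association step described in the Figure 2 caption can be illustrated with a small sketch. This is only a simplified stand-in under stated assumptions: brute-force nearest-neighbour search replaces the "efficient" $k$-NN algorithm, and a plain average of neighbour features replaces the paper's cross-attention and GCN. The function names `knn_indices` and `fuse_local_into_global` and the 0.5 blending weight are hypothetical.

```python
import numpy as np


def knn_indices(queries: np.ndarray, candidates: np.ndarray, k: int) -> np.ndarray:
    """For each query point, return indices of its k nearest candidates (brute force)."""
    d2 = ((queries[:, None, :] - candidates[None, :, :]) ** 2).sum(axis=-1)  # (Q, C)
    return np.argsort(d2, axis=1, kind="stable")[:, :k]


def fuse_local_into_global(global_xyz, global_feat, local_xyz, local_feat, k=3):
    """Blend each global feature with the mean feature of its k nearest local points.

    Assumption: mean pooling plus a fixed 0.5 blend is a placeholder for the
    learned local-to-global aggregation; it is not the authors' method.
    """
    nbrs = knn_indices(global_xyz, local_xyz, k)   # (G, k) neighbour indices
    local_mean = local_feat[nbrs].mean(axis=1)     # (G, D) pooled local detail
    return 0.5 * (global_feat + local_mean)
```

Here `global_xyz` and `local_xyz` are `(G, 3)` and `(L, 3)` coordinate arrays, and the feature arrays share the same feature width `D`. The point of the sketch is the data flow: each coarse global point gathers the detail-rich local representations nearest to it before any learned aggregation is applied.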