LEO-VL: Efficient Scene Representation for Scalable 3D Vision-Language Learning

Jiangyong Huang, Xiaojian Ma, Xiongkun Linghu, Junchao He, Qing Li, Song-Chun Zhu, Yixin Chen, Baoxiong Jia, Siyuan Huang

Abstract

Developing vision-language models (VLMs) capable of understanding 3D scenes has been a longstanding research goal. Despite recent progress, 3D VLMs still struggle with spatial reasoning and robustness. We identify three key obstacles hindering their progress: (1) scene representation is constrained by a capacity-efficiency trade-off, which impedes scalable learning; (2) training data lacks a comprehensive scheme, with limited diversity across tasks and scene domains; and (3) models exhibit robustness deficiencies and lack effective post-training. To address these challenges, we first propose the condensed feature grid (CFG), an efficient scene representation that significantly reduces token overhead while preserving strong perceptual capacity. Building on CFG, we introduce LEO-VL, a 3D VLM trained on over 700k 3D vision-language (3D-VL) data samples spanning four real-world indoor domains and five tasks such as captioning and dialogue. To further improve robustness, we propose SceneDPO, a novel post-training objective that incorporates contrastive signals across both answers and scenes. LEO-VL achieves state-of-the-art performance on various 3D-VL benchmarks, such as SQA3D, Beacon3D, and Scan2Cap. Extensive analyses highlight the efficiency of CFG and provide key insights such as the importance of task and scene diversity, the priority of data quality for effective scaling, and the advantages of SceneDPO.

Paper Structure

This paper contains 50 sections, 6 equations, 9 figures, 13 tables.

Figures (9)

  • Figure 1: LEO-VL model design. LEO-VL extracts 2D visual features from multi-view RGB-D frames and transforms the features into a CFG, significantly reducing the token overhead while preserving 3D spatial structure. An LLM performs auto-regressive language modeling based on the CFG tokens and language tokens.
  • Figure 2: Joint visualization of accuracy and efficiency. Accuracy is measured by exact-match (EM) accuracy on SQA3D, while efficiency is measured by scene token count. LEO-VL reaches a Pareto optimum between efficiency and accuracy.
  • Figure 3: Statistics of scene token count on ScanNet. Blue bars denote voxel tokens and green bars denote CFG tokens after vertical condensation.
  • Figure 4: Data scaling curve. Performance is measured by the average metrics in the domain ablation table (label tab:domain_ablation).
  • Figure 5: Ablation of position embeddings on the ScanNet subset. For more consistent visualization, we use the case-centric metric for Beacon3D, and averaged metrics for others.
  • ...and 4 more figures
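
The captions of Figures 1 and 3 describe voxel tokens being reduced to CFG tokens by vertical condensation. The snippet below is a minimal sketch of that idea, not the authors' implementation: it assumes per-point features have already been lifted from the RGB-D frames, assumes mean pooling at both stages, and uses a hypothetical function name (condense_vertically) and a hypothetical voxel size of 0.2. The pooling operator and grid resolution used in LEO-VL may differ.

```python
import numpy as np


def condense_vertically(points, feats, voxel_size=0.2):
    """Pool per-point features into voxels, then pool each vertical column.

    points: (N, 3) array of xyz coordinates, with z as height.
    feats:  (N, C) array of per-point features.
    Returns (number of occupied voxels, column keys (M, 2), column features (M, C)).
    Mean pooling is an assumption made for illustration.
    """
    # Integer voxel index of every point.
    vox = np.floor(points / voxel_size).astype(np.int64)

    # Stage 1: average the features of all points that share a voxel.
    vox_keys, vox_inv = np.unique(vox, axis=0, return_inverse=True)
    vox_inv = vox_inv.reshape(-1)
    vox_feats = np.zeros((len(vox_keys), feats.shape[1]))
    np.add.at(vox_feats, vox_inv, feats)
    vox_feats /= np.bincount(vox_inv, minlength=len(vox_keys))[:, None]

    # Stage 2: average the occupied voxels that share an (x, y) column,
    # so each column of the grid contributes a single token.
    col_keys, col_inv = np.unique(vox_keys[:, :2], axis=0, return_inverse=True)
    col_inv = col_inv.reshape(-1)
    col_feats = np.zeros((len(col_keys), feats.shape[1]))
    np.add.at(col_feats, col_inv, vox_feats)
    col_feats /= np.bincount(col_inv, minlength=len(col_keys))[:, None]

    return len(vox_keys), col_keys, col_feats
```

Calling condense_vertically(points, feats) on a scene returns the occupied-voxel count alongside the condensed columns, so the two token counts compared in Figure 3 can be read off as the first return value and len(col_keys). Since indoor scenes stack several occupied voxels above most floor locations, the column count is typically much smaller than the voxel count.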