Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs

Insu Lee; Wooje Park; Jaeyun Jang; Minyoung Noh; Kyuhong Shim; Byonghyo Shim

TL;DR

This work addresses the limits of egocentric visual inputs for large vision-language models by introducing the E3VQA benchmark, which evaluates joint reasoning over synchronized ego- and exocentric views across four reasoning categories. It presents M3CoT, a training-free prompting scheme that fuses three scene graphs generated from different view combinations into a unified representation, enabling more robust cross-view reasoning. Empirical results show that M3CoT consistently improves over strong CoT baselines on GPT-4o and Gemini 2.0 Flash (by 4.84% and 5.94%, respectively, over a recent CoT baseline), with notable gains in numerical reasoning and strong generalization to other datasets such as LEMMA. By providing a rigorous benchmark and a scalable multi-view prompting framework, the work advances comprehensive scene understanding for context-aware visual assistants and embodied AI systems.

Abstract

Large vision-language models (LVLMs) are increasingly deployed in interactive applications such as virtual and augmented reality, where a first-person (egocentric) view captured by head-mounted cameras serves as key input. While this view offers fine-grained cues about user attention and hand-object interactions, its narrow field of view and lack of global context often lead to failures on spatially or contextually demanding queries. To address this, we introduce a framework that augments egocentric inputs with third-person (exocentric) views, providing complementary information such as global scene layout and object visibility to LVLMs. We present E3VQA, the first benchmark for multi-view question answering with 4K high-quality question-answer pairs grounded in synchronized ego-exo image pairs. Additionally, we propose M3CoT, a training-free prompting technique that constructs a unified scene representation by integrating scene graphs from three complementary perspectives. M3CoT enables LVLMs to reason more effectively across views, yielding consistent performance gains (4.84% for GPT-4o and 5.94% for Gemini 2.0 Flash) over a recent CoT baseline. Our extensive evaluation reveals key strengths and limitations of LVLMs in multi-view reasoning and highlights the value of leveraging both egocentric and exocentric inputs. The dataset and source code are available at https://github.com/Leeinsu1/Towards-Comprehensive-Scene-Understanding.
