Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs

Insu Lee, Wooje Park, Jaeyun Jang, Minyoung Noh, Kyuhong Shim, Byonghyo Shim

TL;DR

This work tackles the limits of egocentric visual inputs for large vision-language models by introducing the E3VQA benchmark, which evaluates joint reasoning over synchronized ego- and exocentric views across four reasoning categories. It presents M3CoT, a training-free prompting scheme that fuses three scene graphs generated from different view combinations into a unified representation, enabling more robust cross-view reasoning. Empirical results show that M3CoT consistently improves over strong CoT baselines on GPT-4o and Gemini 2.0 Flash, with notable gains in numerical reasoning and strong generalization across datasets like LEMMA. By providing a rigorous benchmark and a scalable multi-view prompting framework, the work advances comprehensive scene understanding for context-aware visual assistants and embodied AI systems.

Abstract

Large vision-language models (LVLMs) are increasingly deployed in interactive applications such as virtual and augmented reality, where a first-person (egocentric) view captured by head-mounted cameras serves as a key input. While this view offers fine-grained cues about user attention and hand-object interactions, its narrow field of view and lack of global context often lead to failures on spatially or contextually demanding queries. To address this, we introduce a framework that augments egocentric inputs with third-person (exocentric) views, providing complementary information such as global scene layout and object visibility to LVLMs. We present E3VQA, the first benchmark for multi-view question answering with 4K high-quality question-answer pairs grounded in synchronized ego-exo image pairs. Additionally, we propose M3CoT, a training-free prompting technique that constructs a unified scene representation by integrating scene graphs from three complementary perspectives. M3CoT enables LVLMs to reason more effectively across views, yielding consistent performance gains (4.84% for GPT-4o and 5.94% for Gemini 2.0 Flash) over a recent CoT baseline. Our extensive evaluation reveals key strengths and limitations of LVLMs in multi-view reasoning and highlights the value of leveraging both egocentric and exocentric inputs. The dataset and source code are available at https://github.com/Leeinsu1/Towards-Comprehensive-Scene-Understanding.
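
To make the idea of a unified scene representation concrete, the following is a minimal sketch of combining three scene graphs into one. It is an illustration under stated assumptions, not the authors' implementation: the `Triple` type, the function names `merge_scene_graphs` and `format_scene_graph`, the example objects and relations, and the choice of plain set union as the merge rule are all hypothetical. The abstract does not specify how M3CoT performs the integration.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    """One scene-graph edge: (subject, relation, object). Hypothetical schema."""
    subject: str
    relation: str
    obj: str


SceneGraph = Set[Triple]


def merge_scene_graphs(*graphs: SceneGraph) -> SceneGraph:
    """Union the edges of several scene graphs, dropping exact duplicates.

    Assumption: set union stands in for the merge step; the actual
    M3CoT procedure is not described in the abstract.
    """
    merged: SceneGraph = set()
    for graph in graphs:
        merged |= graph
    return merged


def format_scene_graph(graph: SceneGraph) -> str:
    """Serialize a scene graph to sorted text lines, e.g. for use in a prompt."""
    return "\n".join(f"({t.subject}, {t.relation}, {t.obj})" for t in sorted(graph))


# Three hypothetical per-perspective graphs (illustrative contents only).
graph_a = {Triple("person", "holds", "cup")}
graph_b = {Triple("cup", "on", "table"), Triple("person", "holds", "cup")}
graph_c = {Triple("lamp", "next to", "sofa")}

unified = merge_scene_graphs(graph_a, graph_b, graph_c)
print(format_scene_graph(unified))
```

In this sketch, an edge seen from only one perspective (such as the lamp next to the sofa) still appears in the unified graph, which mirrors the stated goal of letting views complement each other's missing objects and relations.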

Paper Structure

This paper contains 52 sections, 45 figures, 10 tables.

Figures (45)

  • Figure 1: Conceptual illustration of example scenarios that require a joint understanding of egocentric (first-person) and exocentric (third-person) views. In each scenario, the first question can be answered using only the egocentric view, while the subsequent two questions require integrating information from both views. Yellow and gray overlays indicate egocentric and exocentric views, respectively.
  • Figure 2: Categories in the E3VQA benchmark. Each question is paired with ego-exo images and multiple-choice answers. The answers are highlighted in bold. The left part shows recognition categories, assessing the ability to focus on question-relevant parts. The right part shows reasoning categories, evaluating the ability to integrate information across views.
  • Figure 3: Overview of the E3VQA benchmark's three-step automated QA generation pipeline: (a) single-view QA generation step, (b) view-specific response expansion step, and (c) response-based question filtering step.
  • Figure 4: Overview of the M3CoT method. Left: Scene graph generation process from the Ego&Exo perspectives. Center: Scene graph generation process from the Ego2Exo perspective. Right: Scene graph generation process from the Exo2Ego perspective. Scene graphs from each perspective are merged to complement missing objects and relations, enabling the model to perform coherent reasoning and answer generation.
  • Figure 5: Analysis of the benchmark construction pipeline and model performance under varied input conditions: (a) error rate across option-generation strategies, (b) proportion of correctly answered questions between retained and excluded questions, and (c) performance across different visual input modalities.
  • ...and 40 more figures