OpenView: Empowering MLLMs with Out-of-view VQA

Qixiang Chen, Cheng Zhang, Chi-Wing Fu, Jingwen Ye, Jianfei Cai

TL;DR

This work introduces out-of-view (OOV) understanding for multimodal models and presents OpenView, a four-stage panorama-based data synthesis pipeline that automatically creates large-scale OOV VQA data. It yields OpenView-Dataset (over 158k questions from 16k panoramas) and OpenView-Bench (1,327 validated OOV questions) to train and evaluate reasoning about content beyond the visible frame. Empirical results show that fine-tuning multiple MLLMs with OpenView data substantially improves both answer and rationale quality, though a gap from human performance remains when extrapolating to unseen context. The study provides a foundation for advancing spatial reasoning in vision-language systems and suggests directions for extending OOV capabilities to video and world-modeling tasks.

Abstract

Recent multimodal large language models (MLLMs) show great potential in natural image understanding. Yet they perform well mainly when reasoning about in-view content within the image frame. This paper presents the first study on out-of-view (OOV) understanding, i.e., the ability to reason about objects, activities, and scenes beyond the visible frame of a perspective view. Our technical contributions are threefold. First, we design OpenView, a four-stage pipeline that generates multi-choice VQA at scale, leveraging panoramic imagery to enable context-rich, spatially grounded VQA synthesis with free-view framing. Second, we curate OpenView-Dataset, a high-quality synthetic dataset built from diverse real-world panoramas, to empower MLLMs through supervised fine-tuning. Third, we build OpenView-Bench, a benchmark that jointly measures choice and rationale accuracy for interpretable and diagnosable evaluation. Experimental results show that, although a large gap from human performance remains in OOV VQA answer selection, multiple MLLMs consistently improve once empowered by OpenView, rising from 48.6% to 64.1% on average. Code, benchmark, and data will be available at https://github.com/q1xiangchen/OpenView.

Paper Structure

This paper contains 25 sections, 26 figures, 18 tables.

Figures (26)

  • Figure 1: Examples of out-of-view visual question answering (VQA) on a busy street. OpenView synthesizes multi-choice out-of-view VQA data, featuring both contextual and directional question types. Empowered with OpenView, various models substantially improve their performance (mid right vs. bottom right) on out-of-view VQA, bringing their capability closer to human performance (top right).
  • Figure 2: Left: Overview of the OpenView pipeline for multi-choice VQA generation. I. Panorama Annotation (Section \ref{sec:pano_annotation}) includes Stage 1, which collects and samples panoramic images with filtering, and Stage 2, the visual analyzer, which produces spatially grounded captions for local patches and a comprehensive summary for each panorama. II. Multi-choice VQA Creation (Section \ref{sec:mcvqa_generation}) includes Stage 3, which generates multi-choice VQA proposals via view framing, question formulation, and answer elaboration using predefined prompt templates for contextual and directional OOV tasks, and Stage 4, which improves the synthesis quality through format refinement and confidence-based filtering, followed by an augmentation step that shuffles the options and jitters the view. Right: Overview of the 11 scene categories covered by the generated VQAs, featuring diverse functional and environmental characteristics.
  • Figure 3: Analysis of OpenView-Dataset. (a) Question patterns exhibit high linguistic consistency, guiding the model to focus on visual understanding. (b) The option word cloud reveals diverse object and scene terms, whereas (c) the rationale word cloud highlights recurring reasoning phrases commonly used across questions. (d) Scene category distribution across indoor and outdoor environments, illustrating the broad coverage of real-world scenarios represented in the panorama collection.
  • Figure 4: Example of outdoor image- and text-conditioned panorama generation results. We employ Qwen3-VL-8B-Instruct and its fine-tuned variant as assistants for adjacent-view description generation. Conditioned views are bounded in red. Objects are colorized to highlight repetitive or salient elements. For a fair OOV comparison, we show generated descriptions for the left and right 90$^\circ$ views, which are non-overlapping yet closest to the conditioned image. The fine-tuned model provides more informative captions than the base model and also offers improved guidance for rear-facing views.
  • Figure 5: Additional statistical analysis of OpenView-Dataset. Word length distributions are shown for questions, options, and rationales.
  • ...and 21 more figures
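
The Figure 2 caption mentions a Stage 4 augmentation step that shuffles the options and jitters the view. A minimal Python sketch of what such a step might look like is given here. It is not the authors' implementation: the `MultiChoiceVQA` record, the function names, the `yaw_deg` field, and the 5-degree jitter range are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MultiChoiceVQA:
    """Hypothetical record for one multi-choice OOV question."""
    question: str
    options: List[str]   # answer texts, in display order
    answer_index: int    # index into `options` of the correct answer
    yaw_deg: float       # assumed horizontal view direction in the panorama


def shuffle_options(sample: MultiChoiceVQA, rng: random.Random) -> MultiChoiceVQA:
    """Permute the options and move the answer index along with them."""
    order = list(range(len(sample.options)))
    rng.shuffle(order)
    new_options = [sample.options[i] for i in order]
    # The correct answer now sits wherever its old index landed in `order`.
    new_answer = order.index(sample.answer_index)
    return MultiChoiceVQA(sample.question, new_options, new_answer, sample.yaw_deg)


def jitter_view(sample: MultiChoiceVQA, rng: random.Random,
                max_jitter_deg: float = 5.0) -> MultiChoiceVQA:
    """Perturb the view direction by a small random yaw offset (assumed range)."""
    offset = rng.uniform(-max_jitter_deg, max_jitter_deg)
    new_yaw = (sample.yaw_deg + offset) % 360.0  # wrap around the panorama
    return MultiChoiceVQA(sample.question, list(sample.options),
                          sample.answer_index, new_yaw)


if __name__ == "__main__":
    rng = random.Random(0)
    sample = MultiChoiceVQA(
        question="What is most likely to the left of this view?",
        options=["a bus stop", "a river", "a kitchen", "a stage"],
        answer_index=0,
        yaw_deg=359.0,
    )
    augmented = jitter_view(shuffle_options(sample, rng), rng)
    print(augmented.options, augmented.answer_index, augmented.yaw_deg)
```

Shuffling guards against a model learning the position of the correct option rather than its content. For directional questions, a real pipeline would presumably also need to check that the jittered view still supports the original answer; this sketch does not attempt that.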