OpenView: Empowering MLLMs with Out-of-view VQA

Qixiang Chen; Cheng Zhang; Chi-Wing Fu; Jingwen Ye; Jianfei Cai

TL;DR

This work introduces out-of-view (OOV) understanding for multimodal models and presents OpenView, a four-stage panorama-based data synthesis pipeline that automatically creates large-scale OOV VQA data. It yields OpenView-Dataset (over 158k questions from 16k panoramas) and OpenView-Bench (1,327 validated OOV questions) to train and evaluate reasoning about content beyond the visible frame. Empirical results show that fine-tuning multiple MLLMs with OpenView data substantially improves both answer and rationale quality, though the models still fall short of human performance when extrapolating to unseen context. The study provides a foundation for advancing spatial reasoning in vision-language systems and suggests directions toward extending OOV capabilities to video and world-modeling tasks.

Abstract

Recent multimodal large language models (MLLMs) show great potential in natural image understanding. Yet they perform well mainly when reasoning about in-view content within the image frame. This paper presents the first study on out-of-view (OOV) understanding, i.e., the ability to reason about objects, activities, and scenes beyond the visible frame of a perspective view. Our technical contributions are threefold. First, we design OpenView, a four-stage pipeline that generates multi-choice VQA at scale, leveraging panoramic imagery to enable context-rich and spatially grounded VQA synthesis with free-view framing. Second, we curate OpenView-Dataset, a high-quality synthetic dataset built from diverse real-world panoramas, to empower MLLMs through supervised fine-tuning. Third, we build OpenView-Bench, a benchmark that jointly measures choice and rationale accuracy for interpretable and diagnosable evaluation. Experimental results show that, although a large gap from human performance remains in OOV VQA answer selection, multiple MLLMs consistently improve once empowered by OpenView, rising from 48.6% to 64.1% on average. Code, benchmark, and data will be available at https://github.com/q1xiangchen/OpenView.
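To make the idea of jointly measuring choice and rationale accuracy concrete, the following is a minimal sketch. It is not the OpenView-Bench scoring protocol, which is not specified here. It assumes each model response has already been judged on two boolean criteria, and the names `JudgedAnswer`, `choice_correct`, `rationale_correct`, and `score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JudgedAnswer:
    """One judged model response (hypothetical record format, not from the paper)."""
    choice_correct: bool     # selected option matches the ground-truth option
    rationale_correct: bool  # rationale was judged consistent with the reference


def score(responses):
    """Return choice, rationale, and joint accuracy as fractions in [0, 1]."""
    if not responses:
        raise ValueError("need at least one judged response")
    n = len(responses)
    choice = sum(r.choice_correct for r in responses) / n
    rationale = sum(r.rationale_correct for r in responses) / n
    # Joint accuracy counts a response only when both criteria hold.
    joint = sum(r.choice_correct and r.rationale_correct for r in responses) / n
    return {"choice": choice, "rationale": rationale, "joint": joint}


# Toy data for illustration only; these are not benchmark results.
demo = [
    JudgedAnswer(True, True),
    JudgedAnswer(True, False),   # right option, unsupported reasoning
    JudgedAnswer(False, False),
    JudgedAnswer(True, True),
]
result = score(demo)
print(result)
```

Reporting the joint figure alongside the two separate accuracies helps separate responses that pick the right option with unsupported reasoning from those that are right for the right reason, which is one way to read the "diagnosable evaluation" goal.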
