More than the Sum: Panorama-Language Models for Adverse Omni-Scenes

Weijia Fan, Ruiping Liu, Jiale Wei, Yufan Chen, Junwei Zheng, Zichao Zeng, Jiaming Zhang, Qiufu Li, Linlin Shen, Rainer Stiefelhagen

TL;DR

This paper introduces the Panorama-Language Modeling (PLM) paradigm, a unified vision-language reasoning approach that is more than the sum of its pinhole counterparts, and develops a plug-and-play panoramic sparse attention module that allows existing pinhole-based VLMs to process equirectangular panoramas without retraining.

Abstract

Existing vision-language models (VLMs) are tailored for pinhole imagery, stitching multiple narrow field-of-view inputs to piece together a complete omni-scene understanding. Yet, such multi-view perception overlooks the holistic spatial and contextual relationships that a single panorama inherently preserves. In this work, we introduce the Panorama-Language Modeling (PLM) paradigm, a unified $360^\circ$ vision-language reasoning approach that is more than the sum of its pinhole counterparts. In addition, we present PanoVQA, a large-scale panoramic VQA dataset that involves adverse omni-scenes, enabling comprehensive reasoning under object occlusions and driving accidents. To establish a foundation for PLM, we develop a plug-and-play panoramic sparse attention module that allows existing pinhole-based VLMs to process equirectangular panoramas without retraining. Extensive experiments demonstrate that our PLM achieves superior robustness and holistic reasoning under challenging omni-scenes, yielding understanding greater than the sum of its narrow parts. Project page: https://github.com/InSAI-Lab/PanoVQA.

Paper Structure (36 sections, 14 equations, 14 figures, 9 tables)


Figures (14)

  • Figure 1: Overview of Panorama-Language Modeling (PLM). (a) To enable PLM, we create the first PanoVQA dataset with 653K QA pairs, covering normal (N), occluded (O), and accidental (D) driving scenarios. (b) Compared to narrow-FoV multi-view VLMs, PLM with $360^\circ$ spatial semantic consistency can identify potential risks (e.g., a van in the front-left). (c) Evaluated on PanoVQA, our proposed PLM significantly outperforms all other models across all categories, yielding superior omni-scene understanding.
  • Figure 2: 1-Pano (41.42$\%$) outperforms 6-Cam (40.22$\%$) on PanoVQA-mini. The panorama's seamless $360^\circ$ context is key for spatial awareness. As shown, the 6-cam model fails the query, e.g., misidentifying the direction. In contrast, the 1-Pano model leverages the full context to, e.g., correctly locate the object, matching the GT. More examples can be found in the supplementary.
  • Figure 3: Panorama generation overview. Following wei2024onebev, we center a viewing sphere at $O$. For each camera, an image plane $I_i$ is tangent to a concentric sphere at $M_i$ on its optical axis. For each panorama pixel $P_{360_{jk}}$, we cast a ray $l_{jk}$ from $O$, intersect it with the tangent planes, and sample the color at the projected image coordinate $N_{jk}$. Overlaps are resolved by a fixed camera order ("first hit wins"), ensuring consistency without feature matching.
  • Figure 4: Left: Structure of our proposed attention block with SWA and PSA. Right: The visualization of attention masks for Sliding Window Attention (SWA), Simplified Sparse Attention (SSA), and Panoramic Sparse Attention (PSA), respectively.
  • Figure 5: Analysis of the performance-parameter trade-off. Left: Impact of varying bottleneck dimensions. Right: Impact of varying the selected Top-K.
  • ...and 9 more figures
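
The ray-casting procedure described in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes idealized pinhole cameras that share the sphere centre $O$ and differ only by a yaw angle, and the names `pixel_to_ray`, `project_to_camera`, `first_hit`, `build_lookup`, and the camera fields (`yaw`, `focal`, `w`, `h`) are hypothetical. A real rig would use the calibrated intrinsics and extrinsics of each camera, and the colour sampling step (e.g. bilinear interpolation at the returned coordinate) is left out.

```python
import math

def pixel_to_ray(j, k, pano_h, pano_w):
    """Unit ray from the sphere centre O through equirectangular pixel (row j, col k)."""
    lon = (k + 0.5) / pano_w * 2.0 * math.pi - math.pi  # longitude, 0 = forward
    lat = math.pi / 2.0 - (j + 0.5) / pano_h * math.pi  # latitude, +pi/2 = zenith
    # Axes (assumed convention): x right, y up, z forward.
    return (math.cos(lat) * math.sin(lon), math.sin(lat), math.cos(lat) * math.cos(lon))

def project_to_camera(ray, yaw, focal, img_w, img_h):
    """Intersect the ray with one camera's tangent image plane.

    Returns the image coordinate (u, v) in pixels, or None if the ray
    points away from the camera or lands outside the image.
    """
    dx, dy, dz = ray
    depth = dx * math.sin(yaw) + dz * math.cos(yaw)  # component along the optical axis
    if depth <= 1e-9:
        return None
    right = dx * math.cos(yaw) - dz * math.sin(yaw)  # component along the camera's x axis
    u = img_w / 2.0 + focal * right / depth
    v = img_h / 2.0 - focal * dy / depth
    if 0.0 <= u < img_w and 0.0 <= v < img_h:
        return (u, v)
    return None

def first_hit(ray, cameras):
    """Resolve overlaps by fixed camera order: the first camera that sees the ray wins."""
    for idx, cam in enumerate(cameras):
        hit = project_to_camera(ray, cam["yaw"], cam["focal"], cam["w"], cam["h"])
        if hit is not None:
            return idx, hit
    return None

def build_lookup(pano_h, pano_w, cameras):
    """For every panorama pixel, record (camera index, image coordinate) or None."""
    return [
        [first_hit(pixel_to_ray(j, k, pano_h, pano_w), cameras) for k in range(pano_w)]
        for j in range(pano_h)
    ]

# Hypothetical rig: six 100x100 cameras, 60 degrees apart, each with a 90 degree FoV
# (focal = w / 2), so neighbouring views overlap by 30 degrees.
cameras = [
    {"yaw": math.radians(60 * i), "focal": 50.0, "w": 100, "h": 100} for i in range(6)
]
lookup = build_lookup(4, 8, cameras)
```

Because the lookup depends only on the camera geometry, it can be computed once and reused for every frame. Pixels that no camera covers (here, rays near the zenith and nadir) stay `None`, and changing the order of `cameras` changes which view supplies the overlapping regions.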