PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation
Muntasir Wahed, Kiet A. Nguyen, Adheesh Sunil Juvekar, Xinzhuo Li, Xiaona Zhou, Vedant Shah, Tianjiao Yu, Pinar Yanardag, Ismini Lourentzou
TL;DR
This work defines the task of multi-image pixel-grounded reasoning and introduces M4Seg, a large-scale dataset for cross-image grounding with pixel-level masks. It then presents PRIMA, an LVLM whose SQuARE vision module computes cross-image relational features and whose SAM-based decoder produces pixel-level masks across multiple images. In experiments, PRIMA surpasses both general-purpose and pixel-grounding baselines on segmentation and grounding metrics, which points to the importance of explicit cross-image interaction for fine-grained reasoning. By combining a dedicated cross-image encoder with open-world segmentation, PRIMA grounds natural language explanations in pixel-precise masks, supporting detailed visual analysis and comparison across scenes.
Abstract
Despite significant advances in the capabilities of Large Vision-Language Models (LVLMs), existing pixel-grounding models operate in single-image settings, limiting their ability to perform detailed, fine-grained comparisons across multiple images. Conversely, current multi-image understanding models lack pixel-level grounding. Our work addresses this gap by introducing the task of multi-image pixel-grounded reasoning alongside PRIMA, an LVLM that integrates pixel-level grounding with robust multi-image reasoning to produce contextually rich, pixel-grounded explanations. Central to PRIMA is SQuARE, a vision module that injects cross-image relational context into compact query-based visual tokens before fusing them with the language backbone. To support training and evaluation, we curate M4Seg, a new multi-image reasoning segmentation benchmark consisting of $\sim$744K question-answer pairs that require fine-grained visual understanding across multiple images. PRIMA outperforms state-of-the-art baselines with $7.83\%$ and $11.25\%$ improvements in Recall and S-IoU, respectively. Ablation studies further demonstrate the effectiveness of the proposed SQuARE module in capturing cross-image relationships.
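To make the idea of injecting cross-image relational context into compact query-based visual tokens more concrete, the following is a minimal NumPy sketch. It is an illustration under stated assumptions, not the SQuARE implementation: the function names (`cross_attention`, `relational_query_tokens`), the two-step structure, the residual addition, and the use of single-head attention without learned projections are all hypothetical simplifications chosen for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    # Single-head scaled dot-product attention with no learned projections
    # (an assumption made to keep the sketch short).
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values


def relational_query_tokens(image_feats, base_queries):
    # image_feats: list of (num_patches_i, d) arrays, one per image.
    # base_queries: (num_queries, d) array shared across images.
    all_patches = np.concatenate(image_feats, axis=0)
    # Step 1 (hypothetical): queries gather context from all images jointly.
    relational_queries = base_queries + cross_attention(base_queries, all_patches)
    # Step 2 (hypothetical): the enriched queries extract a fixed number of
    # compact tokens from each image separately.
    return [cross_attention(relational_queries, f) for f in image_feats]


rng = np.random.default_rng(0)
d = 16
feats = [rng.normal(size=(49, d)), rng.normal(size=(36, d))]  # two images
queries = rng.normal(size=(8, d))                              # 8 query tokens
tokens = relational_query_tokens(feats, queries)
print([t.shape for t in tokens])  # [(8, 16), (8, 16)]
```

In this sketch each image is reduced to the same small number of tokens regardless of its patch count, and the tokens for one image depend on the content of the other images through the shared relational queries. In a full model these tokens would then be passed to the language backbone.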
