PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation

Muntasir Wahed, Kiet A. Nguyen, Adheesh Sunil Juvekar, Xinzhuo Li, Xiaona Zhou, Vedant Shah, Tianjiao Yu, Pinar Yanardag, Ismini Lourentzou

TL;DR

This work defines the task of multi-image pixel-grounded reasoning and introduces M$^4$Seg, a large-scale dataset for cross-image grounding with pixel-level masks. It then presents PRIMA, an LVLM that uses the SQuARE vision module to compute cross-image relational features and a SAM-based decoder to produce pixel-grounded reasoning across multiple images. In the reported experiments, PRIMA surpasses both general-purpose and pixel-grounding baselines on segmentation and grounding metrics, which highlights the importance of explicit cross-image interactions for fine-grained reasoning. Combining a dedicated cross-image encoder with open-world segmentation grounds natural-language explanations in pixel-level masks, which supports detailed visual analysis and comparison across scenes.

Abstract

Despite significant advances in the capabilities of Large Vision-Language Models (LVLMs), existing pixel-grounding models operate in single-image settings, which limits their ability to perform detailed, fine-grained comparisons across multiple images. Conversely, current multi-image understanding models lack pixel-level grounding. Our work addresses this gap by introducing the task of multi-image pixel-grounded reasoning alongside PRIMA, an LVLM that integrates pixel-level grounding with robust multi-image reasoning to produce contextually rich, pixel-grounded explanations. Central to PRIMA is SQuARE, a vision module that injects cross-image relational context into compact query-based visual tokens before fusing them with the language backbone. To support training and evaluation, we curate M$^4$Seg, a new multi-image reasoning segmentation benchmark consisting of $\sim$744K question-answer pairs that require fine-grained visual understanding across multiple images. PRIMA outperforms state-of-the-art baselines with $7.83\%$ and $11.25\%$ improvements in Recall and S-IoU, respectively. Ablation studies further demonstrate the effectiveness of the proposed SQuARE module in capturing cross-image relationships.

Paper Structure

This paper contains 25 sections, 3 equations, 22 figures, 15 tables.

Figures (22)

  • Figure 1: We introduce the new task of multi-image pixel-grounded reasoning. To support this task, we curate M$^4$Seg, a benchmark providing question-answer (QA) pairs alongside image sets with pixel-level annotations. Additionally, we propose Prima, a model trained on M$^4$Seg, designed to efficiently identify and compare objects' contextual relationships across scenes. We focus on four key categories essential for multi-image understanding: functional, spatial, numerical, and open-ended reasoning.
  • Figure 2: (a) Performance comparison across different types of reasoning. (b) M$^4$Seg statistics. The five levels represent data distribution w.r.t. (1) annotation sources, (2) question type, (3) # unique objects and (4) # unique parts in images of a sample, and (5) # target masks in an answer.
  • Figure 3: Overview of the proposed Prima architecture. Leveraging a LoRA-finetuned language model, a novel SQuARE vision encoder (details in Fig. 4), and a SAM-based decoder, Prima dynamically generates segmentation masks corresponding to objects referenced in natural language queries, supporting pixel-grounded reasoning in complex multi-image tasks.
  • Figure 4: Our proposed SQuARE module. Learnable relational queries attend over the concatenated multi-image features to form a shared relational representation. This representation is injected into the query pathway for global feature extraction, producing enriched visual representations that capture cross-image interactions.
  • Figure 5: M$^4$Seg distribution plots for (a) objects, (b) parts, and (c) segmentation targets. Here, (a) and (b) illustrate the frequency of the i-th object and i-th part, respectively, sorted by frequency, while (c) shows the percentage of answers containing $m$ masks.
  • ...and 17 more figures
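
The SQuARE description in Figure 4 (learnable relational queries attend over concatenated multi-image features, and the result is injected into the query pathway) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it uses single-head attention with no learned projections, random arrays stand in for learned queries and image features, and the injection step (adding the mean relational vector to each global query) is an assumption chosen for simplicity. The names `cross_attention`, `relational_query_sketch`, `relational_queries`, and `global_queries` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, feats):
    """Single-head scaled dot-product attention (no learned projections).

    queries: (n_queries, d); feats: (n_tokens, d), used as both keys and values.
    Returns the attended output (n_queries, d) and the weights (n_queries, n_tokens).
    """
    d = queries.shape[-1]
    weights = softmax(queries @ feats.T / np.sqrt(d), axis=-1)
    return weights @ feats, weights


def relational_query_sketch(image_feats, relational_queries, global_queries):
    """Illustrative two-stage flow loosely following the Figure 4 caption.

    image_feats: list of (n_patches_i, d) arrays, one per image.
    """
    # Stage 1: relational queries attend over all images' features at once.
    concat = np.concatenate(image_feats, axis=0)
    relational, rel_weights = cross_attention(relational_queries, concat)

    # Stage 2 (assumed injection): shift every global query by the mean
    # relational vector, then extract per-image tokens with those queries.
    enriched = global_queries + relational.mean(axis=0, keepdims=True)
    tokens = [cross_attention(enriched, f)[0] for f in image_feats]
    return relational, rel_weights, tokens


# Toy usage: two images with 5 and 7 patch features of width 8.
rng = np.random.default_rng(0)
d = 8
image_feats = [rng.normal(size=(5, d)), rng.normal(size=(7, d))]
relational_queries = rng.normal(size=(3, d))  # stand-in for learned parameters
global_queries = rng.normal(size=(4, d))      # stand-in for learned parameters

relational, rel_weights, tokens = relational_query_sketch(
    image_feats, relational_queries, global_queries
)
print(relational.shape, rel_weights.shape, [t.shape for t in tokens])
```

In this sketch each image yields the same number of output tokens regardless of its patch count, which mirrors the "compact query-based visual tokens" wording in the abstract; how the actual model parameterizes and injects the relational representation is not specified in this excerpt.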