Multimodal Hypothetical Summary for Retrieval-based Multi-image Question Answering

Peize Li, Qingyi Si, Peng Fu, Zheng Lin, Yan Wang

TL;DR

This work tackles retrieval-based multi-image QA by addressing the misalignment between retrieval and QA in traditional pipelines. It introduces Multimodal Hypothetical Summary (MHyS), which replaces real images with question-form and description-form textual summaries, enabling text-to-text retrieval and end-to-end optimization with a contrastive enhancement loss and VQA loss. The method combines sentence-level CLIP-based matching with word-level multimodal encoding (VL-BART) in a coarse-to-fine retrieval framework, selecting a final set of candidate images to augment VQA. Empirical results on RETVQA show clear gains over state-of-the-art two-stage methods and substantial improvements over CLIP-based baselines, with extensive ablations validating the contributions of MHyS, multi-granularity retrieval, and the training objectives. Overall, MHyS demonstrates that transforming cross-modal retrieval into text-based matching can substantially improve accuracy and robustness in retrieval-based multi-image QA, with practical implications for scalable multimodal QA systems.
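The coarse-to-fine idea can be illustrated with a minimal sketch. It assumes that sentence-level and word-level embeddings of the question and of each image's textual summary have already been computed; the function names, the candidate dictionary layout, and the max-over-words scoring rule are hypothetical choices for illustration, not the paper's CLIP and VL-BART implementation.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def word_level_score(query_words: List[Vector], summary_words: List[Vector]) -> float:
    """Assumed fine-grained score: for each query word, take its best-matching
    summary word, then average over the query words."""
    if not query_words or not summary_words:
        return 0.0
    best = [max(cosine(q, w) for w in summary_words) for q in query_words]
    return sum(best) / len(best)


def coarse_to_fine(
    query_sent: Vector,
    query_words: List[Vector],
    candidates: Dict[str, dict],
    coarse_k: int,
    final_k: int,
) -> List[str]:
    """Rank image ids: sentence-level shortlist first, word-level rerank second.

    `candidates` maps an image id to {"sent": Vector, "words": List[Vector]},
    i.e. embeddings of that image's textual summary (hypothetical layout).
    """
    # Coarse stage: keep the coarse_k summaries closest to the question.
    shortlist = sorted(
        candidates,
        key=lambda i: cosine(query_sent, candidates[i]["sent"]),
        reverse=True,
    )[:coarse_k]
    # Fine stage: rerank the shortlist with the word-level score.
    reranked = sorted(
        shortlist,
        key=lambda i: word_level_score(query_words, candidates[i]["words"]),
        reverse=True,
    )
    return reranked[:final_k]
```

In this sketch the coarse stage only prunes the candidate pool, while the final order is decided entirely by the word-level score; combining the two scores is another reasonable design that the sketch does not cover.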

Abstract

The retrieval-based multi-image question answering (QA) task involves retrieving multiple question-related images and synthesizing them to generate an answer. Conventional "retrieve-then-answer" pipelines often suffer from cascading errors because the QA training objective does not optimize the retrieval stage. To address this issue, we propose a novel method that effectively introduces and references retrieved information in QA. Given the image set to be retrieved, we employ a multimodal large language model (visual perspective) and a large language model (textual perspective) to obtain a multimodal hypothetical summary (MHyS) in question-form and description-form. By combining the visual and textual perspectives, MHyS captures image content more specifically and replaces real images during retrieval, which eliminates the modality gap by turning the task into text-to-text retrieval and improves retrieval quality. To integrate retrieval with QA more effectively, we employ contrastive learning to align queries (questions) with MHyS. Moreover, we propose a coarse-to-fine strategy that computes both sentence-level and word-level similarity scores to further enhance retrieval and filter out irrelevant details. Our approach achieves a 3.7% absolute improvement over state-of-the-art methods on RETVQA and a 14.5% improvement over CLIP. Comprehensive experiments and detailed ablation studies demonstrate the superiority of our method.
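As a rough illustration of aligning a question with its matching summary through contrastive learning, the sketch below computes a generic InfoNCE-style loss over one row of precomputed similarity scores. This is a standard formulation chosen for illustration; the function name, the temperature default, and the single-positive setup are assumptions, and the paper's own contrastive loss may differ.

```python
import math
from typing import Sequence


def info_nce(sim_row: Sequence[float], positive_index: int, temperature: float = 0.07) -> float:
    """Generic InfoNCE loss for one question.

    `sim_row` holds similarities between the question and every candidate
    summary; `positive_index` marks the summary of a relevant image.
    Returns -log softmax(sim / temperature)[positive_index].
    """
    logits = [s / temperature for s in sim_row]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - logits[positive_index]
```

The loss shrinks as the relevant summary's similarity rises relative to the others, which is the sense in which training pulls questions toward their matching summaries.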

Paper Structure

This paper contains 24 sections, 18 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: An illustration of our motivation. Compared to the "retrieve-then-answer" pipeline, our approach leverages multimodal hypothetical summary (MHyS) to transform cross-modality retrieval into text-to-text retrieval, effectively introducing and referencing retrieval into QA.
  • Figure 2: The overview of our approach. Multimodal Hypothetical Summary (MHyS) employs a multimodal large language model (visual perspective) and a large language model (textual perspective) to obtain both question-form and description-form hypothetical summaries, which replace real images during retrieval and eliminate the modality gap by transforming the task into text-to-text retrieval. Multi-granularity Retrieval calculates sentence-level and word-level similarities to rank images. To capture more information, the selected real images (based on similarity scores) are combined with their MHyS to generate the answers.
  • Figure 3: Performance comparison of our method with various baselines under different numbers of retrieved images.
  • Figure 4: Performance with different question types.
  • Figure 5: Qualitative comparison between our method and the baseline.