Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning

Cheng Tan, Jingxuan Wei, Linzhuang Sun, Zhangyang Gao, Siyuan Li, Bihui Yu, Ruifeng Guo, Stan Z. Li

TL;DR

This work addresses the underexplored use of retrieval-augmented generation for multimodal reasoning by introducing RMR, a training-free framework whose bi-modality, CLIP-based retrieval module fetches high-school-level question-rationale-answer (QRA) triplets from a knowledge library derived from ScienceQA. The retrieved QRA triplets are organized into a structured context that guides in-context reasoning, so that vision-language models reason through a problem rather than merely reproduce answers. Empirical results show substantial improvements across multiple benchmarks (ScienceQA, A-OKVQA, MMBench, SEED-Bench) and backbone models, including large gains for Gemini and notable gains for other architectures, all without fine-tuning. The approach highlights the potential of modular, retrieval-guided reasoning to boost multimodal capabilities without additional training. Its limitations include the coverage of the knowledge library, potential biases, and the energy cost of inference.

Abstract

Large language models equipped with retrieval-augmented generation (RAG) represent a burgeoning field aimed at enhancing answering capabilities by leveraging external knowledge bases. Although the application of RAG with language-only models has been extensively explored, its adaptation into multimodal vision-language models remains nascent. Going beyond mere answer generation, the primary goal of multimodal RAG is to cultivate the models' ability to reason in response to relevant queries. To this end, we introduce a novel multimodal RAG framework named RMR (Retrieval Meets Reasoning). The RMR framework employs a bi-modal retrieval module to identify the most relevant question-answer pairs, which then serve as scaffolds for the multimodal reasoning process. This training-free approach not only encourages the model to engage deeply with the reasoning processes inherent in the retrieved content but also facilitates the generation of answers that are precise and richly interpretable. Surprisingly, utilizing solely the ScienceQA dataset, collected from elementary and high school science curricula, RMR significantly boosts the performance of various vision-language models across a spectrum of benchmark datasets, including A-OKVQA, MMBench, and SEED. These outcomes highlight the substantial potential of our multimodal retrieval and reasoning mechanism to improve the reasoning capabilities of vision-language models.
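
The retrieve-then-reason flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the tiny hand-written vectors stand in for CLIP text and image embeddings, the weighted-sum fusion rule and its `alpha` parameter are assumptions, and every name (`Triplet`, `retrieve`, `build_context`, `LIBRARY`) is hypothetical. The sketch ranks library entries by a combined text and image similarity, then arranges the top question-rationale-answer triplets into a context that ends with the new question.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Triplet:
    """One library entry: a question-rationale-answer (QRA) triplet."""
    question: str
    rationale: str
    answer: str
    text_emb: List[float]   # stand-in for a CLIP text embedding
    image_emb: List[float]  # stand-in for a CLIP image embedding


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_text_emb, query_image_emb, library, k=2, alpha=0.5):
    """Return the k entries with the highest fused similarity.

    Assumed fusion rule (not taken from the paper): a weighted sum of
    the text-to-text and image-to-image cosine similarities.
    """
    def score(t: Triplet) -> float:
        return (alpha * cosine(query_text_emb, t.text_emb)
                + (1 - alpha) * cosine(query_image_emb, t.image_emb))
    return sorted(library, key=score, reverse=True)[:k]


def build_context(retrieved, question: str) -> str:
    """Lay out retrieved triplets as worked examples, then the open question."""
    blocks = [
        f"Question: {t.question}\nRationale: {t.rationale}\nAnswer: {t.answer}"
        for t in retrieved
    ]
    # The final block leaves the rationale open so the model reasons first.
    blocks.append(f"Question: {question}\nRationale:")
    return "\n\n".join(blocks)


# Toy library with made-up 2-d embeddings, for illustration only.
LIBRARY = [
    Triplet("Which material is magnetic?", "Iron is attracted to magnets.",
            "iron", [1.0, 0.0], [1.0, 0.0]),
    Triplet("Which animal is a mammal?", "Mammals feed milk to their young.",
            "dog", [0.0, 1.0], [0.0, 1.0]),
    Triplet("Which object is attracted to a magnet?", "Steel contains iron.",
            "steel nail", [0.9, 0.1], [0.8, 0.2]),
]

top = retrieve([1.0, 0.1], [1.0, 0.0], LIBRARY, k=2)
print(build_context(top, "Will a paper clip stick to a magnet?"))
```

In a real pipeline the embeddings would come from a CLIP encoder and the resulting string would be passed, together with the query image, to a vision-language model. Ending the context on an open rationale is one way to prompt for reasoning before an answer.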

Paper Structure (26 sections, 8 equations, 9 figures, 4 tables)


Figures (9)

  • Figure 1: Limitations of multimodal retrieval enhancement with simple question-answer pairs.
  • Figure 2: The overall architecture and the retrieval mechanism of the bi-modality retrieval module.
  • Figure 3: The reasoning process from the retrieved content. The model uses the organized context from retrieved question-rationale-answer triplets to generate answers.
  • Figure 4: Comparative performance of Gemini and Gemini+RMR on the MMBench-Dev and MMBench-Test datasets.
  • Figure 5: Performance on SEED-Bench dataset.
  • ...and 4 more figures