Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models

Bingshuai Liu; Chenyang Lyu; Zijun Min; Zhanyu Wang; Jinsong Su; Longyue Wang

TL;DR

This work tackles the challenge of selecting informative demonstration examples for multi-modal Chain-of-Thought reasoning in LLMs. It introduces MM-Retrieval, a retrieval-augmented framework that jointly exploits cross-modal and intra-modal similarities to dynamically retrieve demonstrations and uses Stratified Sampling to diversify examples, while incorporating visual signals through captioning and OCR. Empirical results on ScienceQA and MathVista show state-of-the-art gains across GPT-4, GPT-4V, and ChatGPT, with substantial improvements in average accuracy and robust ablations validating the contribution of each component. The approach holds practical significance for advancing multi-modal reasoning in LLMs and guides future work toward broader tasks and modalities, as well as reducing hallucination through better demonstration selection.

Abstract

The advancement of Large Language Models (LLMs) has brought substantial attention to the Chain of Thought (CoT) approach, primarily because it enhances the capability of LLMs on complex reasoning tasks. The significance of CoT approaches also extends to the application of LLMs to multi-modal tasks. However, the selection of optimal CoT demonstration examples in multi-modal reasoning remains underexplored for LLMs because of the inherent complexity of multi-modal examples. In this paper, we introduce a novel approach that addresses this challenge by using retrieval mechanisms to dynamically and automatically select demonstration examples based on cross-modal and intra-modal similarities. Furthermore, we employ a Stratified Sampling method that categorises demonstration examples into groups based on their types and then retrieves examples from the different groups, promoting the diversity of demonstration examples. Through a series of experiments on two popular benchmark datasets, ScienceQA and MathVista, we demonstrate that our approach significantly improves the performance of GPT-4 by 6% on ScienceQA and 12.9% on MathVista, and enhances the performance of GPT-4V on the two datasets by 2.7%, substantially improving the performance of the most advanced LLMs and LMMs on complex multi-modal reasoning tasks.
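The retrieval and sampling steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that text and image embeddings already live in a shared vector space (so that cross-modal cosine similarity is meaningful), that the groups used for stratified sampling are the four retrieval directions (text-to-text, text-to-image, image-to-text, image-to-image), and that the field names `id`, `text`, and `image` as well as the function `retrieve_demonstrations` are hypothetical. The vectors are toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Assumed grouping: (query modality, demonstration modality) pairs.
# The first and last are intra-modal, the middle two are cross-modal.
DIRECTIONS = [("text", "text"), ("text", "image"),
              ("image", "text"), ("image", "image")]

def retrieve_demonstrations(query, pool, k_per_group=1):
    """Take up to k_per_group unseen demonstrations from each retrieval direction."""
    selected, seen = [], set()
    for q_mod, d_mod in DIRECTIONS:
        # Rank the whole pool by similarity for this direction.
        ranked = sorted(pool,
                        key=lambda d: cosine(query[q_mod], d[d_mod]),
                        reverse=True)
        taken = 0
        for demo in ranked:
            if taken == k_per_group:
                break
            if demo["id"] not in seen:  # skip demos another direction already chose
                seen.add(demo["id"])
                selected.append(demo)
                taken += 1
    return selected

# Toy demonstration pool and query (hypothetical embeddings).
pool = [
    {"id": "a", "text": [1.0, 0.0], "image": [0.0, 1.0]},
    {"id": "b", "text": [0.0, 1.0], "image": [1.0, 0.0]},
    {"id": "c", "text": [1.0, 1.0], "image": [1.0, 1.0]},
]
query = {"text": [1.0, 0.0], "image": [0.0, 1.0]}

demos = retrieve_demonstrations(query, pool, k_per_group=1)
selected_ids = [d["id"] for d in demos]
```

Drawing a fixed quota from each direction, instead of taking a single global top-k, is what keeps the selected demonstrations from collapsing onto one kind of similarity; in a full pipeline the selected items would then be formatted into the prompt ahead of the test question.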

Paper Structure (22 sections, 4 equations, 8 figures, 4 tables)

Introduction
Related Work
Retrieval-Augmented Generation for LLMs
In-Context Learning
Chain-of-Thought Reasoning
Methodology
Incorporation of Visual Information to LLMs
Image Captioning
Optical Character Recognition
Retrieval Mechanism
Sampling Method
Final Prediction
Experiments
Experimental Setup
Datasets
...and 7 more sections

Figures (8)

Figure 1: Our MM-Retrieval approach dynamically retrieves demonstrations based on the question. Compared with CoT, it has better adaptability and can stimulate the reasoning ability of LLMs. The red $D_1$, $D_2$ represent demonstrations retrieved based on the question, while the blue $D_1$, $D_2$ represent the fixed samples used regardless of the question.
Figure 2: Results on different categories of ScienceQA [lu2022learn] and MathVista [lu2023mathvista]. Our proposed approach obtains substantial improvements over previous baselines, including CoT [lu2023chameleon], PoT [lu2023mathvista], and Chameleon [lu2023chameleon], with GPT-4 as the foundation model.
Figure 3: An overview of our proposed multi-modal retrieval method. We employ both cross-modality retrieval and intra-modality retrieval (text-modal and image-modal retrieval) to obtain relevant examples from the demonstration pool as retrieved demonstrations. These retrieved demonstrations are then combined with the prompt and the test question to form the input to the LLM.
Figure 4: A detailed illustration of our multi-modal retrieval approach, where we use intra-modal similarity and cross-modal similarity to sample demonstration examples $\boldsymbol{D}$ from the demonstration pool $\boldsymbol{Q}$.
Figure 5: Ablation study of four retrieval methods (Text-to-Text Retrieval, Text-to-Image Retrieval, Image-to-Text Retrieval, and Image-to-Image Retrieval) on ScienceQA (top) and MathVista (bottom). We inspect the performance of each retrieval approach with different numbers of demonstration examples.
...and 3 more figures
