Mixture of Rationale: Multi-Modal Reasoning Mixture for Visual Question Answering

Tao Li, Linjun Shou, Xuejun Liu

TL;DR

This work tackles zero-shot visual question answering by introducing Mixture of Rationales (MoR), a method that uses a single frozen Vision-and-Language Pre-trained Model to generate, retrieve, and fuse multiple multimodal rationales. By generating diverse rationales from triggering prompts and performing in-model encoding, dynamic retrieval, and Fusion-in-Decoder, MoR achieves substantial gains on NLVR2 and OKVQA-S, demonstrating improved multi-modal reasoning without fine-tuning. The approach reveals that diversity in rationales, effective retrieval strategies, and FiD-based fusion collectively enhance zero-shot VQA performance, with notable improvements across backbone models. The findings suggest practical impact for deploying robust multimodal reasoning systems using a unified VLPM backbone while leveraging rationale diversity and retrieval-based fusion.

Abstract

Zero-shot visual question answering (VQA) is a challenging task that requires reasoning across modalities. Existing methods that rely on a single rationale within the Chain of Thoughts (CoT) framework may fall short of capturing the complexity of the VQA problem, while methods that use multiple rationales may still suffer from low diversity, poor modality alignment, and inefficient retrieval and fusion. In response to these challenges, we propose Mixture of Rationales (MoR), a novel multi-modal reasoning method that mixes multiple rationales for VQA. MoR uses a single frozen Vision-and-Language Pre-trained Model (VLPM) to dynamically generate, retrieve, and fuse multi-modal thoughts. We evaluate MoR on two challenging VQA datasets, NLVR2 and OKVQA, with two representative backbones, OFA and VL-T5. MoR achieves a 12.43% accuracy improvement on NLVR2 and a 2.45% accuracy improvement on OKVQA-S (the science and technology category of OKVQA).

Paper Structure (29 sections, 7 figures, 6 tables)

Figures (7)

  • Figure 1: A typical zero-shot visual question answering problem requires generating, retrieving and fusing multi-modal thoughts.
  • Figure 2: Diagram of MoR, which is based on any frozen encoder-decoder VLPM. The first module produces rationales from the input of triggering prompts, questions, and image(s). The second module performs encoding, retrieval and fusion in one pass.
  • Figure 3: Diversity of rationales for the NLVR2 task. The figure shows the average cosine similarity between each pair of rationales. Similarities are computed using OpenAI’s text-embedding-ada-002 model [openai2022textembeddingada]. Two clusters show high intra-cluster similarity (lighter color) and low inter-cluster similarity (darker color), indicating diversity. The index of the rationales can be found in the section labeled sec:rationale_index.
  • Figure 4: Similarities between problems and rationales. The blue line shows how similar each rationale is to the problem, as measured by cosine similarity using the text-embedding-ada-002 model. The red line indicates the similarity between the problem and the average of all previous rationales. The rationales are numbered in the section labeled sec:rationale_index.
  • Figure 5: The effect of varying the number of dynamically retrieved thoughts on our model’s performance on the OKVQA-S dataset.
  • ...and 2 more figures
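
The similarity measurements described in the captions for Figures 3 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the toy vectors stand in for text-embedding-ada-002 embeddings, and "the average of all previous rationales" is assumed here to mean the running mean of rationales 1 through i, inclusive.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `embeddings` (Figure 3 style)."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms  # scale each row to unit length
    return unit @ unit.T       # dot products of unit vectors are cosines


def similarity_to_running_mean(problem: np.ndarray, rationales: np.ndarray) -> np.ndarray:
    """Cosine similarity between `problem` and the mean of the first i rationales.

    Assumption: the running mean includes rationale i itself; the caption
    does not say whether the current rationale is included.
    """
    counts = np.arange(1, len(rationales) + 1)[:, None]
    running_mean = np.cumsum(rationales, axis=0) / counts
    dots = running_mean @ problem
    norms = np.linalg.norm(running_mean, axis=1) * np.linalg.norm(problem)
    return dots / norms


# Toy stand-ins for rationale embeddings: two loose clusters (hypothetical data).
toy_rationales = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
])
toy_problem = np.array([1.0, 0.0, 0.0])

pairwise = cosine_similarity_matrix(toy_rationales)
to_problem = similarity_to_running_mean(toy_problem, toy_rationales)
```

With real embeddings, high values inside a block of `pairwise` and low values across blocks would correspond to the cluster pattern the Figure 3 caption describes, while `to_problem` corresponds to the red line in Figure 4 under the inclusive-mean assumption.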