Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training

Cheng Tan; Jingxuan Wei; Zhangyang Gao; Linzhuang Sun; Siyuan Li; Ruifeng Guo; Bihui Yu; Stan Z. Li

TL;DR

The paper tackles a bottleneck in multimodal reasoning: low-quality generated rationales. It introduces MC-CoT, a training-time self-consistency approach that uses dropout to sample multiple rationales and answers, selecting rationales by token-level voting and answers by majority voting while leaving inference unchanged. The authors give theoretical support, showing that aggregation reduces expected loss and that a mean-plus-weighted logits fusion helps balance bias and variance. Empirically, MC-CoT improves over Multimodal-CoT on ScienceQA and A-OKVQA, and small models match or surpass larger baselines, highlighting both effectiveness and efficiency in vision–language reasoning.

Abstract

Multimodal reasoning is a challenging task that requires models to reason across multiple modalities to answer questions. Existing approaches have made progress by incorporating language and visual modalities into a two-stage reasoning framework, separating rationale generation from answer inference. However, these approaches often fall short due to the inadequate quality of the generated rationales. In this work, we delve into the importance of rationales in model reasoning. We observe that when rationales are completely accurate, the model's accuracy significantly improves, highlighting the need for high-quality rationale generation. Motivated by this, we propose MC-CoT, a self-consistency training strategy that generates multiple rationales and answers, subsequently selecting the most accurate through a voting process. This approach not only enhances the quality of generated rationales but also leads to more accurate and robust answers. Through extensive experiments, we demonstrate that our approach significantly improves model performance across various benchmarks. Remarkably, we show that even smaller base models, when equipped with our proposed approach, can achieve results comparable to those of larger models, illustrating the potential of our approach in harnessing the power of rationales for improved multimodal reasoning. The code is available at https://github.com/chengtan9907/mc-cot.
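The voting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the sampled rationales are already decoded into token lists and can be compared position by position, whereas the paper votes during training over predictions obtained under different dropout masks. The function names `vote_rationale` and `vote_answer` are hypothetical.

```python
from collections import Counter
from itertools import zip_longest

PAD = None  # filler for samples shorter than the longest one (assumption)


def vote_rationale(samples):
    """Word-level vote: keep the most frequent token at each position.

    Stops at the first position where padding wins the vote, i.e. where
    most samples have already ended.
    """
    voted = []
    for column in zip_longest(*samples, fillvalue=PAD):
        token, _count = Counter(column).most_common(1)[0]
        if token is PAD:
            break
        voted.append(token)
    return voted


def vote_answer(answers):
    """Majority vote over the sampled answers."""
    return Counter(answers).most_common(1)[0][0]


# Toy samples standing in for stochastic (dropout-perturbed) predictions.
rationale_samples = [
    ["the", "magnet", "attracts", "iron"],
    ["the", "magnet", "repels", "iron"],
    ["the", "magnet", "attracts", "iron", "nails"],
]
answer_samples = ["B", "A", "B"]

print(vote_rationale(rationale_samples))
print(vote_answer(answer_samples))
```

In this toy run the single dissenting token ("repels") and the single dissenting answer ("A") are outvoted, which is the intuition behind using consistency across samples to clean up rationales and answers. Ties are broken by first occurrence here, a choice made only for this sketch.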

Paper Structure (27 sections, 12 equations, 7 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Chain-of-Thought
Multimodal Visual Question Answering
Method
Preliminaries
Multimodal-CoT
Multimodal Consistent Chain-of-Thought
Rationale Generation
Answer Inference
Voting Strategy
Theoretical Insights
Aggregation Minimizes Expected Loss
Bias-Variance Trade-off
Experiment
...and 12 more sections

Figures (7)

Figure 1: An example of multimodal reasoning that answers the question by reasoning across both vision and language modalities.
Figure 2: Comparison of answer accuracy on ScienceQA using the Multimodal-CoT framework with no rationale, predicted rationales, and ground-truth rationales.
Figure 3: A schematic comparison of Chain-of-Thought (CoT) prompt-based reasoning methods. (a) The basic input-output prompt, (b) Chain-of-Thought with intermediate chain-like reasoning, (c) Chain-of-Thought Self-Consistency (CoT-SC), which uses the consistency of multiple independent chains of thought for reasoning, (d) Multimodal-CoT, which infers the rationale from the input text and image and then predicts the answer using the rationale as part of the input, and (e) MC-CoT, which infers a high-quality rationale through word-level voting and then obtains a high-quality answer by majority vote. Note that our approach leverages multiple-chain consistency only during the training phase, in contrast to CoT-SC, which employs it at inference time.
Figure 4: The relationship between rationale and answer.
Figure 5: Comparison on predicted examples.
...and 2 more figures
