Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought

Shin'ya Yamaguchi, Kosuke Nishida, Daiki Chijiwa

Abstract

Large vision-language models (LVLMs) have demonstrated remarkable capabilities by integrating pre-trained vision encoders with large language models (LLMs). Similar to single-modal LLMs, chain-of-thought (CoT) prompting has been adapted for LVLMs to enhance multi-modal reasoning by generating intermediate rationales based on visual and textual inputs. While CoT is assumed to improve grounding and accuracy in LVLMs, our experiments reveal a key challenge: existing LVLMs often ignore the contents of generated rationales in CoT reasoning. To address this, we re-formulate multi-modal CoT reasoning as a KL-constrained reward maximization focused on rationale-conditional log-likelihood. As the optimal solution, we propose rationale-enhanced decoding (RED), a novel plug-and-play inference-time decoding strategy. RED harmonizes visual and rationale information by multiplying distinct image-conditional and rationale-conditional next token distributions. Extensive experiments show that RED consistently and significantly improves reasoning over standard CoT and other decoding methods across multiple benchmarks and LVLMs. Our work offers a practical and effective approach to improve both the faithfulness and accuracy of CoT reasoning in LVLMs, paving the way for more reliable rationale-grounded multi-modal systems. Code is available at https://github.com/yshinya6/red/.
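
The multiplication of image-conditional and rationale-conditional next-token distributions described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy logits, and the exponent `lam` (assumed here to correspond to the $\lambda$ shown in Figure 5) are all hypothetical. It assumes two forward passes have already produced next-token logits, one conditioned on the image and query, the other on the rationale and query.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a single vocabulary vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def red_next_token_probs(
    image_logits: Sequence[float],
    rationale_logits: Sequence[float],
    lam: float = 1.0,  # assumed weighting of the rationale term (hypothetical name)
) -> List[float]:
    """Combine two next-token distributions by a weighted product.

    Computes p(y) proportional to p_image(y) * p_rationale(y) ** lam,
    working in log space and renormalizing at the end.
    """
    if len(image_logits) != len(rationale_logits):
        raise ValueError("both logit vectors must cover the same vocabulary")
    log_p_image = log_softmax(image_logits)
    log_p_rationale = log_softmax(rationale_logits)
    combined = [a + lam * b for a, b in zip(log_p_image, log_p_rationale)]
    return [math.exp(v) for v in log_softmax(combined)]


# Toy 3-token vocabulary: the image-conditional pass prefers token 0,
# while the rationale-conditional pass strongly prefers token 1.
image_logits = [2.0, 1.0, 0.0]
rationale_logits = [0.0, 3.0, 0.0]
probs = red_next_token_probs(image_logits, rationale_logits, lam=1.0)
best_token = probs.index(max(probs))  # token 1: the rationale changes the choice
```

With `lam=0.0` the sketch reduces to the image-conditional distribution alone, which makes the rationale term easy to ablate in this toy setting.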

Paper Structure

This paper contains 28 sections, 1 theorem, 12 equations, 6 figures, 7 tables, and 1 algorithm.

Key Result

Theorem 4.1

Let the reference policy $\pi_\mathrm{ref}$ be $p_\theta(y_i|\bm{y}_{<i},x,q)$, and the reward function $R(\cdot)$ be $\log p_\theta(y_i|\bm{y}_{<i}, r, q)$. Sampling by Eq. (eq:red) is equivalent to sampling from the optimal policy $\pi^*$ for Eq. (eq:red_objective).
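
A sketch of why a product of the two distributions arises, assuming the objective takes the standard KL-regularized form with a coefficient $\beta > 0$. The objective itself is not reproduced on this page, so $\beta$, the normalizer $Z$, and the exact weighting are assumptions rather than the paper's stated formulation:

```latex
\begin{align*}
\pi^* &= \arg\max_{\pi}\; \mathbb{E}_{y_i \sim \pi}\big[R(y_i)\big]
         - \beta\, \mathrm{KL}\big(\pi \,\|\, \pi_\mathrm{ref}\big) \\
\pi^*(y_i) &= \frac{1}{Z}\, \pi_\mathrm{ref}(y_i)\, \exp\!\big(R(y_i)/\beta\big),
  \qquad Z = \sum_{y} \pi_\mathrm{ref}(y)\, \exp\!\big(R(y)/\beta\big) \\
&= \frac{1}{Z}\, p_\theta(y_i \mid \bm{y}_{<i}, x, q)\;
   p_\theta(y_i \mid \bm{y}_{<i}, r, q)^{1/\beta}
\end{align*}
```

The second line is the well-known closed-form maximizer of a KL-regularized expected reward; the third substitutes the theorem's choices of $\pi_\mathrm{ref}$ and $R$. Under this reading, the optimal policy is the image-conditional distribution multiplied by the rationale-conditional distribution raised to the power $1/\beta$, then renormalized.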

Figures (6)

  • Figure 1: Rationale-Enhanced Decoding (RED). Existing multi-modal chain-of-thought (CoT) prompting with large vision-language models (LVLMs) is a two-step process: generate a rationale, then generate the final output. The final output generation often focuses on the input image and overlooks the intermediate rationale. Our rationale-enhanced decoding (RED) addresses this issue by decoupling the image and rationale in decoding and combining them at the logit level, which provably grounds outputs in the rationale.
  • Figure 2: Percentage of attention contributions by input token types for different decoding strategies (Gemma-3-12B). Rationale tokens contribute less to outputs than image tokens.
  • Figure 3: GQA accuracy as LVLM parameter count increases (Baseline, CCoT, and CCoT + RED for the Gemma-3 and Qwen2.5-VL families). RED consistently improves over Baseline and CCoT for every model, and unlocks further performance scaling with model size.
  • Figure 4: Qualitative examples of CoT reasoning on GQA (Gemma-3-12B). For each example, the input image (left), the query [Q], the generated rationale [R] (text for CoT and JSON for CCoT), and the answers from Baseline, CoT/CCoT, and our method (RED) [A] are shown. RED successfully leverages the rationale to produce the correct answer.
  • Figure 5: Effects of $\lambda$ in RED on the GQA Validation Set.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Theorem 4.1
  • proof