Answer-Consistent Chain-of-thought Reinforcement Learning For Multi-modal Large Language Models

Minbin Huang, Runhui Huang, Chuanyang Zheng, Jingyao Li, Guoxuan Chen, Han Shi, Hong Cheng

TL;DR

This work identifies a critical shortcoming of outcome-based reinforcement learning in multimodal LLMs: the reasoning trace can become misaligned with the final answer. It introduces Answer-Consistent Reinforcement Learning (ACRE), which adds a consistency-verification signal by re-prompting with shuffled options to reward coherent, robust reasoning. Across five benchmarks spanning video and multimodal math reasoning, ACRE yields consistent gains over GRPO in both accuracy and reasoning–answer alignment, as evidenced by higher CACR and OSCR. The approach enhances trustworthiness and robustness of multimodal reasoning while maintaining data efficiency, offering a practical path toward more reliable AI systems.

Abstract

Recent advances in large language models (LLMs) have demonstrated that reinforcement learning with verifiable rewards (RLVR) can significantly enhance reasoning abilities by directly optimizing correctness, rather than relying solely on supervised imitation. This paradigm has been extended to multimodal LLMs for complex video and image understanding tasks. However, while outcome-driven RL improves answer accuracy, it can inadvertently decouple the reasoning chain from the final answer, so that the reasoning trace and the final answer contradict each other. In our experiments on multiple-choice visual question-answering tasks, the standard GRPO method yields only 79.7% consistency between the reasoning steps and the chosen answers on MMVU, indicating frequent mismatches between answers and reasoning. To address this, we propose Answer-Consistent Reinforcement Learning (ACRE), which modifies the GRPO algorithm with an auxiliary consistency check. After the model generates a chain of thought and an initial answer for a given question, we shuffle the answer options and prompt the model again with the same reasoning trace to predict a second answer. We design a consistency-verification reward that grants a high reward only if the original and the post-shuffle answers agree and are both correct; otherwise, a lower reward is assigned. This mechanism penalizes reasoning-answer misalignment and discourages the model from relying on spurious patterns, such as option-ordering biases. We evaluate ACRE on challenging video reasoning benchmarks and multimodal math reasoning benchmarks, achieving average improvements of 2.2% and 1.5% over the GRPO baseline on video reasoning and math reasoning tasks, respectively.
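The consistency check described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the index-based answer encoding, and the reward values (1.0 / 0.5 / 0.0) are assumptions made here for illustration, since the abstract only states that a high reward requires both answers to agree and be correct, and that a lower reward is assigned otherwise.

```python
import random
from typing import List, Sequence, Tuple


def shuffle_options(options: Sequence[str], rng: random.Random) -> Tuple[List[str], List[int]]:
    """Shuffle answer options; order[j] is the original index of shuffled option j."""
    order = list(range(len(options)))
    rng.shuffle(order)
    return [options[i] for i in order], order


def consistency_reward(
    first_choice: int,    # index chosen in the original option order
    second_choice: int,   # index chosen in the shuffled option order
    correct_idx: int,     # ground-truth index in the original option order
    order: Sequence[int],
    high: float = 1.0,     # hypothetical value: both answers agree and are correct
    partial: float = 0.5,  # hypothetical value: first answer correct, second disagrees
    low: float = 0.0,      # hypothetical value: first answer wrong
) -> float:
    """Return the high reward only when both answers pick the same, correct option."""
    # Map the post-shuffle choice back to the original option index.
    second_in_original = order[second_choice]
    first_correct = first_choice == correct_idx
    agree = second_in_original == first_choice
    if first_correct and agree:
        return high
    if first_correct:
        return partial
    return low


if __name__ == "__main__":
    options = ["cat", "dog", "bird", "fish"]
    shuffled, order = shuffle_options(options, random.Random(0))
    # Suppose the model picked option 0 first, then picked the same option text
    # after the shuffle; order.index(0) is where that option now sits.
    print(consistency_reward(0, order.index(0), correct_idx=0, order=order))
```

Comparing answers by their original index, rather than by letter, is what makes the check insensitive to option position: a model that always answers "A" agrees with itself only when the shuffle happens to leave its chosen option in place.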

Paper Structure

This paper contains 24 sections, 9 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Reasoning-answer inconsistency of GRPO models. Red denotes correct answer or reasoning trace and orange denotes flawed answer or reasoning trace. The top example shows correct reasoning with a wrong answer, while the bottom example shows wrong reasoning with a correct answer.
  • Figure 2: Overview of our proposed ACRE. Given a multi-modal input, the MLLM first generates a reasoning trace and a final answer (top path). We then feed the same reasoning trace back to the MLLM along with an auxiliary query where the answer options are shuffled (bottom path). The consistency between the final answers from both paths serves as a reward signal for reinforcement learning, encouraging the model to generate reasoning that is logically sound and independent of option positioning.
  • Figure 3: Attention visualization comparison between GRPO and ACRE.
  • Figure 4: Visualizations of GRPO and ACRE. Red denotes correct answer or reasoning trace and orange denotes flawed answer or reasoning trace.