Bridging Efficiency and Transparency: Explainable CoT Compression in Multimodal Large Reasoning Models

Yizhi Wang, Linan Yue, Min-Ling Zhang

TL;DR

The paper tackles inefficiency and opacity in long multimodal CoTs by introducing XMCC, an explainable CoT compressor trained with reinforcement learning (GRPO). It formulates compression as a sequential decision process, optimizing a four-component reward to shorten CoTs while preserving visual grounding and providing natural-language explanations. XMCC synthesizes diverse long CoTs, trains on a multi-stage pipeline, and then applies supervised fine-tuning to produce efficient reasoning with preserved accuracy. Across multiple multimodal benchmarks and a dedicated XMCC-Dataset, XMCC achieves substantial CoT length reduction, strong task performance, and improved visual grounding and explanation quality, demonstrating practical potential for faster and more transparent multimodal reasoning systems.

Abstract

Long chains of thought (long CoTs) are widely employed in multimodal reasoning models to tackle complex tasks by capturing detailed visual information. However, these long CoTs are often excessively lengthy and contain redundant reasoning steps, which can hinder inference efficiency. Compressing these long CoTs is a natural solution, yet existing approaches face two major challenges: (1) they may compromise the integrity of visual-textual reasoning by removing essential alignment cues, and (2) the compression process lacks explainability, making it difficult to discern which information is critical. To address these problems, we propose XMCC, an eXplainable Multimodal CoT Compressor that formulates compression as a sequential decision-making process optimized via reinforcement learning. XMCC can effectively shorten reasoning trajectories while preserving key reasoning steps and answer correctness, and simultaneously generates natural-language explanations for its compression decisions. Extensive experiments on representative multimodal reasoning benchmarks demonstrate that XMCC not only reduces reasoning length but also provides interpretable explanations of its compression decisions, validating its effectiveness.
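
The TL;DR names GRPO as the reinforcement learning algorithm. As a rough illustration of the group-relative idea behind GRPO (not the paper's training code), the sketch below normalizes the scalar rewards of several candidate compressions sampled for the same input against their own group mean and standard deviation. The function name, the use of the population standard deviation, and the `eps` term are assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampling group (GRPO-style sketch).

    `rewards` holds one scalar reward per candidate compression sampled
    for the same input. The name and `eps` are hypothetical choices for
    this example, not taken from the paper.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (assumption)
    # Candidates above the group mean get positive advantages,
    # candidates below it get negative ones.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy rewards for three sampled compressions of the same CoT.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Because the advantages are relative to the group, a candidate is only reinforced when it scores better than the other compressions sampled for the same input.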

Paper Structure (24 sections, 7 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Differences between existing text-based CoT compression methods and XMCC. (a) shows the compressed CoT produced by a text-based compression method, while (b) shows the result from XMCC. In (a), each "[SKIP]" represents a deleted step. It can be observed that the text-based method erroneously removes critical visually grounded information that defines variable meanings (e.g., "Height of bamboo pole = h_1"). In contrast, XMCC preserves these critical alignment cues.
  • Figure 2: Overview of XMCC. (a) The framework consists of three stages: (I) synthesizing diverse long CoTs from heterogeneous MLRMs; (II) training an explainable compressor via RL with the proposed reward function; and (III) SFT on compressed CoTs for efficient inference. (b) In the proposed reward function, the step-wise criticality reward evaluates each segment's contribution to task performance to ensure the quality of the compressed reasoning, while the length reward adapts compression intensity to task complexity.
  • Figure 3: Analysis of input CoT quantity. From left to right: model accuracy, average reasoning length, and the accuracy-to-length ratio as functions of the number of input CoTs. As shown, increasing the number of input CoTs improves both task performance and efficiency.
  • Figure 4: Case Study on SFT Models. Text in the box at the lower left corner is generated by the model fine-tuned on XMCC data, while text in the right box is generated by the model fine-tuned on uncompressed CoTs.
  • Figure 5: Case Study on SFT Models. Text in the box at the lower left corner is generated by the model fine-tuned on XMCC data, while text in the right box is generated by the model fine-tuned on uncompressed CoTs.
  • ...and 2 more figures
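
The Figure 1 caption describes deleted steps being shown as "[SKIP]". A minimal sketch of that rendering, treating compression as one keep-or-skip decision per reasoning step, might look like the following. The helper names and the toy steps are hypothetical, apart from "Height of bamboo pole = h_1", which is quoted from the caption. How XMCC actually chooses the mask is not shown here.

```python
def apply_keep_mask(steps, keep_mask, placeholder="[SKIP]"):
    """Replace every step whose mask entry is False with a placeholder."""
    if len(steps) != len(keep_mask):
        raise ValueError("steps and keep_mask must have the same length")
    return [step if keep else placeholder
            for step, keep in zip(steps, keep_mask)]


def kept_fraction(keep_mask):
    """Fraction of reasoning steps that survive compression."""
    if not keep_mask:
        raise ValueError("keep_mask must be non-empty")
    return sum(keep_mask) / len(keep_mask)


# Toy CoT: only the first step comes from the Figure 1 caption;
# the other two are invented filler for illustration.
steps = [
    "Height of bamboo pole = h_1",
    "Let me re-read the question.",
    "So the answer follows from h_1.",
]
keep_mask = [True, False, True]  # hypothetical decisions, one per step

compressed = apply_keep_mask(steps, keep_mask)
ratio = kept_fraction(keep_mask)
```

In this toy mask the visually grounded definition of h_1 is kept, which is the kind of alignment cue the caption says a text-only compressor may wrongly delete.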

Theorems & Definitions (2)

  • Remark 3.1
  • Remark 3.2