Bridging Efficiency and Transparency: Explainable CoT Compression in Multimodal Large Reasoning Models

Yizhi Wang; Linan Yue; Min-Ling Zhang

TL;DR

The paper tackles the inefficiency and opacity of long multimodal CoTs by introducing XMCC, an explainable CoT compressor trained with reinforcement learning (GRPO). It formulates compression as a sequential decision process and optimizes a four-component reward to shorten CoTs while preserving visual grounding and providing natural-language explanations. The pipeline has three stages: synthesizing diverse long CoTs, training the compressor with RL, and supervised fine-tuning on the compressed CoTs to produce efficient reasoning with preserved accuracy. Across multiple multimodal benchmarks and a dedicated XMCC-Dataset, XMCC achieves substantial CoT length reduction, strong task performance, and improved visual grounding and explanation quality, demonstrating practical potential for faster and more transparent multimodal reasoning systems.
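As a rough illustration of the GRPO training signal mentioned above, the sketch below computes group-relative advantages: each sampled output's reward is normalized against the mean and standard deviation of its own group. The function name, the epsilon term, the use of the population standard deviation, and the reward values are assumptions made for illustration; the paper's four-component reward is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in the same group.

    `rewards` holds one scalar per sampled compression of the same input.
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled compressions of one long CoT.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Samples with above-average reward receive a positive advantage and are reinforced, and those below average are discouraged, which is what lets a GRPO-style method avoid a separate value model.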

Abstract

Long chains of thought (long CoTs) are widely employed in multimodal reasoning models to tackle complex tasks by capturing detailed visual information. However, these long CoTs are often excessively lengthy and contain redundant reasoning steps, which can hinder inference efficiency. Compressing long CoTs is a natural solution, yet existing approaches face two major challenges: (1) they may compromise the integrity of visual-textual reasoning by removing essential alignment cues, and (2) the compression process lacks explainability, making it difficult to discern which information is critical. To address these problems, we propose XMCC, an eXplainable Multimodal CoT Compressor that formulates compression as a sequential decision-making process optimized via reinforcement learning. XMCC effectively shortens reasoning trajectories while preserving key reasoning steps and answer correctness, and simultaneously generates natural-language explanations for its compression decisions. Extensive experiments on representative multimodal reasoning benchmarks demonstrate that XMCC not only reduces reasoning length but also explains its compression decisions in natural language, validating its effectiveness.
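To make the sequential decision view concrete, the sketch below treats a long CoT as a list of steps and applies one keep-or-skip decision per step, with each decision carrying a short explanation. This is only a minimal sketch under assumptions: the `Decision` class, the helper names, the word-count length measure, and most of the example steps are hypothetical, and the decisions are hard-coded rather than produced by a trained compressor. Only the "[SKIP]" marker and the bamboo-pole step are taken from the Figure 1 caption.

```python
from dataclasses import dataclass

SKIP = "[SKIP]"  # marker for a deleted step, as shown in Figure 1


@dataclass
class Decision:
    keep: bool
    explanation: str  # natural-language reason for the decision


def apply_decisions(steps, decisions):
    """Replace every step whose decision is 'skip' with the SKIP marker."""
    if len(steps) != len(decisions):
        raise ValueError("need exactly one decision per reasoning step")
    return [s if d.keep else SKIP for s, d in zip(steps, decisions)]


def kept_fraction(steps, decisions):
    """Fraction of whitespace-separated words that survive compression."""
    total = sum(len(s.split()) for s in steps)
    kept = sum(len(s.split()) for s, d in zip(steps, decisions) if d.keep)
    return kept / total if total else 1.0


# Hypothetical long CoT; the first step mirrors the Figure 1 example.
steps = [
    "Height of bamboo pole = h_1",
    "Let me double check the picture again",
    "So the two triangles are similar",
]
decisions = [
    Decision(True, "Defines a variable grounded in the image."),
    Decision(False, "Repeats an earlier observation without new information."),
    Decision(True, "Needed to reach the final answer."),
]

print(apply_decisions(steps, decisions))
print(round(kept_fraction(steps, decisions), 3))
```

Keeping the explanation next to each decision is one simple way to make the compressor's output inspectable, for example to check that visually grounded definitions were retained.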

Paper Structure (24 sections, 7 equations, 7 figures, 5 tables)

Introduction
Related Work
Multimodal Reasoning
CoT Compression
Explainable Multimodal CoT Compressor
Overview of the Compressor
Long CoT Data Synthesis
Explainable Compressor Training
Reward Design
RL Training with GRPO
SFT for Efficient Reasoning
Experiments
Experiment Settings
Main Results
Evaluation of Visual Information Preservation
...and 9 more sections

Figures (7)

Figure 1: Differences between existing text-based CoT compression methods and XMCC. (a) shows the compressed CoT produced by a text-based compression method, while (b) shows the result from XMCC. In (a), each "[SKIP]" represents a deleted step. It can be observed that the text-based method erroneously removes critical visually grounded information that defines variable meanings (e.g., "Height of bamboo pole = h_1"). In contrast, XMCC preserves these critical alignment cues.
Figure 2: Overview of XMCC. (a) The framework consists of three stages: (I) synthesizing diverse long CoTs from heterogeneous MLRMs; (II) training an explainable compressor via RL with the proposed reward function; and (III) SFT on compressed CoTs for efficient inference. (b) In the proposed reward function, the step-wise criticality reward evaluates each segment's contribution to task performance to ensure the quality of the compressed reasoning, while the length reward adapts compression intensity to task complexity.
Figure 3: Analysis of input CoT quantity. From left to right: model accuracy, average reasoning length, and the accuracy-to-length ratio as functions of the number of input CoTs. As shown, increasing the number of input CoTs improves both task performance and efficiency.
Figure 4: Case Study on SFT Models. Text in the box at the lower-left corner is generated by the model fine-tuned on XMCC data, while text in the right box is generated by the model fine-tuned on uncompressed CoTs.
Figure 5: Case Study on SFT Models. Text in the box at the lower-left corner is generated by the model fine-tuned on XMCC data, while text in the right box is generated by the model fine-tuned on uncompressed CoTs.
...and 2 more figures

Theorems & Definitions (2)

Remark 3.1
Remark 3.2
