
Efficient Reasoning via Thought Compression for Language Segmentation

Qing Zhou, Shiyu Zhang, Yuyu Jia, Junyu Gao, Weiping Ni, Junzheng Wu, Qi Wang

Abstract

Chain-of-thought (CoT) reasoning has significantly improved the performance of large multimodal models in language-guided segmentation, yet its prohibitive computational cost, stemming from generating verbose rationales, limits real-world applicability. We introduce WISE (Wisdom from Internal Self-Exploration), a novel paradigm for efficient reasoning guided by the principle of \textit{thinking twice -- once for learning, once for speed}. WISE trains a model to generate a structured sequence: a concise rationale, the final answer, and then a detailed explanation. By placing the concise rationale first, our method leverages autoregressive conditioning to enforce that the concise rationale acts as a sufficient summary for generating the detailed explanation. This structure is reinforced by a self-distillation objective that jointly rewards semantic fidelity and conciseness, compelling the model to internalize its detailed reasoning into a compact form. At inference, the detailed explanation is omitted. To address the resulting conditional distribution shift, our inference strategy, WISE-S, employs a simple prompting technique that injects a brevity-focused instruction into the user's query. This final adjustment facilitates the robust activation of the learned concise policy, unlocking the full benefits of our framework. Extensive experiments show that WISE-S achieves state-of-the-art zero-shot performance on the ReasonSeg benchmark with 58.3 cIoU, while reducing the average reasoning length by nearly \textbf{5$\times$} -- from 112 to just 23 tokens. Code is available at \href{https://github.com/mrazhou/WISE}{WISE}.
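
To make the training/inference asymmetry concrete, the sketch below shows how the structured sequence described above could be assembled and how the inference prompt differs from the training target. It is a minimal illustration under stated assumptions, not the released implementation: the tag names, the helper functions, and the wording of the brevity instruction are all hypothetical and introduced here only for clarity.

```python
# Minimal sketch of WISE's structured output (assumed tag names, not the
# released implementation).  Training target: concise rationale -> answer
# -> detailed explanation.  At inference the explanation is omitted and a
# brevity instruction is appended to the user's query (WISE-S).

def build_training_target(concise: str, answer: str, detailed: str) -> str:
    """Order matters: the concise rationale comes first, so autoregressive
    conditioning forces it to act as a sufficient summary for generating
    the detailed explanation that follows."""
    return (
        f"<think>{concise}</think>"
        f"<answer>{answer}</answer>"
        f"<explain>{detailed}</explain>"
    )

# Hypothetical brevity instruction injected by WISE-S; the actual wording
# used in the paper is not reproduced here.
BREVITY_PROMPT = "Think briefly, in one or two short sentences, before answering."

def build_inference_prompt(user_query: str, use_wise_s: bool = True) -> str:
    """At test time the model is only expected to emit <think> and <answer>;
    the <explain> segment from training is never generated."""
    return f"{user_query} {BREVITY_PROMPT}" if use_wise_s else user_query
```

Under this scheme, only the concise rationale and the answer are decoded at test time, which is what yields the reported drop from 112 to roughly 23 reasoning tokens on ReasonSeg.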


Paper Structure

This paper contains 17 sections, 8 equations, 7 figures, and 9 tables.

Figures (7)

  • Figure 1: WISE achieves a superior cost-performance trade-off by decoupling reasoning for learning and inference. Our framework trains the model with detailed explanations but uses a distilled, concise rationale at inference. The base model, WISE, reduces reasoning cost by an average of 4.7$\times$ while outperforming the baseline. Our final model, WISE-S, applies an additional inference-time prompt, achieving even greater compression (up to 10.6$\times$ on RefCOCO) and the highest overall performance. On ReasonSeg, WISE-S simultaneously improves performance by 3.9 cIoU and cuts token cost by 4.9$\times$.
  • Figure 2: Overview of the WISE framework. During training (orange), the reasoning model $\mathcal{F}_{\text{reason}}$ generates a structured sequence: concise rationale ($\tau_c$), answer ($A$), and detailed explanation ($\tau_d$). The process is optimized via GRPO using a hierarchical reward, notably including a self-distillation term ($\mathcal{R}_{\text{distill}}$) that aligns $\tau_c$ with $\tau_d$ while enforcing brevity. The segmentation model $\mathcal{F}_{\text{seg}}$ remains frozen. At inference (blue), $\tau_d$ is omitted and a brevity prompt ($T_b$) is added to ensure highly efficient reasoning. (A hedged sketch of this distillation reward follows the figure list.)
  • Figure 3: Prompting Strategy for WISE. The figure shows the basic instruction template and the specific components for generating concise ($T_c$) and detailed ($T_d$) rationales. Different combinations of these components are used for the training phase, standard WISE inference, and the brevity-focused WISE-S inference.
  • Figure 4: Distribution of Reasoning Token Length. The KDE plot illustrates the dramatic reduction in both length and variance. WISE-S (green) demonstrates a tightly concentrated preference for conciseness compared to the long-tailed Seg-Zero baseline (purple).
  • Figure 5: Qualitative Comparison. Given a complex instruction, Seg-Zero produces a convoluted, 132-token rationale and fails. In contrast, WISE-S generates a concise, correct reasoning chain, successfully identifying the target with a fraction of the cost.
  • ...and 2 more figures
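
The self-distillation term $\mathcal{R}_{\text{distill}}$ described in the Figure 2 caption rewards the concise rationale $\tau_c$ for staying semantically faithful to the detailed explanation $\tau_d$ while remaining short. The snippet below is a hedged sketch of one way such a reward could be computed; the cosine-similarity fidelity measure, the length budget, and the weighting coefficient are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np

def distill_reward(
    concise_emb: np.ndarray,   # embedding of the concise rationale tau_c
    detailed_emb: np.ndarray,  # embedding of the detailed explanation tau_d
    concise_len: int,          # token length of tau_c
    length_budget: int = 32,   # assumed budget, not taken from the paper
    alpha: float = 0.5,        # assumed fidelity/brevity weight, illustrative
) -> float:
    """Toy self-distillation reward: semantic fidelity plus conciseness.

    Fidelity is approximated by cosine similarity between the embeddings of
    the two rationales; the brevity term decays linearly once tau_c exceeds
    the budget.  Both choices are illustrative assumptions.
    """
    fidelity = float(
        np.dot(concise_emb, detailed_emb)
        / (np.linalg.norm(concise_emb) * np.linalg.norm(detailed_emb) + 1e-8)
    )
    brevity = max(0.0, 1.0 - max(0, concise_len - length_budget) / length_budget)
    return alpha * fidelity + (1.0 - alpha) * brevity
```

In the full framework this term would be only one component of the hierarchical reward optimized with GRPO, with the segmentation model $\mathcal{F}_{\text{seg}}$ kept frozen throughout.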