D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models

Shunsuke Ubukata

TL;DR

Disciplined Chain-of-Thought (D-CoT) is a framework that enforces a structured reasoning process during training, using control tags (for example, for fact-checking and multi-perspective exploration) as auxiliary scaffolding so that the model internalizes this disciplined thought structure.

Abstract

Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) often induces "overthinking" in Small Language Models (SLMs), leading to performance degradation and excessive token consumption. In this study, we propose Disciplined Chain-of-Thought (D-CoT), a novel framework that enforces a structured reasoning process using control tags -- such as <TEMP_LOW> for fact-checking and <TEMP_HIGH> for multi-perspective exploration -- as auxiliary scaffolding during training. By optimizing the CoT trajectory, D-CoT suppresses reasoning drift and simultaneously reduces token usage and improves performance. We demonstrate the efficacy of our approach on Qwen3-8B: with only 5,000 training samples, D-CoT boosts accuracy on GPQA-Diamond by 9.9% and on MMLU-Pro (0-shot) by 9.1%, while drastically reducing computational costs. Furthermore, we confirm that the model internalizes this disciplined thought structure, maintaining high performance even without explicit control tags during inference.

Paper Structure (29 sections, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Comparison of reasoning traces on MMLU-Pro #238 (promissory note discount problem, Qwen3-8B). (a) Conventional distilled CoT exhibits overthinking: the model reaches the correct answer but continues recalculating, ultimately exhausting the token budget and producing an incorrect prediction. (b) D-CoT with explicit control tags structures reasoning into fact organization (<TEMP_LOW>), multi-perspective exploration (<TEMP_HIGH>), and algorithmic computation (<TEMP_MID>), converging efficiently. (c) After D-CoT training, the model internalizes this disciplined structure and maintains organized reasoning without explicit tags, achieving the best accuracy on MMLU-Pro (64.73%).
  • Figure 2: Cosine similarity distributions between D-CoT training samples and benchmark test sets. Left: MMLU-Pro (102 samples removed above the 0.55 threshold). Right: GPQA-Diamond (0 samples removed). The low similarity confirms that training domains are well-separated from evaluation benchmarks, and that observed performance gains stem from reasoning structure acquisition rather than knowledge leakage.
  • Figure 3: D-CoT: Accuracy vs. Average Output Tokens (Qwen3-8B, 0-shot). (A) MMLU-Pro (12k questions). (B) GPQA-Diamond (5-seed average). D-CoT conditions (red) cluster in the upper-left (high accuracy, low tokens), demonstrating a Pareto improvement over the Baseline conditions (blue). The inter-group separation far exceeds variation from temperature or prompt settings.
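
The similarity-based filtering described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the caption gives only the 0.55 threshold, so the choice to compare each training sample against its most similar test item, the function names `cosine` and `decontaminate`, and the toy 2-D vectors standing in for text embeddings are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def decontaminate(train_vecs, test_vecs, threshold=0.55):
    """Split training indices into (kept, removed).

    Assumption: a training sample is removed when its highest cosine
    similarity to any test item is above the threshold. The caption
    states the 0.55 cutoff but not the aggregation rule.
    """
    kept, removed = [], []
    for i, train_vec in enumerate(train_vecs):
        max_sim = max(cosine(train_vec, test_vec) for test_vec in test_vecs)
        if max_sim > threshold:
            removed.append(i)
        else:
            kept.append(i)
    return kept, removed


# Toy embeddings (hypothetical): sample 1 is nearly orthogonal to the test item,
# while samples 0 and 2 point in a similar direction to it.
train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
test = [[1.0, 0.1]]
kept, removed = decontaminate(train, test)
```

In practice the vectors would come from a sentence-embedding model, which the caption does not name; the threshold is exposed as a parameter so the cutoff can be varied.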