The Quest for Efficient Reasoning: A Data-Centric Benchmark to CoT Distillation

Ruichen Zhang, Rana Muhammad Shahroz Khan, Zhen Tan, Dawei Li, Song Wang, Tianlong Chen

TL;DR

The paper introduces DC-CoT, a unified benchmark to systematically study data-centric Chain-of-Thought distillation, focusing on augmentation, selection, and mixing across method, model, and data perspectives. By evaluating diverse teacher–student pairings and reasoning tasks, it shows that data augmentation—especially reverse reasoning—yields the strongest gains, while data filtering via LLM-based judges and careful mixing offer nuanced benefits. The results highlight the importance of teacher–student compatibility, data quality, and dataset characteristics for IID/OOD generalization and cross-domain transfer. Overall, DC-CoT provides actionable guidelines to optimize CoT distillation for smaller, more capable reasoning models and sets a foundation for future data-centric improvements in efficient LLM reasoning.

Abstract

Data-centric distillation, including data augmentation, selection, and mixing, offers a promising path to creating smaller, more efficient student Large Language Models (LLMs) that retain strong reasoning abilities. However, the field still lacks a comprehensive benchmark for systematically assessing the effect of each distillation approach. This paper introduces DC-CoT, the first data-centric benchmark that investigates data manipulation in chain-of-thought (CoT) distillation from method, model, and data perspectives. Using various teacher models (e.g., o4-mini, Gemini-Pro, Claude-3.5) and student architectures (e.g., 3B, 7B parameters), we rigorously evaluate the impact of these data manipulations on student model performance across multiple reasoning datasets, with a focus on in-distribution (IID) and out-of-distribution (OOD) generalization and on cross-domain transfer. Our findings aim to provide actionable insights and establish best practices for optimizing CoT distillation through data-centric techniques, ultimately facilitating the development of more accessible and capable reasoning models. The dataset can be found at https://huggingface.co/datasets/rana-shahroz/DC-COT, and our code is shared at https://anonymous.4open.science/r/DC-COT-FF4C/.

Paper Structure

This paper contains 28 sections, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Overview of DC-CoT pipeline.
  • Figure 2: Data-centric augmentation flow. Teacher CoT traces are independently transformed by four operations: Rephrase Question, Question Augmentation, Answer Augmentation, and Reverse Thinking.
  • Figure 3: Reverse-Thinking augmentation pipeline: from each (question, answer) pair, generate forward reasoning, synthesize a backward question with its reasoning, then keep only examples whose forward-backward chains pass a consistency check.
  • Figure 3: Impact of teacher model on agentic (WebArena) and visual (Visual-CoT) performance.
  • Figure 4: Data-filtering pipeline in DC-CoT. A teacher-generated CoT pool is refined through three selectors.
  • ...and 4 more figures
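
The Reverse-Thinking caption above describes a three-step loop: generate forward reasoning, synthesize a backward question with its reasoning, and keep only the examples that pass a consistency check. A minimal Python sketch of that loop follows. It is an illustration under stated assumptions, not the paper's implementation: the `teacher` callable, the prompt wording, the `is_consistent` checker, and the `ReverseExample` record are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class ReverseExample:
    """One augmented training example (hypothetical record layout)."""
    question: str
    answer: str
    forward_reasoning: str
    backward_question: str
    backward_reasoning: str


# Assumption: the teacher is any function mapping a prompt string to text.
Teacher = Callable[[str], str]
# Assumption: the checker returns True when forward and backward chains agree.
Checker = Callable[[ReverseExample], bool]


def reverse_thinking_augment(
    pairs: Iterable[Tuple[str, str]],
    teacher: Teacher,
    is_consistent: Checker,
) -> List[ReverseExample]:
    """Build forward/backward reasoning examples and keep the consistent ones."""
    kept: List[ReverseExample] = []
    for question, answer in pairs:
        # Step 1: forward reasoning from the question to the known answer.
        forward = teacher(
            f"Question: {question}\nAnswer: {answer}\n"
            "Explain step by step how to reach this answer."
        )
        # Step 2: a backward question that starts from the answer.
        backward_q = teacher(
            f"Question: {question}\nAnswer: {answer}\n"
            "Write a new question that starts from the answer and asks "
            "for a quantity given in the original question."
        )
        # Step 3: reasoning for the backward question.
        backward_r = teacher(
            f"Question: {backward_q}\nReason step by step and answer."
        )
        example = ReverseExample(question, answer, forward, backward_q, backward_r)
        # Step 4: filter on forward-backward consistency.
        if is_consistent(example):
            kept.append(example)
    return kept
```

Passing the teacher and the checker as plain callables keeps the sketch independent of any particular model API; in practice the checker could itself be an LLM judge or a rule-based comparison, which the caption does not specify.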