To Think or Not to Think: The Hidden Cost of Meta-Training with Excessive CoT Examples

Vignesh Kothapalli, Ata Fatahibaarzi, Hamed Firooz, Maziar Sanjabi

TL;DR

The paper addresses the brittleness of chain-of-thought in-context learning when task knowledge is novel or under-specified. It introduces CoT-ICL Lab-2.0, a framework with special tokens and DAG-driven sequence generation, and the CoT-Recipe method to modulate the mix of CoT and non-CoT examples during meta-training. Through controlled experiments on abstract reasoning and symbolic tasks, it demonstrates that a carefully tuned CoT-Recipe can significantly boost performance and enable reasoning even without CoT in-context prompts, including when transferring to pretrained LLMs like Qwen-2.5. It further shows that data diversity and forcing strategies influence length generalization and OOD performance, and provides practical guidance for selecting the CoT mix parameter $\alpha$. These results highlight a principled approach to shaping meta-training data for improved reasoning in transformers and large language models.

Abstract

Chain-of-thought (CoT) prompting combined with few-shot in-context learning (ICL) has unlocked significant reasoning capabilities in large language models (LLMs). However, ICL with CoT examples is ineffective on novel tasks when the pre-training knowledge is insufficient. We study this problem in a controlled setting using the CoT-ICL Lab framework, and propose meta-training techniques to learn novel abstract reasoning tasks in-context. Although CoT examples facilitate reasoning, we noticed that their excessive inclusion during meta-training degrades performance when CoT supervision is limited. To mitigate such behavior, we propose CoT-Recipe, a formal approach to modulate the mix of CoT and non-CoT examples in meta-training sequences. We demonstrate that careful modulation via CoT-Recipe can increase the accuracy of transformers on novel tasks by up to 300% even when there are no CoT examples available in-context. We confirm the broader effectiveness of these techniques by applying them to pretrained LLMs (Qwen2.5 series) for symbolic reasoning tasks and observing gains of up to 130% in accuracy.

Paper Structure

This paper contains 71 sections, 1 theorem, 14 equations, 45 figures, 3 tables, 6 algorithms.

Key Result

Theorem B.1

Consider a dataset ${\mathcal{D}}$ of $T$ sequences created using the tuple $(N, M, C, K)$. Let the CoT-Recipe with $a=1, b=0$ determine the CoT probability parameter $r^{(j)}_{\text{CoT}}$ for all $j \in [0, T-1]$ as follows: $r^{(j)}_{\text{CoT}} = \left(\frac{j}{T}\right)^{\alpha}$. Then th ..., where $\mathbb{E}\left[|{\mathcal{D}}|_{\text{CoT-ex}}\right]$ is given by:
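
To make the schedule $r^{(j)}_{\text{CoT}} = (j/T)^{\alpha}$ concrete, the following Python sketch evaluates it per sequence and sums the expected number of CoT examples numerically. It is an illustration under stated assumptions, not the paper's closed-form expression or implementation: it assumes that each of the $K$ examples in sequence $j$ is independently a CoT example with probability $r^{(j)}_{\text{CoT}}$, and the function names (`cot_probability`, `expected_cot_examples`, `sample_example_types`) are hypothetical.

```python
import random


def cot_probability(j: int, T: int, alpha: float) -> float:
    """Schedule from the theorem statement: r_CoT^(j) = (j / T) ** alpha (a=1, b=0)."""
    if T <= 0 or not 0 <= j < T:
        raise ValueError("need T > 0 and 0 <= j < T")
    if alpha < 0:
        raise ValueError("alpha is assumed non-negative in this sketch")
    return (j / T) ** alpha


def expected_cot_examples(T: int, K: int, alpha: float) -> float:
    """Expected count of CoT examples over T sequences of K examples each.

    Assumption (not confirmed by the text): every example in sequence j is an
    independent Bernoulli draw with success probability r_CoT^(j).
    """
    return sum(K * cot_probability(j, T, alpha) for j in range(T))


def sample_example_types(T: int, K: int, alpha: float, seed: int = 0) -> list[list[bool]]:
    """Sample, per sequence, which of the K examples are CoT (True) or standard (False)."""
    rng = random.Random(seed)
    return [
        [rng.random() < cot_probability(j, T, alpha) for _ in range(K)]
        for j in range(T)
    ]


if __name__ == "__main__":
    T, K = 1000, 8  # illustrative values, not taken from the paper
    for alpha in (0.0, 0.5, 1.0, 2.0, 4.0):
        print(alpha, expected_cot_examples(T, K, alpha))
```

Under this assumption, a larger $\alpha$ yields fewer CoT examples in expectation, because $(j/T)^{\alpha}$ shrinks as $\alpha$ grows for every $j/T < 1$, while $\alpha = 0$ makes every example a CoT example. For large $T$, the sum is close to $KT/(\alpha+1)$ by a Riemann-sum argument. The text above does not confirm whether this matches the expression stated in the paper.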

Figures (45)

  • Figure 1: The CoT-ICL Lab-2.0 framework. (1) We incorporate special tokens (marked in blue) to act as delimiter tokens between input, intermediate/thinking, and answer tokens. (2) Each sequence is constructed using a specific DAG, which determines the number of input and chain tokens per example. (3) The choice of $r_{CoT}$ is varied across sequences as per the CoT-Recipe and modulates the mix of CoT/standard examples for meta-training.
  • Figure 2: ${\texttt{accuracy}}$ of models trained with varying $\alpha$, ${\mathbf{N}}={\mathbf{M}}={\mathbf{C}}=\{4\}$ and evaluated on datasets with $\widetilde{{\mathbf{N}}}=\widetilde{{\mathbf{M}}}=\widetilde{{\mathbf{C}}}=\{4\}$. Here $K'$ indicates the number of standard examples in test prompts.
  • Figure 3: Input length generalization of TF-12 models when tested with $\widetilde{{\mathbf{N}}}=\{5\}, \widetilde{{\mathbf{M}}}=\widetilde{{\mathbf{C}}}=\{4\}$.
  • Figure 4: Chat template of a CoT/standard example in CIL-LangSym based on the Qwen-2.5-1.5B-Instruct tokenizer. Given $N=4, M=2, C=3$ and word length $W=8$, the DAG determines the ground truth causal dependencies, and the transform function illustrates the string processing of the $M$ parent words. We apply the chat template to differentiate the question, thinking, and final answer segments of the examples and also ensure that the task description does not reveal the underlying string transformation in natural language.
  • Figure 5: ${\texttt{accuracy}}$ of models trained with varying $\alpha$, ${\mathbf{N}}=\{4\}, {\mathbf{M}}=\{2\}, {\mathbf{C}}=\{3\}$ and evaluated on datasets with $\widetilde{{\mathbf{N}}}=\{4\}, \widetilde{{\mathbf{M}}}=\{2\}, \widetilde{{\mathbf{C}}}=\{3\}$.
  • ...and 40 more figures

Theorems & Definitions (2)

  • Theorem B.1
  • Proof