CoT-ICL Lab: A Synthetic Framework for Studying Chain-of-Thought Learning from In-Context Demonstrations

Vignesh Kothapalli, Hamed Firooz, Maziar Sanjabi

TL;DR

CoT-ICL Lab is a synthetic, tokenized framework for studying chain-of-thought in-context learning that decouples the causal structure ${\mathcal{G}}$ from the token-processing functions ${\mathcal{H}}$. It supports controlled experiments with DAG-based reasoning, multi-input ICL sequences, and varied token processors, using decoder-only transformers of up to ${730\times 10^6}$ parameters trained on the generated data. The main findings are that chain-of-thought prompts accelerate accuracy transitions across model sizes; that depth is crucial when in-context demonstrations are limited, while more examples help shallower models; and that constraining the diversity of token processors can improve causal-structure learning. Embedding-alignment and attention analyses show how models infer the DAG. The framework also points to connections with NLP, such as faster adaptation for pre-trained models and sparse attention patterns in reasoning, making it a versatile testbed for theoretical and empirical exploration of CoT-ICL in language tasks. Although the setting is synthetic, the results show how the DAG, the processing functions, and the vocabulary shape in-context learning dynamics, and they offer guidance for future investigations into reasoning in large language models.

Abstract

We introduce CoT-ICL Lab, a framework and methodology to generate synthetic tokenized datasets and systematically study chain-of-thought (CoT) in-context learning (ICL) in language models. CoT-ICL Lab allows fine-grained control over the complexity of in-context examples by decoupling (1) the causal structure involved in chain token generation from (2) the underlying token processing functions. We train decoder-only transformers (up to 700M parameters) on these datasets and show that CoT accelerates the accuracy transition to higher values across model sizes. In particular, we find that model depth is crucial for leveraging CoT with limited in-context examples, while more examples help shallow models match the performance of deeper models. Additionally, limiting the diversity of token processing functions throughout training improves causal structure learning via ICL. We also interpret these transitions by analyzing transformer embeddings and attention maps. Overall, CoT-ICL Lab serves as a simple yet powerful testbed for theoretical and empirical insights into ICL and CoT in language models.
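
To make the decoupling concrete, the following is a minimal sketch of how one in-context example might be generated from a DAG and a token processor. It is not the authors' implementation. It assumes that $N$ is the number of input tokens, $M$ the number of chain tokens, $C$ the number of parents per chain token, $d$ the data-embedding dimension, and $K$ the number of in-context examples, matching the notation in the figure captions. The helper names (`make_dag`, `make_processor`, `generate_example`), the mean-pooling of parent embeddings, and the argmax readout are illustrative assumptions.

```python
import random


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def make_dag(n_inputs, n_chain, n_parents, rng):
    """Causal structure g: for each chain token, sample parent positions
    among all earlier tokens (inputs plus previously generated chain tokens)."""
    dag = []
    for j in range(n_chain):
        available = n_inputs + j
        dag.append(sorted(rng.sample(range(available), n_parents)))
    return dag


def make_processor(d, vocab_size, rng):
    """Token processor h: a one-hidden-layer MLP with LeakyReLU, followed by
    a linear readout to vocabulary logits (the readout is an assumption)."""
    w_hidden = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]
    w_out = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(vocab_size)]

    def h(vec):
        hidden = [leaky_relu(sum(w * x for w, x in zip(row, vec))) for row in w_hidden]
        logits = [sum(w * x for w, x in zip(row, hidden)) for row in w_out]
        return max(range(vocab_size), key=logits.__getitem__)

    return h


def generate_example(dag, h, embeddings, n_inputs, rng):
    """Sample N input tokens, then produce each chain token from its parents."""
    d = len(embeddings[0])
    tokens = [rng.randrange(len(embeddings)) for _ in range(n_inputs)]
    for parent_ids in dag:
        # Assumption: parent embeddings are mean-pooled before applying h.
        pooled = [
            sum(embeddings[tokens[p]][i] for p in parent_ids) / len(parent_ids)
            for i in range(d)
        ]
        tokens.append(h(pooled))
    return tokens


rng = random.Random(0)
N, M, C, d, vocab_size, K = 4, 4, 2, 10, 1024, 3
embeddings = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(vocab_size)]  # E_data
dag = make_dag(N, M, C, rng)             # g sampled from G(M, N, C)
h = make_processor(d, vocab_size, rng)   # h sampled from H(1, LeakyReLU)
# All K examples in a prompt share the same g and h.
prompt = [generate_example(dag, h, embeddings, N, rng) for _ in range(K)]
example = prompt[0]
```

Because the DAG and the processor are sampled independently, either one can be held fixed while the other is varied, which is the kind of control over example complexity the abstract describes.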

Paper Structure

This paper contains 75 sections, 9 equations, 32 figures, 3 tables, 1 algorithm.

Figures (32)

  • Figure 1: CoT-ICL Lab overview (right) and comparison to CoT-ICL in NLP (left). The figure on the left illustrates a scenario where $2$ CoT examples (colored green and blue) are available in-context along with a question (colored orange). A corresponding scenario using CoT-ICL Lab is presented on the right, where we model the causal structure via the DAG $g \in {\mathcal{G}}$ and process the data embeddings ${\mathbf{E}}_{\texttt{data}}$ using the token processor function $h \in {\mathcal{H}}$.
  • Figure 2: ${\texttt{accuracy}}$ by varying ${\mathcal{V}}$ with ${\mathcal{G}}(M=4,N=4,C=2),{\mathcal{H}}(1, \texttt{LeakyRelu}), d=10, K=30$.
  • Figure 3: ${\texttt{accuracy}}$ by varying ${\mathcal{V}}$ with ${\mathcal{G}}(M=4,N=4,C=2),{\mathcal{H}}(1, \texttt{LeakyRelu}), d=10, K=40$.
  • Figure 4: $\texttt{sim}({\mathbf{E}}_{\texttt{data}}, {\mathbf{E}}_{\texttt{TF}})$ by varying ${\mathcal{V}}$ with ${\mathcal{G}}(M=4,N=4,C=2), {\mathcal{H}}(1, \texttt{LeakyRelu}), d=10, K=40$.
  • Figure 5: ${\texttt{accuracy}}$ by varying $C$ with ${\mathcal{G}}(N=4,M=4),{\mathcal{H}}(1, \texttt{LeakyRelu}), d=10,|{\mathcal{V}}|=1024,K=40$.
  • ...and 27 more figures

Theorems & Definitions (1)

  • Definition A.1