Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps

Yen-Che Hsiao, Abhishek Dutta

TL;DR

The paper investigates when decoder-only language models acquire reasoning abilities via in-context learning and chain-of-thought (CoT) prompting. Evaluating 23 open-source LMs on commonsense (CSQA) and deductive (PrOntoQA-OOD) tasks, it identifies a critical threshold of about 1.6 billion parameters beyond which reasoning performance improves dramatically, with lower thresholds for specific deduction rules (roughly 1.1B for disjunction elimination and 1.5B for proof by contradiction). It also shows that fine-tuning small models on task-specific exemplars substantially boosts reasoning, enabling correct CoT generation for several rules even without exemplars in the prompt, though longer reasoning chains remain challenging. An attention-map analysis shows that successful CoT generation correlates with higher token-level attention on the next correct token and on the relevant parts of speech, offering a path to interpretability and targeted improvements in reasoning behavior.

Abstract

This study investigates the in-context learning capabilities of various decoder-only transformer-based language models with different model sizes and training data, including GPT2, SmolLM2, OpenELM, TinyLlama, Stable LM, and Gemma 2. We identify a critical parameter threshold (~1.6 billion), beyond which reasoning performance improves significantly in tasks such as commonsense reasoning in multiple-choice question answering and deductive reasoning. Specifically, models above this threshold achieve better success rates in chain-of-thought (CoT) prompting for deductive reasoning tasks, especially those requiring longer reasoning chains, such as proof by contradiction and disjunction elimination. To address limitations in sub-threshold models, we demonstrate that fine-tuning with task-specific exemplars substantially enhances reasoning performance, enabling accurate CoT generation even without additional exemplars in the prompt for tasks with shorter reasoning chains. Finally, our analysis of attention maps reveals that models capable of generating correct CoTs exhibit higher token-level attention scores on subsequent correct tokens and the correct parts of speech, providing interpretability insights into reasoning processes. These findings collectively advance understanding of reasoning capabilities in decoder-only transformer-based models. The code is available at: https://github.com/AnnonymousForPapers/CoT_Reasoning_Test.

Paper Structure

This paper contains 32 sections and 21 figures.

Figures (21)

  • Figure 1: Accuracy of different LMs on the 1221 multiple-choice questions in the validation set of the CSQA dataset (Talmor et al., 2019). Circle markers denote GPT2 models (Radford et al., 2019), upside-down triangles denote SmolLM2 models (Allal et al., 2024), squares denote OpenELM models (Mehta et al., 2024), the plus marker denotes the 1.1B TinyLlama model (Zhang et al., 2024), stars denote Stable LM 2 models (Bellagente et al., 2024), and diamonds denote Gemma 2 models (Gemma Team, 2024). Colors distinguish models within the same family.
  • Figure 2: Number of responses from which the answer could not be parsed, for different LMs on the 1221 multiple-choice questions in the validation set of the CSQA dataset (Talmor et al., 2019). The blue, orange, green, red, pink, and brown bars show the counts for models from GPT2 (Radford et al., 2019), SmolLM2 (Allal et al., 2024), OpenELM (Mehta et al., 2024), TinyLlama (Zhang et al., 2024), Stable LM 2 (Bellagente et al., 2024), and Gemma 2 (Gemma Team, 2024), respectively.
  • Figure 3: Accuracy of different LMs on 100 deductive reasoning questions generated by the PrOntoQA-OOD data generation code (Saparov et al., 2023), on six different deduction rules, shown for each model.
  • Figure 4: Visualization of the normalized token-level scores (Kang and Shin, 2023) computed from the first head in the last layer of the gpt2 model and the gemma2-9b-it model loaded with float16. Both models are prompted with the CoT prompt for the first proof of the implication elimination task from the PrOntoQA-OOD dataset (Saparov et al., 2023), concatenated with "Wren is a sterpus. Sterpuses are transparent. Wren is". The saturation of the background color is proportional to the normalized token-level score, as indicated by the color bar on the right of each figure.
  • Figure 5: The 7 exemplars for the CoT experiments on the CSQA dataset (Talmor et al., 2019), adopted from Wei et al. (2022).
  • ...and 16 more figures
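
As a rough illustration of the kind of normalized token-level score visualized in Figure 4, the sketch below scores each token by the total attention it receives within a single head and then scales the scores by their maximum. This definition (column sum of one head's attention matrix, max-normalized) is an assumption made for illustration and may differ from the exact formulation of Kang and Shin (2023) or from the paper's code. The function name `token_scores` and the toy attention matrix are hypothetical.

```python
def token_scores(attn):
    """Return max-normalized attention received by each token.

    attn[i][j] is the attention weight from query token i to key token j
    for a single head; in a causal decoder each row sums to 1 and
    attn[i][j] == 0 for j > i.
    """
    n = len(attn)
    # Assumed score: total attention a token receives (column sum).
    received = [sum(attn[i][j] for i in range(n)) for j in range(n)]
    top = max(received)
    # Scale so that the most-attended token has score 1.0.
    return [r / top for r in received]


# Toy causal attention matrix for four tokens (illustrative values only).
tokens = ["Wren", "is", "a", "sterpus"]
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.25, 0.25, 0.5, 0.0],
    [0.25, 0.25, 0.25, 0.25],
]
scores = token_scores(attn)
for token, score in zip(tokens, scores):
    print(f"{token}: {score:.3f}")
```

In practice the attention matrix would come from a real model head rather than a hand-written list; the toy matrix only shows how earlier tokens in a causal model tend to accumulate more received attention under this particular definition.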