Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning

Akshara Prabhakar, Thomas L. Griffiths, R. Thomas McCoy

TL;DR

The paper investigates what drives the effectiveness of Chain-of-Thought (CoT) prompting in large language models, using shift-cipher decoding as a controlled task. It shows that CoT performance is shaped by three factors: output probability, memorization from pretraining, and a noisy, bidirectional reasoning process. Number-CoT achieves near-perfect results with GPT-4, which suggests that CoT on the letter-based task is not pure symbolic reasoning. Through logistic regression and careful data design, the work disentangles the probability, memorization, and noise components, and concludes that CoT combines elements of memorization and genuine reasoning. The findings inform how CoT results should be interpreted and how prompts should be designed, highlighting the role of priors and training-data frequency in shaping step-by-step reasoning.

Abstract

Chain-of-Thought (CoT) prompting has been shown to enhance the multi-step reasoning capabilities of Large Language Models (LLMs). However, debates persist about whether LLMs exhibit abstract generalization or rely on shallow heuristics when given CoT prompts. To understand the factors influencing CoT reasoning, we provide a detailed case study of the symbolic reasoning task of decoding shift ciphers, where letters are shifted forward some number of steps in the alphabet. We analyze the pattern of results produced by three LLMs -- GPT-4, Claude 3, and Llama 3.1 -- performing this task using CoT prompting. By focusing on a single relatively simple task, we are able to identify three factors that systematically affect CoT performance: the probability of the task's expected output (probability), what the model has implicitly learned during pre-training (memorization), and the number of intermediate operations involved in reasoning (noisy reasoning). We show that these factors can drastically influence task accuracy across all three LLMs; e.g., when tested with GPT-4, varying the output's probability of occurrence shifts accuracy from 26% to 70%. Overall, we conclude that CoT prompting performance reflects both memorization and a probabilistic version of genuine reasoning. Code and data are available at https://github.com/aksh555/deciphering_cot
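
For readers unfamiliar with the task, the following is a minimal sketch of a shift cipher as the abstract describes it: encoding moves each letter forward a fixed number of positions, and decoding moves it back. The function names (`shift_letter`, `shift_encode`, `shift_decode`) are illustrative assumptions and are not taken from the authors' repository.

```python
def shift_letter(ch: str, k: int) -> str:
    """Move one ASCII letter k positions through the alphabet, wrapping around.

    Non-letters (spaces, digits, punctuation) are returned unchanged.
    """
    if "a" <= ch <= "z":
        base = ord("a")
    elif "A" <= ch <= "Z":
        base = ord("A")
    else:
        return ch
    return chr((ord(ch) - base + k) % 26 + base)


def shift_encode(text: str, shift: int) -> str:
    """Encode by shifting every letter `shift` positions forward."""
    return "".join(shift_letter(ch, shift) for ch in text)


def shift_decode(text: str, shift: int) -> str:
    """Decode by shifting every letter `shift` positions backward."""
    return "".join(shift_letter(ch, -shift) for ch in text)


if __name__ == "__main__":
    message = "stay here"
    encoded = shift_encode(message, 13)
    print(encoded)
    print(shift_decode(encoded, 13))
```

A program decodes every shift level equally well, which is what makes the task a useful probe: any dependence of accuracy on the shift level or on the likelihood of the decoded text must come from the model rather than from the task.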

Paper Structure

This paper contains 26 sections, 1 equation, 11 figures, 1 table.

Figures (11)

  • Figure 1: Overview. (1) Task: We have LLMs decode messages written in a shift cipher, in which each letter is shifted a fixed number of positions forward in the alphabet. (2) With standard prompting, GPT-4 performs poorly across most shift levels. (3) However, GPT-4 scores nearly perfectly on an isomorphic task based on numbers rather than letters. (4) With CoT prompting, GPT-4 adopts probabilistic and memorization-influenced noisy reasoning. That is, its performance (right) combines the trends we have hypothesized for each of the three factors on the left.
  • Figure 4: Hypothetical accuracy vs. shift level for various types of reasoning. Under noisy one-way reasoning, the model only shifts letters backward; under noisy two-way reasoning, it takes the shorter of the forward and backward paths. The hypothetical memorization accuracy is based on shift-level frequencies in internet corpora. Probabilistic reasoning would involve much higher scores on high-probability outputs than on low-probability outputs.
  • Figure 5: The logistic regression curve captures the overall trend exhibited by GPT-4.
  • Figure 6: Actual overall decoding accuracy vs. faithful accuracy across shift levels, showing the effects of probability. The effect is amplified for low-probability outputs, as seen in the larger drop in accuracy between the orange and blue bin 5 (low probability) lines.
  • Figure 7: Normalized frequency distribution vs. predicted $shift\_level$ of step answers for rot-20 to rot-23. The appearance of peaks at $26 - shift\_level$ in Math-CoT and Text-CoT prompts showcases the model's noisy attempt at taking the shorter path, i.e., moving $26-x$ shifts forward.
  • ...and 6 more figures
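
The one-way and two-way hypotheses named in the Figure 4 and Figure 7 captions differ only in how many letter steps a decoder takes per character. The sketch below counts those steps and checks that moving backward by the shift level reaches the same letter as moving forward by 26 minus the shift level. The helper names (`steps_one_way`, `steps_two_way`, `move`) are hypothetical, and the code illustrates the step counts only; it does not model the noise or reproduce the paper's regression.

```python
def steps_one_way(shift: int) -> int:
    """Steps per letter if the decoder only ever moves backward."""
    return shift % 26


def steps_two_way(shift: int) -> int:
    """Steps per letter if the decoder takes the shorter direction."""
    k = shift % 26
    return min(k, 26 - k)


def move(ch: str, k: int) -> str:
    """Move a lowercase letter k positions (negative k moves backward)."""
    return chr((ord(ch) - ord("a") + k) % 26 + ord("a"))


if __name__ == "__main__":
    for shift in (1, 13, 20, 23, 25):
        print(shift, steps_one_way(shift), steps_two_way(shift))
    # Decoding rot-20: 20 steps backward equals 6 steps forward.
    print(move("u", -20), move("u", 26 - 20))
```

Under the two-way count, the number of steps peaks at shift level 13 and falls off symmetrically on both sides, while under the one-way count it grows with the shift level all the way to 25.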