Larger Language Models Don't Care How You Think: Why Chain-of-Thought Prompting Fails in Subjective Tasks
Georgios Chochlakis, Niyantha Maruthu Pandiyan, Kristina Lerman, Shrikanth Narayanan
TL;DR
This work examines whether Chain-of-Thought prompting can mitigate the strong pull of task priors in subjective, multilabel tasks. Using MFRC and GoEmotions across six state-of-the-art LLMs, it formalizes task-prior proxies and compares CoT to standard ICL via similarity to ground truth and priors, plus manual reasoning assessments. The main finding is that CoT does not improve performance for larger models; these models develop reasoning priors that resemble CoT priors and can dominate evidence, leading to posterior predictions that align with priors rather than ground truth. The results challenge the efficacy of CoT for subjective tasks and highlight the need for new prompting or learning strategies to address priors in emotion and morality applications, with a formal framing around mappings $f:\mathcal{X}\to\mathcal{Y}$ and priors $p$, $p^I$, $p^{I,r}$, $p^{I,y}$.
Abstract
In-Context Learning (ICL) in Large Language Models (LLMs) has emerged as the dominant technique for performing natural language tasks, as it does not require updating the model parameters with gradient-based methods. ICL promises to "adapt" the LLM to perform the present task at a competitive or state-of-the-art level at a fraction of the computational cost. ICL can be augmented by explicitly including in the prompt the reasoning process that leads to the final label, a technique called Chain-of-Thought (CoT) prompting. However, recent work has found that ICL relies mostly on the retrieval of task priors and less so on "learning" to perform tasks, especially for complex subjective domains like emotion and morality, where priors ossify posterior predictions. In this work, we examine whether "enabling" reasoning also creates the same behavior in LLMs, wherein the format of CoT retrieves reasoning priors that remain relatively unchanged despite the evidence in the prompt. We find that, surprisingly, CoT indeed suffers from the same posterior collapse as ICL for larger language models. Code is available at https://github.com/gchochla/cot-priors.
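To make the notion of posterior predictions aligning with priors rather than ground truth concrete, the following is a minimal sketch of one way such a comparison could be computed for multilabel outputs. It is an illustration only: the use of Jaccard similarity, the function names `jaccard` and `mean_similarity`, and the toy label sets are all assumptions made here, not the metric, code, or data of the paper.

```python
def jaccard(a, b):
    """Jaccard similarity between two label sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_similarity(predictions, references):
    """Average per-example Jaccard similarity between two lists of label sets."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        raise ValueError("at least one example is required")
    pairs = zip(predictions, references)
    return sum(jaccard(p, r) for p, r in pairs) / len(predictions)


# Hypothetical toy data (not from MFRC or GoEmotions).
gold = [{"joy"}, {"anger", "disgust"}, {"sadness"}]        # ground-truth labels
prior = [{"joy"}, {"anger"}, {"joy"}]                      # assumed prior predictions
posterior = [{"joy"}, {"anger"}, {"joy", "sadness"}]       # assumed predictions with demonstrations

sim_to_gold = mean_similarity(posterior, gold)
sim_to_prior = mean_similarity(posterior, prior)

print(f"similarity to ground truth: {sim_to_gold:.3f}")
print(f"similarity to prior:        {sim_to_prior:.3f}")
```

In this toy setup the posterior predictions are closer to the prior predictions than to the ground truth, which is the pattern the abstract describes as priors ossifying posterior predictions.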
