Understanding the Effect of Noise in LLM Training Data with Algorithmic Chains of Thought
Alex Havrilla, Maia Iyer
TL;DR
This paper investigates how noise in chain-of-thought (CoT) data affects downstream performance by introducing the Traced Integer (TInt) framework to generate programmable, noisy CoT traces for algorithmic tasks. It distinguishes static (local) and dynamic (global) noise, and studies their impact during fine-tuning and prompting. Key findings show that fine-tuned models are remarkably robust to static CoT noise, while dynamic noise severely degrades performance, especially for complex operations and in prompting scenarios. The results suggest prioritizing the removal of dynamically noisy CoT samples during training and offer insights into CoT design, trace visibility, and robustness considerations for future LLM training and distillation. Overall, the work provides actionable guidance for noise filtering and contributes a flexible methodology for controlled CoT noise experimentation.
Abstract
During both pretraining and fine-tuning, Large Language Models (\textbf{LLMs}) are trained on trillions of tokens of text of widely varying quality. Both phases of training typically involve heuristically filtering out ``low-quality'' or \textit{noisy} training samples, yet little is known quantitatively about how the type or intensity of noise affects downstream performance. In this work, we study how noise in chain of thought (\textbf{CoT}) impacts task performance in the highly controlled setting of algorithmically solvable tasks. First, we develop the Traced Integer (\textbf{TInt}) framework to generate highly customizable noised execution traces for any arithmetic function on lists of integers. We then define two types of noise: \textit{static} noise, a local form of noise which is applied after the CoT trace is computed, and \textit{dynamic} noise, a global form of noise which propagates errors in the trace as it is computed. Next, we evaluate the test performance of pretrained models both prompted and fine-tuned on noised datasets with varying levels of dataset contamination and intensity. We find fine-tuned models are extremely robust to high levels of static noise but struggle significantly more with lower levels of dynamic noise. In contrast, few-shot prompted models appear more sensitive to even static noise. We conclude with a discussion of how our findings impact noise filtering best practices, in particular emphasizing the importance of removing samples containing destructive dynamic noise with global errors.
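The distinction between static and dynamic noise can be illustrated with a minimal sketch. This is not the TInt framework itself: the running-sum task, the function names (`clean_trace`, `static_noise`, `dynamic_noise`), and the off-by-one corruption are all assumptions chosen for illustration. Static noise perturbs steps of an already computed trace, so each error stays local; dynamic noise perturbs the running state during computation, so every later step inherits the error.

```python
import random


def clean_trace(xs):
    """Running-sum trace: each step records the partial sum so far."""
    trace, total = [], 0
    for x in xs:
        total += x
        trace.append(total)
    return trace


def static_noise(xs, p, rng):
    """Hypothetical static noise: corrupt steps AFTER the trace is computed.

    Each step is shifted by +/-1 with probability p, independently of the
    other steps, so an error never affects later entries.
    """
    trace = clean_trace(xs)
    return [t + rng.choice([-1, 1]) if rng.random() < p else t for t in trace]


def dynamic_noise(xs, p, rng):
    """Hypothetical dynamic noise: corrupt the state WHILE the trace is computed.

    The running total itself is shifted by +/-1 with probability p, so the
    error is carried into every subsequent step and the final answer.
    """
    trace, total = [], 0
    for x in xs:
        total += x
        if rng.random() < p:
            total += rng.choice([-1, 1])  # error enters the running state
        trace.append(total)
    return trace


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [3, 1, 4, 1, 5]
    print("clean:  ", clean_trace(xs))
    print("static: ", static_noise(xs, 0.4, rng))
    print("dynamic:", dynamic_noise(xs, 0.4, rng))
```

In this toy setting, a statically noised trace deviates from the clean one by at most one unit at any step, whereas a dynamically noised trace can drift arbitrarily far because the errors accumulate.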
