Understanding the Effect of Noise in LLM Training Data with Algorithmic Chains of Thought

Alex Havrilla; Maia Iyer

TL;DR

This paper investigates how noise in chain-of-thought (CoT) data affects downstream performance by introducing the Traced Integer (TInt) framework to generate programmable, noisy CoT traces for algorithmic tasks. It distinguishes static (local) and dynamic (global) noise, and studies their impact during fine-tuning and prompting. Key findings show that models are remarkably robust to static CoT noise, while dynamic noise severely degrades performance, especially for complex operations and prompting scenarios. The results suggest prioritizing the removal of dynamically noisy CoT samples during training and offer insights into CoT design, trace visibility, and robustness considerations for future LLM training and distillation. Overall, the work provides actionable guidance for noise filtering and contributes a flexible methodology for controlled CoT noise experimentation.

Abstract

During both pretraining and fine-tuning, Large Language Models (LLMs) are trained on trillions of tokens of text of widely varying quality. Both phases of training typically involve heuristically filtering out "low-quality" or noisy training samples, yet little is known quantitatively about how the type or intensity of noise affects downstream performance. In this work, we study how noise in chain of thought (CoT) impacts task performance in the highly controlled setting of algorithmically solvable tasks. First, we develop the Traced Integer (TInt) framework to generate highly customizable noised execution traces for any arithmetic function on lists of integers. We then define two types of noise: static noise, a local form of noise which is applied after the CoT trace is computed, and dynamic noise, a global form of noise which propagates errors in the trace as it is computed. We then evaluate the test performance of pretrained models both prompted and fine-tuned on noised datasets with varying levels of dataset contamination and intensity. We find fine-tuned models are extremely robust to high levels of static noise but struggle significantly more with lower levels of dynamic noise. In contrast, few-shot prompted models appear more sensitive to even static noise. We conclude with a discussion of how our findings impact noise filtering best practices, in particular emphasizing the importance of removing samples containing destructive dynamic noise with global errors.
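The distinction between static and dynamic noise can be illustrated with a minimal sketch on a digit-by-digit addition trace. This is not the TInt framework itself: the function names `add_trace` and `static_noise`, the trace format, and the choice to leave the final answer line untouched under static noise are all assumptions made for illustration.

```python
import random

def add_trace(a, b, corrupt_step=None):
    """Return (trace_lines, result) for digit-by-digit addition of a and b.

    Hypothetical trace format. If corrupt_step is given, the carry is flipped
    at that step *while* the trace is being computed, so every later step and
    the final answer inherit the error (a sketch of dynamic noise).
    """
    xs = [int(d) for d in str(a)][::-1]  # least significant digit first
    ys = [int(d) for d in str(b)][::-1]
    n = max(len(xs), len(ys))
    xs += [0] * (n - len(xs))
    ys += [0] * (n - len(ys))

    carry, digits, lines = 0, [], []
    for i in range(n):
        total = xs[i] + ys[i] + carry
        digit, carry = total % 10, total // 10
        if i == corrupt_step:
            carry = 1 - carry  # inject the error into the running state
        digits.append(digit)
        lines.append(f"step {i}: {xs[i]} + {ys[i]} -> digit {digit}, carry {carry}")
    if carry:
        digits.append(carry)
    result = int("".join(str(d) for d in reversed(digits)))
    lines.append(f"answer: {result}")
    return lines, result

def static_noise(lines, n_c, rng):
    """Replace each digit in the intermediate steps with a random digit with
    probability n_c, *after* the trace has been computed (static noise).

    Assumption: the final answer line is left untouched.
    """
    noised = []
    for line in lines[:-1]:
        noised.append("".join(
            str(rng.randrange(10)) if ch.isdigit() and rng.random() < n_c else ch
            for ch in line
        ))
    return noised + [lines[-1]]

clean_lines, clean_result = add_trace(457, 168)
static_lines = static_noise(clean_lines, n_c=0.5, rng=random.Random(0))
dynamic_lines, dynamic_result = add_trace(457, 168, corrupt_step=0)
```

In this sketch the statically noised trace still ends in the correct answer, because the corruption is local to the already computed steps, whereas the dynamically noised trace stays internally consistent after the flipped carry but arrives at a wrong answer.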

Paper Structure (13 sections, 13 figures, 5 tables)

Introduction
Related Work
Methods: Generating Noisy Algorithmic Chain of Thought
The TInt Framework
Adding noise to algorithmic CoT
Experiments
Fine-tuning on Noisy Chain of Thought
Prompting with Noisy Chain of Thought
Conclusion and Broader Impact
Baselines for Noise-Free Algorithmic Chain of Thought
Additional Results for Noisy Fine-tuning
Additional Results for Noisy Prompting
CoT Design

Figures (13)

Figure 1: Example of an LLM preventing error propagation by attending to all prior steps. Naively, the LLM might predict 0 = 40 after the incorrect Step 3. However, Steps 1 and 2 act as a mechanism to "downvote" the bad influence of the incorrect Step 3 while simultaneously "upvoting" the correct Step 4 prediction.
Figure 2: Example prefix of an algorithmic CoT for addition.
Figure 3: Test accuracy on up to 10-digit addition, 10-digit subtraction, 5-digit multiplication, 5-digit division, and up to 10-number median finding. Using Algorithmic CoT significantly improves performance on all tasks. Note: The equal mixture task trains and evaluates models on all four arithmetic operations. Note: Direct training is done for 100 epochs (10 more than for CoT).
Figure 4: Plot of model test accuracy on len 1-10 addition and median vs. the dataset noise level (percent of samples with noise). Multiple levels of character noise intensity $n_c$ (percent of randomly flipped digits in a noised sample) are plotted. Algorithmic CoT retains perfect accuracy on all noise levels except when $n_d = 1.0, n_c = 0.9$. CoT refers to experiments with CoT training data while Direct refers to experiments with training data of only the answer. The median task also remains unaffected by noise when the character noise intensity is low ($< 0.5$) but appears more sensitive at high levels of dataset contamination and noise intensity.
Figure 5: Plot of model test accuracy on len 1-10 addition and median vs. the dataset noise level (percent of samples with noise). Multiple levels of line noise intensity $n_l$ (percent of randomly deleted lines in a noised sample) are plotted. As with character level noise, both tasks are robust to lower levels of line noise intensity and dataset noise level, with addition being extremely robust at even higher levels.
...and 8 more figures
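
The two knobs used in the captions of Figures 4 and 5, the dataset noise level $n_d$ (fraction of samples that receive noise) and the line noise intensity $n_l$ (fraction of lines deleted within a noised sample), can be sketched as follows. The helper names `delete_lines` and `contaminate`, the list-of-lines trace format, and the decision to always keep the final answer line are assumptions for illustration, not the paper's implementation.

```python
import random

def delete_lines(trace, n_l, rng):
    """Drop each intermediate line of a trace with probability n_l.

    Assumption: the last line holds the answer and is always kept.
    """
    steps, answer = trace[:-1], trace[-1]
    kept = [s for s in steps if rng.random() >= n_l]
    return kept + [answer]

def contaminate(dataset, n_d, n_l, seed=0):
    """Apply line-deletion noise to a fraction n_d of the traces in dataset."""
    rng = random.Random(seed)
    k = round(n_d * len(dataset))  # number of samples to noise
    noised_idx = set(rng.sample(range(len(dataset)), k))
    return [
        delete_lines(trace, n_l, rng) if i in noised_idx else list(trace)
        for i, trace in enumerate(dataset)
    ]

# Toy dataset: 10 traces, each with 4 intermediate steps and an answer line.
toy = [[f"step {j}" for j in range(4)] + [f"answer: {i}"] for i in range(10)]
half_noised = contaminate(toy, n_d=0.5, n_l=1.0)
```

With `n_d=0.5` and `n_l=1.0`, exactly half of the toy traces are reduced to their answer line while the other half are left intact, which separates the question of how many samples are noised from how heavily each one is noised.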
