Impact of Noise on LLM-Models Performance in Abstraction and Reasoning Corpus (ARC) Tasks with Model Temperature Considerations

Nikhil Khandalkar, Pavan Yadav, Krishna Shinde, Lokesh B. Ramegowda, Rajarshi Das

TL;DR

This work tackles the robustness of LLM-based abstraction and reasoning on ARC tasks under input/output perturbations, systematically varying noise level and model temperature. By selecting a representative task subset and employing 1×, 3×, and 9× k-shot regimes, the study quantifies how perturbations degrade exact and partial matching metrics and how prompt design and determinism influence resilience. Across architectures, results show pronounced vulnerability to even small noise, with deterministic prompts and explicit noise-awareness offering modest improvements; large cross-model gaps persist (e.g., GPT-4o vs. LLaMA). The findings underscore the need for robust reasoning mechanisms that generalize under uncertainty, with practical implications for deploying AI systems in noisy real-world environments and guiding future robustness-centric training and architectural innovations.

Abstract

Recent advancements in Large Language Models (LLMs) have generated growing interest in their structured reasoning capabilities, particularly in tasks involving abstraction and pattern recognition. The Abstraction and Reasoning Corpus (ARC) benchmark plays a crucial role in evaluating these capabilities by testing how well AI models generalize to novel problems. While GPT-4o demonstrates strong performance by solving all ARC tasks under zero-noise conditions, other models like DeepSeek R1 and LLaMA 3.2 fail to solve any, suggesting limitations in their ability to reason beyond simple pattern matching. To explore this gap, we systematically evaluate these models across different noise levels and temperature settings. Our results reveal that the introduction of noise consistently impairs model performance, regardless of architecture. This decline highlights a shared vulnerability: current LLMs, despite showing signs of abstract reasoning, remain highly sensitive to input perturbations. Such fragility raises concerns about their real-world applicability, where noise and uncertainty are common. By comparing how different model architectures respond to these challenges, we offer insights into the structural weaknesses of modern LLMs in reasoning tasks. This work underscores the need for developing more robust and adaptable AI systems capable of handling the ambiguity and variability inherent in real-world scenarios. Our findings aim to guide future research toward enhancing model generalization, robustness, and alignment with human-like cognitive flexibility.

Paper Structure

This paper contains 22 sections, 3 equations, 17 figures.

Figures (17)

  • Figure 1: Pictorial representation of task 272f95fa with 2-shot examples, without any noise introduced.
  • Figure 2: Illustration of the impact of noise on Task ID 272f95fa. (a) The original task. (b) The same task with 0.125% noise in the input grids. (c) The same task with 0.125% noise in the output grids. A pictorial representation of the new prompt is given in the appendix (\ref{appendix:effect_noise}).
  • Figure 3: Illustration of the impact of noise on Task ID 272f95fa. (a) The original task. (b) The same task with 0.125% noise in the input grids. (c) The same task with 0.125% noise in the output grids. A pictorial representation of the new prompt is given in the appendix (\ref{appendix:effect_noise}).
  • Figure 4: The figure presents a pictorial representation of Task ID 272f95fa, sourced from https://arcprize.org/play?task=272f95fa.
  • Figure 5: Impact of noise and model temperature on GPT-4o’s ability to solve ARC Task ID 272f95fa. The x-axis shows the noise level, the left y-axis shows the number of correct predictions (out of 30 evaluations), and the right y-axis shows the mean percentage of partially correct cells (out of 100%). Each subplot plots the number of correct predictions (solid lines) and the mean partial correctness percentage (dotted lines) against noise level for different example configurations. Subplots (a) and (b) correspond to model temperatures 0 and 1, respectively. The new prompt (right panels) improves on the original prompt (left panels) at low noise levels, particularly when more examples are provided; at higher noise levels, both prompts perform similarly poorly. Performance at temperature = 0 is significantly better than at temperature = 1, where variability increases and accuracy drops. Overall, lower noise, more examples, and lower temperature improve accuracy, while high noise and high temperature consistently degrade it.
  • ...and 12 more figures
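
The two quantities plotted in Figure 5 (correct predictions out of a fixed number of evaluations, and mean percentage of partially correct cells) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names (`add_noise`, `exact_match`, `partial_match_pct`, `summarize`) are hypothetical, and the noise model, which recolors a randomly chosen fraction of cells, is an assumption, since the summary above does not specify how noise is injected. Scoring a shape mismatch as 0% partial correctness is likewise an assumption.

```python
import random

# ARC grids are lists of rows whose cells are integer color codes 0-9.

def add_noise(grid, noise_fraction, rng, num_colors=10):
    """Return a copy of `grid` with a fraction of its cells recolored.

    ASSUMPTION: noise means replacing randomly chosen cells with a
    different color; the paper's exact procedure may differ.
    """
    rows, cols = len(grid), len(grid[0])
    n_cells = rows * cols
    n_flip = round(noise_fraction * n_cells)
    noisy = [row[:] for row in grid]  # leave the input grid untouched
    for idx in rng.sample(range(n_cells), n_flip):
        r, c = divmod(idx, cols)
        choices = [v for v in range(num_colors) if v != noisy[r][c]]
        noisy[r][c] = rng.choice(choices)
    return noisy

def exact_match(pred, target):
    """True only if the predicted grid equals the target cell for cell."""
    return pred == target

def partial_match_pct(pred, target):
    """Percentage of cells that match the target.

    ASSUMPTION: a prediction with the wrong shape scores 0.
    """
    if len(pred) != len(target) or any(
        len(p) != len(t) for p, t in zip(pred, target)
    ):
        return 0.0
    total = sum(len(row) for row in target)
    same = sum(
        p == t
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
    return 100.0 * same / total

def summarize(preds, target):
    """Return (number of exact matches, mean partial-match percentage)."""
    correct = sum(exact_match(p, target) for p in preds)
    mean_partial = sum(partial_match_pct(p, target) for p in preds) / len(preds)
    return correct, mean_partial
```

Under these assumptions, passing the 30 predictions collected for one noise level to `summarize` would give one solid-line point and one dotted-line point of the kind shown in Figure 5. Supplying a seeded `random.Random` instance as `rng` keeps the injected noise reproducible across runs.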