Evidence for Daily and Weekly Periodic Variability in GPT-4o Performance

Paul Tschisgale, Peter Wulff

TL;DR

Analysis of the temporal variability of GPT-4o's average performance indicates that, even under controlled conditions, LLM performance may vary periodically over time, calling into question the assumption of time invariance.

Abstract

Large language models (LLMs) are increasingly used in research both as tools and as objects of investigation. Much of this work implicitly assumes that LLM performance under fixed conditions (identical model snapshot, hyperparameters, and prompt) is time-invariant. If average output quality changes systematically over time, this assumption is violated, threatening the reliability, validity, and reproducibility of findings. To empirically examine this assumption, we conducted a longitudinal study on the temporal variability of GPT-4o's average performance. Using a fixed model snapshot, fixed hyperparameters, and identical prompting, GPT-4o was queried via the API to solve the same multiple-choice physics task every three hours for approximately three months. Ten independent responses were generated at each time point and their scores were averaged. Spectral (Fourier) analysis of the resulting time series revealed notable periodic variability in average model performance, accounting for approximately 20% of the total variance. In particular, the observed periodic patterns are well explained by the interaction of a daily and a weekly rhythm. These findings indicate that, even under controlled conditions, LLM performance may vary periodically over time, calling into question the assumption of time invariance. Implications for ensuring validity and replicability of research that uses or investigates LLMs are discussed.
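To make the "share of total variance" statement concrete, the following minimal sketch fits sinusoids with 24-hour and 168-hour periods to a synthetic three-hourly score series by least squares and reports the fraction of variance they explain. It is an illustration under stated assumptions, not the authors' analysis: the data are simulated, the function name `periodic_variance_share` is hypothetical, and the daily and weekly components are modelled as purely additive, which may differ from how the paper treats their interaction.

```python
import numpy as np

def periodic_variance_share(t_hours, scores, periods_hours=(24.0, 168.0)):
    """Return R^2 of a least-squares fit of sinusoids at the given periods.

    Assumption: daily and weekly components are additive (no product terms).
    """
    t = np.asarray(t_hours, dtype=float)
    y = np.asarray(scores, dtype=float)
    # Design matrix: intercept plus one sine/cosine pair per period.
    cols = [np.ones_like(t)]
    for period in periods_hours:
        omega = 2.0 * np.pi / period
        cols.append(np.sin(omega * t))
        cols.append(np.cos(omega * t))
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return 1.0 - residuals.var() / y.var()

# Synthetic stand-in for the averaged scores (NOT the paper's data):
# 12 weeks of measurements taken every three hours.
rng = np.random.default_rng(0)
t = np.arange(0.0, 24 * 7 * 12, 3.0)
periodic = 0.05 * np.sin(2 * np.pi * t / 24) + 0.05 * np.sin(2 * np.pi * t / 168)
scores = 0.6 + periodic + rng.normal(0.0, 0.1, t.size)

share = periodic_variance_share(t, scores)
noise_only_share = periodic_variance_share(t, rng.normal(0.6, 0.1, t.size))
print(f"periodic share of variance: {share:.2f}")
print(f"share for pure noise:       {noise_only_share:.2f}")
```

The amplitudes in the simulation were chosen so that the periodic part carries roughly one fifth of the variance, which is the order of magnitude reported in the abstract; the pure-noise call shows how small the fitted share is when no rhythm is present.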

Paper Structure (21 sections, 2 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: Conceptual illustration of how a spectral (Fourier) analysis operates for analyzing time-series data.
  • Figure 2: Visualization of temporal variability in the score data across different time scales.
  • Figure 3: Heatmap of average accuracy as a function of weekday (rows) and time of day (columns), where time of day corresponds to measurement time points taken every three hours. The top panel shows the marginal average accuracy for each time of day, averaged over all weekdays. The right panel shows the marginal average accuracy for each weekday, averaged over all times of day. All times and weekdays are based on Central European Summer Time (CEST, UTC+2).
  • Figure 4: Power spectrum estimated via the fast Fourier transform using Welch’s method with a Hann window. The grey shaded band indicates the 95% permutation-based significance threshold; labeled spectral peaks exceeding this band are considered statistically significant.
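
The kind of analysis described for Figure 4 can be sketched as follows: a Welch power spectrum with a Hann window, compared against a threshold obtained by shuffling the series in time. This is a hedged illustration only. The data are simulated, the function name `welch_with_permutation_threshold`, the segment length, and the number of permutations are assumptions, and the threshold here is a pointwise 95th percentile per frequency bin, which may not match how the authors constructed their band.

```python
import numpy as np
from scipy.signal import welch

def welch_with_permutation_threshold(y, fs, nperseg, n_perm=200, q=95, seed=0):
    """Welch spectrum (Hann window) plus a permutation-based threshold.

    Shuffling destroys temporal structure while keeping the value
    distribution, so the shuffled spectra approximate a 'no rhythm' null.
    """
    rng = np.random.default_rng(seed)
    freqs, power = welch(y, fs=fs, window="hann", nperseg=nperseg)
    null = np.empty((n_perm, power.size))
    for i in range(n_perm):
        _, null[i] = welch(rng.permutation(y), fs=fs, window="hann", nperseg=nperseg)
    # Pointwise q-th percentile of the null spectra at each frequency.
    threshold = np.percentile(null, q, axis=0)
    return freqs, power, threshold

# Synthetic stand-in (NOT the paper's data): 84 days, 8 samples per day,
# with a daily rhythm buried in noise.
rng = np.random.default_rng(1)
samples_per_day = 8
days = np.arange(84 * samples_per_day) / samples_per_day
y = 0.6 + 0.1 * np.sin(2 * np.pi * days) + rng.normal(0.0, 0.1, days.size)

# Segments of 28 days give a frequency resolution of 1/28 cycles per day.
freqs, power, threshold = welch_with_permutation_threshold(
    y, fs=samples_per_day, nperseg=28 * samples_per_day
)
daily_bin = int(np.argmin(np.abs(freqs - 1.0)))
print(f"daily peak exceeds threshold: {power[daily_bin] > threshold[daily_bin]}")
```

With frequencies expressed in cycles per day, a daily rhythm appears at 1.0 and a weekly rhythm at 1/7; the segment length determines whether the weekly bin can be resolved at all, so it should span several weeks.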