Time Series Foundation Models for Energy Load Forecasting on Consumer Hardware: A Multi-Dimensional Zero-Shot Benchmark

Luigi Simeone

TL;DR

This study asks whether Time Series Foundation Models (TSFMs) can reliably forecast electricity demand in real-world, resource-constrained settings. By benchmarking four TSFMs against Prophet and statistical baselines on ERCOT data with CPU-only inference, the authors assess context-length sensitivity, calibration, robustness, and prescriptive utility across 2,352 forecasts. Three findings stand out: TSFMs achieve MASE of roughly $0.31$–$0.33$ at longer contexts; Prophet fails at short contexts because the fitting window is too short to estimate its seasonal components; and Chronos-2 provides well-calibrated uncertainty suitable for risk-aware operations. The work also demonstrates operational value through prescriptive analytics (e.g., reserve-margin reductions of up to $63.8\%$ while maintaining $99.9\%$ reliability), offers concrete guidance for model selection and deployment, and releases an open benchmark framework for reproducibility.

Abstract

Time Series Foundation Models (TSFMs) have introduced zero-shot prediction capabilities that bypass the need for task-specific training. Whether these capabilities translate to mission-critical applications such as electricity demand forecasting--where accuracy, calibration, and robustness directly affect grid operations--remains an open question. We present a multi-dimensional benchmark evaluating four TSFMs (Chronos-Bolt, Chronos-2, Moirai-2, and TinyTimeMixer) alongside Prophet as an industry-standard baseline and two statistical references (SARIMA and Seasonal Naive), using ERCOT hourly load data from 2020 to 2024. All experiments run on consumer-grade hardware (AMD Ryzen 7, 16GB RAM, no GPU). The evaluation spans four axes: (1) context length sensitivity from 24 to 2048 hours, (2) probabilistic forecast calibration, (3) robustness under distribution shifts including COVID-19 lockdowns and Winter Storm Uri, and (4) prescriptive analytics for operational decision support. The top-performing foundation models achieve MASE values near 0.31 at long context lengths (C = 2048h, day-ahead horizon), a 47% reduction over the Seasonal Naive baseline. The inclusion of Prophet exposes a structural advantage of pre-trained models: Prophet fails when the fitting window is shorter than its seasonality period (MASE > 74 at 24-hour context), while TSFMs maintain stable accuracy even with minimal context because they recognise temporal patterns learned during pre-training rather than estimating them from scratch. Calibration varies substantially across models--Chronos-2 produces well-calibrated prediction intervals (95% empirical coverage at 90% nominal level) while both Moirai-2 and Prophet exhibit overconfidence (~70% coverage). We provide practical model selection guidelines and release the complete benchmark framework for reproducibility.
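The headline metric above, MASE, can be illustrated with a minimal sketch. It assumes the common definition in which the forecast's mean absolute error is divided by the in-sample mean absolute error of a seasonal naive forecast; the paper's exact scaling window is not stated here, so the function name `mase`, the `history` argument, and the 24-hour seasonal period are illustrative assumptions rather than the benchmark's implementation.

```python
import numpy as np

def mase(y_true, y_pred, history, season=24):
    """Mean Absolute Scaled Error (illustrative sketch).

    Assumption: the scale is the in-sample MAE of a seasonal naive
    forecast with period `season` (24 for a daily cycle in hourly data),
    computed on `history`, which must be longer than one season.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    history = np.asarray(history, dtype=float)
    if history.size <= season:
        raise ValueError("history must be longer than the seasonal period")
    # Error of predicting each point with the value one season earlier.
    scale = np.mean(np.abs(history[season:] - history[:-season]))
    if scale == 0:
        raise ValueError("seasonal naive error is zero; MASE is undefined")
    return float(np.mean(np.abs(y_true - y_pred)) / scale)

# Toy numbers, not ERCOT data: a linear ramp has a seasonal naive error of 24.
history = np.arange(48)
score = mase([10.0, 20.0], [22.0, 8.0], history)
print(score)
```

Under this definition, a value below 1.0 means the forecast's error is smaller than the seasonal naive error on the scaling window, which is how the MASE = 1.0 reference line in Figure 1 is read.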

Paper Structure (28 sections, 3 equations, 13 figures, 5 tables)

Figures (13)

  • Figure 1: MASE versus context length for all models at both forecast horizons. The dashed horizontal line marks MASE = 1.0 (Seasonal Naive equivalence). Foundation models (solid lines) maintain MASE below 1.0 even at 24-hour context; Prophet (cyan) and SARIMA (orange) require $\geq$168 hours to reach comparable accuracy.
  • Figure 2: MASE heatmap by model, context length, and test period for $H = 24$. Darker green indicates lower (better) MASE. Prophet and SARIMA produce extreme errors (red cells) at short context lengths, while foundation models remain consistently in the green range.
  • Figure 3: Model comparison at context length $C = 512$ h, $H = 24$ h. Error bars denote 95% confidence intervals across test windows. At this context length, all models beat the Seasonal Naive baseline, with Chronos-Bolt, Chronos-2, and Moirai-2 forming a statistically indistinguishable cluster.
  • Figure 4: Left: reliability diagram showing empirical versus nominal coverage. The diagonal represents perfect calibration. Chronos-2 closely tracks the diagonal; Moirai-2 and Prophet fall below it (overconfidence); TTM sits at 100% coverage across all levels (uninformative intervals). Right: normalised prediction interval width by model and confidence level. TTM intervals are 3--4$\times$ wider than Chronos intervals at the same nominal level.
  • Figure 5: Example probabilistic forecasts for a 24-hour period. Shaded bands represent 50%, 80%, and 90% prediction intervals. Chronos-Bolt and Chronos-2 produce tight, accurate intervals. TTM intervals are wide enough to be uninformative. Prophet intervals are relatively tight but miss the actual demand trajectory, consistent with its overconfident calibration.
  • ...and 8 more figures
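The two quantities plotted in Figure 4, empirical coverage and normalised interval width, can be sketched as follows. This is a hedged illustration, not the benchmark's code: the function names are hypothetical, and dividing the interval width by the mean absolute actual load is an assumed normalisation, since the captions do not say which one the paper uses.

```python
import numpy as np

def empirical_coverage(y_true, lower, upper):
    """Fraction of actual values that fall inside [lower, upper]."""
    y_true = np.asarray(y_true, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    inside = (y_true >= lower) & (y_true <= upper)
    return float(inside.mean())

def normalised_width(y_true, lower, upper):
    """Mean interval width divided by mean absolute actual value.

    Assumption: this normalisation is chosen for illustration only.
    """
    y_true = np.asarray(y_true, dtype=float)
    width = np.asarray(upper, dtype=float) - np.asarray(lower, dtype=float)
    return float(width.mean() / np.abs(y_true).mean())

# Toy values, not ERCOT data: two of four actuals fall inside the interval.
actual = [1.0, 2.0, 3.0, 4.0]
lo = [0.0, 0.0, 0.0, 0.0]
hi = [2.0, 2.0, 2.0, 2.0]
coverage = empirical_coverage(actual, lo, hi)
width = normalised_width(actual, lo, hi)
print(coverage, width)
```

Comparing `coverage` with the nominal level gives one point on a reliability diagram: coverage below nominal indicates overconfident intervals, while coverage near 100% at every level, paired with a large `width`, indicates intervals too wide to be informative.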