Time Series Foundation Models for Energy Load Forecasting on Consumer Hardware: A Multi-Dimensional Zero-Shot Benchmark
Luigi Simeone
TL;DR
This study asks whether Time Series Foundation Models (TSFMs) can reliably forecast electricity demand in real-world, resource-constrained settings. It benchmarks four TSFMs against Prophet and two statistical baselines on ERCOT hourly load data with CPU-only inference, and evaluates context-length sensitivity, calibration, robustness, and prescriptive utility across 2,352 forecasts. TSFMs reach MASE values of about 0.31–0.33 at longer contexts, Prophet fails at short contexts because the fitting window is too short to estimate its seasonal components, and Chronos-2 provides well-calibrated uncertainty suitable for risk-aware operations. The work also demonstrates operational value through prescriptive analytics (for example, reserve-margin reductions of up to 63.8% while maintaining 99.9% reliability), offers concrete guidance for model selection and deployment, and releases an open benchmark framework for reproducibility.
Abstract
Time Series Foundation Models (TSFMs) have introduced zero-shot prediction capabilities that bypass the need for task-specific training. Whether these capabilities translate to mission-critical applications such as electricity demand forecasting, where accuracy, calibration, and robustness directly affect grid operations, remains an open question. We present a multi-dimensional benchmark evaluating four TSFMs (Chronos-Bolt, Chronos-2, Moirai-2, and TinyTimeMixer) alongside Prophet as an industry-standard baseline and two statistical references (SARIMA and Seasonal Naive), using ERCOT hourly load data from 2020 to 2024. All experiments run on consumer-grade hardware (AMD Ryzen 7, 16 GB RAM, no GPU). The evaluation spans four axes: (1) context length sensitivity from 24 to 2048 hours, (2) probabilistic forecast calibration, (3) robustness under distribution shifts including COVID-19 lockdowns and Winter Storm Uri, and (4) prescriptive analytics for operational decision support. The top-performing foundation models achieve MASE values near 0.31 at long context lengths (C = 2048h, day-ahead horizon), a 47% reduction relative to the Seasonal Naive baseline. The inclusion of Prophet exposes a structural advantage of pre-trained models: Prophet fails when the fitting window is shorter than its seasonality period (MASE > 74 at 24-hour context), while TSFMs maintain stable accuracy even with minimal context because they recognise temporal patterns learned during pre-training rather than estimating them from scratch. Calibration varies substantially across models: Chronos-2 produces well-calibrated prediction intervals (95% empirical coverage at the 90% nominal level), while both Moirai-2 and Prophet exhibit overconfidence (~70% coverage). We provide practical model selection guidelines and release the complete benchmark framework for reproducibility.
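The two headline metrics, MASE and empirical interval coverage, can be illustrated with a minimal Python sketch. It is not taken from the released benchmark framework: the function names `mase` and `empirical_coverage`, the `history` argument, and the 24-hour seasonal period used for scaling are assumptions made for illustration, and the exact scaling series used in the benchmark is not specified in this section.

```python
import numpy as np


def mase(y_true, y_pred, history, season=24):
    """Mean absolute scaled error (illustrative sketch).

    The forecast MAE is divided by the MAE of a seasonal-naive forecast
    computed on `history`. The choice of `history` and of season=24 are
    assumptions, not the benchmark's documented configuration.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    history = np.asarray(history, dtype=float)
    if history.size <= season:
        raise ValueError("history must be longer than the seasonal period")
    # In-sample error of predicting each value with the one a season earlier.
    scale = np.mean(np.abs(history[season:] - history[:-season]))
    if scale == 0:
        raise ValueError("seasonal-naive scale is zero; MASE is undefined")
    return float(np.mean(np.abs(y_true - y_pred)) / scale)


def empirical_coverage(y_true, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    y_true = np.asarray(y_true, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    inside = (y_true >= lower) & (y_true <= upper)
    return float(np.mean(inside))
```

Under this definition, an interval forecast at the 90% nominal level is overconfident when `empirical_coverage` falls well below 0.90 (as with the ~70% reported for Moirai-2 and Prophet) and conservative when it exceeds 0.90.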
