BuildingsBench: A Large-Scale Dataset of 900K Buildings and Benchmark for Short-Term Load Forecasting

Patrick Emami; Abhijeet Sahu; Peter Graf

TL;DR

BuildingsBench addresses the lack of open, large-scale, diverse data for short-term load forecasting (STLF) with Buildings-900K, a simulated dataset of 900K buildings, and an evaluation suite of over 1,900 real buildings. It studies zero-shot generalization and transfer learning with time-series transformers and finds that synthetically pretrained models generalize to real commercial buildings, and that fine-tuning on limited real data improves performance for a majority of target buildings. Zero-shot performance on commercial buildings follows a power law with diminishing returns in dataset size, while residential forecasting shows a notable sim-to-real gap. The datasets and code are open, providing a practical platform for large-scale pretraining and generalizable STLF, with implications for grid planning and building energy management.

Abstract

Short-term forecasting of residential and commercial building energy consumption is widely used in power systems and continues to grow in importance. Data-driven short-term load forecasting (STLF), although promising, has suffered from a lack of open, large-scale datasets with high building diversity. This has hindered exploring the pretrain-then-fine-tune paradigm for STLF. To help address this, we present BuildingsBench, which consists of: 1) Buildings-900K, a large-scale dataset of 900K simulated buildings representing the U.S. building stock; and 2) an evaluation platform with over 1,900 real residential and commercial buildings from 7 open datasets. BuildingsBench benchmarks two under-explored tasks: zero-shot STLF, where a pretrained model is evaluated on unseen buildings without fine-tuning, and transfer learning, where a pretrained model is fine-tuned on a target building. The main finding of our benchmark analysis is that synthetically pretrained models generalize surprisingly well to real commercial buildings. An exploration of the effect of increasing dataset size and diversity on zero-shot commercial building performance reveals a power-law with diminishing returns. We also show that fine-tuning pretrained models on real commercial and residential buildings improves performance for a majority of target buildings. We hope that BuildingsBench encourages and facilitates future research on generalizable STLF. All datasets and code can be accessed from https://github.com/NREL/BuildingsBench.
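The power-law trend with diminishing returns mentioned in the abstract can be illustrated with a small fitting sketch. It assumes the relationship has the form error ≈ a · N^(-b), where N is the number of pretraining buildings, and fits it by ordinary least squares in log-log space. The function name `fit_power_law` and the numbers below are hypothetical and synthetic; they are not the paper's fitting procedure or results.

```python
import math

def fit_power_law(sizes, errors):
    """Fit errors ~ a * sizes**(-b) by least squares in log-log space.

    Returns (a, b). Assumes all sizes and errors are strictly positive.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # log(error) = log(a) - b * log(N), so a = exp(intercept) and b = -slope.
    return math.exp(intercept), -slope

# Synthetic, illustrative values only (NOT results from the paper).
sizes = [1e3, 1e4, 1e5, 1e6]
errors = [50.0 * n ** -0.1 for n in sizes]
a, b = fit_power_law(sizes, errors)

# Diminishing returns: each 10x increase in data removes less absolute error.
gains = [errors[i] - errors[i + 1] for i in range(len(errors) - 1)]
```

With a positive exponent b, every tenfold increase in dataset size multiplies the error by the same factor, so the absolute improvement shrinks at each step, which is what "diminishing returns" means here.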

Paper Structure (53 sections, 10 equations, 17 figures, 7 tables, 1 algorithm)

Introduction
Short-Term Load Forecasting
The Buildings-900K Dataset
Feature Extraction
BuildingsBench Evaluation Platform
Real Building Datasets
Evaluation Metrics
Baselines
Benchmark Results
Zero-Shot STLF
Transfer Learning
Empirical Scaling Laws
Discussion
Findings
Residential STLF Challenges
...and 38 more sections

Figures (17)

Figure 1: BuildingsBench gallery. Top row: commercial buildings (farthest left is simulated commercial Buildings-900K data). Second and third rows are residential buildings (farthest left in the second row is simulated residential Buildings-900K data).
Figure 2: Forecast uncertainty. Ground truth time series are truncated to previous 24 hours for visibility. Light blue lines are 10 samples from the predicted distribution. a-b) Successful commercial building forecasts. c-d) Failed residential building forecasts.
Figure 3: Empirical scaling laws for zero-shot generalization on commercial buildings. Intervals are 95% stratified bootstrap CIs for the median across all buildings. a-b) Dataset scale vs. zero-shot performance. The trends appear to be power-laws with diminishing returns. c-d) Model size vs. zero-shot performance. Residential results are in App. \ref{sec:app:residential}.
Figure 4: Model size vs. transfer learning. Pretrained vs. pretrained + fine-tuned (FT) performance for S, M, and L transformers. Intervals are 95% stratified bootstrap CIs of the median. The Transformer-M models show the most improvement after fine-tuning. The fine-tuned Transformer-M performance is comparable to the largest models. Improvement due to fine-tuning is less pronounced for the Transformer-L models, suggesting their zero-shot performance is saturated on this task.
Figure 5: Gaussian approximation of the inverse Box-Cox. Visualization of the Gaussian distribution $\mathcal{N}(f^{-1}(\hat{\mu}_{i,j}), \Tilde{\sigma}_{i,j})$ (orange) in the un-scaled space of loads (in kWh) for computing the RPS when using Box-Cox scaling. For reasonably small standard deviations $\Tilde{\sigma}$ in the un-scaled space, this Gaussian is a reasonable approximation of the power-normal distribution given by the inverse Box-Cox (blue). c) When the model is highly uncertain so that $\Tilde{\sigma}$ is large, the power-normal is extremely right-skewed. In this case, the Gaussian approximation is inaccurate.
...and 12 more figures
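The captions for Figures 3 and 4 report 95% stratified bootstrap confidence intervals of the median. The following is a minimal sketch of one way such an interval could be computed, using the percentile method and resampling with replacement inside each stratum. It assumes the strata are groups of per-building error values (for example, one group per source dataset); the function name, the grouping, and the numbers are hypothetical and are not taken from the paper's code.

```python
import random
import statistics

def stratified_bootstrap_median_ci(strata, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the median of pooled per-building errors.

    `strata` maps a stratum name to a list of values. Each bootstrap
    replicate resamples every stratum with replacement at its original
    size, pools the resamples, and records the pooled median.
    """
    rng = random.Random(seed)
    medians = []
    for _ in range(n_boot):
        pooled = []
        for values in strata.values():
            pooled.extend(rng.choices(values, k=len(values)))
        medians.append(statistics.median(pooled))
    medians.sort()
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return medians[lo_idx], medians[hi_idx]

# Synthetic, illustrative per-building errors (NOT values from the paper).
strata = {
    "dataset_a": [12.1, 14.3, 13.8, 15.0, 11.7, 16.2],
    "dataset_b": [20.5, 18.9, 22.4, 19.7],
}
lo, hi = stratified_bootstrap_median_ci(strata)
```

Resampling within each stratum keeps every group's share of the pooled sample fixed across replicates, so a small group cannot vanish from a replicate by chance.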
