Synthetic Homes: An Accessible Multimodal Pipeline for Producing Residential Building Data with Generative AI
Jackson Eshbaugh, Chetan Tiwari, Jorge Silveyra
TL;DR
Energy demand modeling requires large datasets that are often costly or access-restricted. This work presents a modular multimodal pipeline that generates synthetic residential data by combining open county data, image-derived descriptions, and generative AI with EnergyPlus simulations, at a cost of less than $1 per home for GPT usage. The workflow has four stages: scrape, process images, generate GeoJSON with inspection notes, and run EnergyPlus. It includes an occlusion-based evaluation to curb hallucinations, and it demonstrates realism by aligning synthetic distributions with ResStock in climate zone 4A. The contributions are a generalizable, low-cost data generation framework and an empirical validation of its realism, which together enable accessible, scalable urban energy research and potential retrofit optimization tools. The approach lowers data access barriers while supporting data-driven policy analysis and model development for urban building energy modeling (UBEM) and related fields.
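As a rough illustration of how the four stages could be chained, the following is a minimal sketch and not the authors' implementation. Every name in it (HomeRecord, scrape, process_images, generate_geojson, simulate, run_pipeline) is hypothetical, and each stage is a stub that stores a placeholder where the real pipeline would scrape county data, call a generative model, or invoke EnergyPlus.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass
class HomeRecord:
    """Hypothetical container handed from one stage to the next."""
    home_id: str
    data: Dict[str, object] = field(default_factory=dict)
    completed: List[str] = field(default_factory=list)


Stage = Callable[[HomeRecord], HomeRecord]


def scrape(rec: HomeRecord) -> HomeRecord:
    # Stage 1 (stub): would fetch open county records and image files.
    rec.data["county_record"] = {}   # placeholder, no real data
    rec.data["image_paths"] = []     # placeholder, no real data
    return rec


def process_images(rec: HomeRecord) -> HomeRecord:
    # Stage 2 (stub): would turn each image into a text description.
    rec.data["image_descriptions"] = [
        f"description of {path}" for path in rec.data["image_paths"]
    ]
    return rec


def generate_geojson(rec: HomeRecord) -> HomeRecord:
    # Stage 3 (stub): would ask a generative model for GeoJSON plus
    # inspection notes. A Feature with null geometry is valid GeoJSON.
    rec.data["geojson"] = {
        "type": "Feature",
        "geometry": None,
        "properties": {"home_id": rec.home_id},
    }
    rec.data["inspection_notes"] = ""  # placeholder
    return rec


def simulate(rec: HomeRecord) -> HomeRecord:
    # Stage 4 (stub): would build a building model and run EnergyPlus.
    rec.data["simulation_output"] = None  # placeholder
    return rec


PIPELINE: Sequence[Stage] = (scrape, process_images, generate_geojson, simulate)


def run_pipeline(home_id: str, stages: Sequence[Stage] = PIPELINE) -> HomeRecord:
    """Run each stage in order and record which stages completed."""
    rec = HomeRecord(home_id=home_id)
    for stage in stages:
        rec = stage(rec)
        rec.completed.append(stage.__name__)
    return rec
```

Keeping each stage as a plain function over a shared record is one way to reflect the modularity the summary describes: a stage can be swapped or rerun without touching the others.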
Abstract
Computational models have emerged as powerful tools for energy modeling research, offering scalability and quantitative results. However, these models require large amounts of data, some of which is inaccessible, expensive, or raises privacy concerns. We introduce a modular multimodal framework that produces this data from publicly accessible images and residential information using generative Artificial Intelligence (AI). We also provide a pipeline that demonstrates this framework, and we evaluate its generative AI components. Our experiments show that our framework's use of AI avoids common issues with generative models and produces realistic multimodal data. By reducing dependence on costly or restricted data sources, we pave a path toward more accessible research in Machine Learning (ML) and other data-driven disciplines.
