Synthetic Homes: An Accessible Multimodal Pipeline for Producing Residential Building Data with Generative AI

Jackson Eshbaugh, Chetan Tiwari, Jorge Silveyra

TL;DR

Energy demand modeling requires large datasets that are often costly or restricted. This work presents a modular multimodal pipeline that generates synthetic residential data by combining open county data, image-derived descriptions, and generative AI with EnergyPlus simulations, at a cost of less than $1 per home in GPT usage. The workflow has four stages: scrape, process images, generate GeoJSON with inspection notes, and run EnergyPlus. It includes an occlusion-based evaluation to curb hallucinations, and it demonstrates realism by aligning synthetic distributions with ResStock in climate zone 4A. The contributions are a generalizable, low-cost data generation framework and an empirical validation of its realism, which support accessible, scalable urban energy research and potential retrofit optimization tools. The approach lowers data access barriers while supporting data-driven policy analysis and model development for urban building energy modeling (UBEM) and related fields.
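The four-stage structure described above can be pictured as a chain of functions that each enrich a per-home record. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the `HomeRecord` type, all function names, and every field value are hypothetical placeholders, and no scraping, GPT call, or EnergyPlus run is actually performed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class HomeRecord:
    """Hypothetical container for one home as it moves through the stages."""
    parcel_id: str
    county_fields: Dict[str, object] = field(default_factory=dict)
    image_description: str = ""
    geojson: Dict[str, object] = field(default_factory=dict)
    inspection_notes: str = ""
    simulation: Dict[str, object] = field(default_factory=dict)


Stage = Callable[[HomeRecord], HomeRecord]


def scrape(home: HomeRecord) -> HomeRecord:
    # Stage 1 (stub): a real version would pull open county data.
    # The values below are made-up placeholders, not data from the paper.
    home.county_fields = {"year_built": 1950, "floor_area_sqft": 1500}
    return home


def process_images(home: HomeRecord) -> HomeRecord:
    # Stage 2 (stub): a real version would derive a description from images.
    home.image_description = "placeholder description of the exterior"
    return home


def generate_geojson(home: HomeRecord) -> HomeRecord:
    # Stage 3 (stub): a real version would ask a generative model for
    # GeoJSON plus inspection notes. Here we only assemble a bare Feature.
    home.geojson = {
        "type": "Feature",
        "geometry": None,  # GeoJSON allows a null geometry
        "properties": {
            **home.county_fields,
            "description": home.image_description,
        },
    }
    home.inspection_notes = "placeholder inspection note"
    return home


def run_energyplus(home: HomeRecord) -> HomeRecord:
    # Stage 4 (stub): a real version would build a model and run EnergyPlus.
    home.simulation = {"status": "not run (stub)"}
    return home


# The stage order follows the workflow named in the summary.
PIPELINE: List[Stage] = [scrape, process_images, generate_geojson, run_energyplus]


def run_pipeline(parcel_id: str, stages: List[Stage] = PIPELINE) -> HomeRecord:
    """Pass a fresh record through each stage in order."""
    home = HomeRecord(parcel_id=parcel_id)
    for stage in stages:
        home = stage(home)
    return home


if __name__ == "__main__":
    result = run_pipeline("demo-001")
    print(result.geojson["type"], result.simulation["status"])
```

Keeping each stage as an independent function with the same signature mirrors the modularity the summary emphasizes: in a sketch like this, one stage (for example, the image step) could be swapped or evaluated in isolation without touching the others.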

Abstract

Computational models have emerged as powerful tools for energy modeling research, offering scalability and quantitative results. However, these models require large amounts of data, some of which can be inaccessible, expensive, or privacy-sensitive. We introduce a modular multimodal framework that produces this data from publicly accessible images and residential information using generative Artificial Intelligence (AI). We also provide a pipeline that demonstrates this framework, and we evaluate its generative AI components. Our experiments show that the framework's use of AI avoids common issues with generative models and produces realistic multimodal data. By reducing dependence on costly or restricted data sources, we pave a path towards more accessible research in Machine Learning (ML) and other data-driven disciplines.

Paper Structure

This paper contains 12 sections, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Overview of the pipeline.
  • Figure 2: An example street view photograph and floor plan from the Northampton County database.
  • Figure 3: and forward occlusion per image results.
  • Figure 4: Occlusion statistics for and .
  • Figure 5: GPT and occlusion results.
  • ...and 7 more figures