Synthetic History: Evaluating Visual Representations of the Past in Diffusion Models
Maria-Teresa De Rosa Palmini, Eva Cetinic
TL;DR
This work examines how text-to-image diffusion models depict historical contexts, an underexplored dimension that is prone to distortion. It introduces HistVis, a benchmark of 30,000 synthetic images generated by three diffusion models across 100 prompts spanning 20 activities and 10 historical periods, together with a reproducible evaluation protocol. The study analyzes three dimensions (Implicit Stylistic Associations, Historical Consistency, and Demographic Representation) and finds systematic biases: strong period-specific visual defaults, frequent anachronisms, and demographic patterns that diverge from historically plausible baselines. The open benchmark and accompanying analyses offer a foundation for improving historical fidelity and mitigating bias in diffusion-based image generation, with implications for education, cultural heritage, and public understanding of the past.
Abstract
As Text-to-Image (TTI) diffusion models become increasingly influential in content creation, growing attention is being directed toward their societal and cultural implications. While prior research has primarily examined demographic and cultural biases, the ability of these models to accurately represent historical contexts remains largely underexplored. To address this gap, we introduce a benchmark for evaluating how TTI models depict historical contexts. The benchmark combines HistVis, a dataset of 30,000 synthetic images generated by three state-of-the-art diffusion models from carefully designed prompts covering universal human activities across multiple historical periods, with a reproducible evaluation protocol. We evaluate generated imagery across three key aspects: (1) Implicit Stylistic Associations: examining default visual styles associated with specific eras; (2) Historical Consistency: identifying anachronisms such as modern artifacts in pre-modern contexts; and (3) Demographic Representation: comparing generated racial and gender distributions against historically plausible baselines. Our findings reveal systematic inaccuracies in historically themed generated imagery: TTI models frequently stereotype past eras by incorporating unstated stylistic cues, introduce anachronisms, and fail to reflect plausible demographic patterns. By providing a reproducible benchmark for historical representation in generated imagery, this work offers an initial step toward building more historically accurate TTI models.
