Scene Graph-Guided Generative AI Framework for Synthesizing and Evaluating Industrial Hazard Scenarios
Sanjay Acharjee, Abir Khan Ratul, Diego Patino, Md Nazmus Sakib
TL;DR
The paper addresses the scarcity of visual data for workplace hazard understanding by generating photorealistic hazard scenes from OSHA accident narratives with a scene graph-guided diffusion pipeline. The framework uses GPT-4o to extract hazard rationales, semantic clustering to derive hazard archetypes, LLaMA-based scene graph modeling, and a graph-aligned VQA evaluation to assess fidelity. The key contributions are the scene graph intermediate representation, the VQA Graph Score, which outperforms CLIP and BLIP metrics at measuring semantic fidelity, and a scalable dataset with generation and evaluation tools. The results demonstrate that graph-guided synthetic hazard data can support proactive safety modeling, enabling better generalization, simulation-based training, and cross-domain data augmentation.
Abstract
Training vision models to detect workplace hazards accurately requires realistic images of unsafe conditions that could lead to accidents. However, acquiring such datasets is difficult because capturing accident-triggering scenarios as they occur is nearly impossible. To overcome this limitation, this study presents a novel scene graph-guided generative AI framework that synthesizes photorealistic images of hazardous scenarios grounded in historical Occupational Safety and Health Administration (OSHA) accident reports. OSHA narratives are analyzed using GPT-4o to extract structured hazard reasoning, which is converted into object-level scene graphs capturing the spatial and contextual relationships essential for understanding risk. These graphs guide a text-to-image diffusion model to generate compositionally accurate hazard scenes. To evaluate the realism and semantic fidelity of the generated data, a visual question answering (VQA) framework is introduced. Across four state-of-the-art generative models, the proposed VQA Graph Score outperforms CLIP and BLIP metrics under entropy-based validation, confirming its higher discriminative sensitivity.
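As a rough illustration of the pipeline described above, the following Python sketch shows how an object-level scene graph might be turned into a generation prompt and into yes/no questions for a graph-aligned VQA check. Everything here is an assumption made for illustration: the `SceneGraph` and `Relation` types, the question templates, and in particular the scoring rule (the fraction of graph-derived questions answered "yes") are hypothetical and are not taken from the paper, which does not define the VQA Graph Score in this section. The `answer_fn` argument stands in for a real VQA model applied to a generated image.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Relation:
    """One directed edge of the scene graph (hypothetical schema)."""
    subject: str
    predicate: str
    obj: str


@dataclass
class SceneGraph:
    """Objects are nodes; relations are spatial or contextual edges."""
    objects: List[str]
    relations: List[Relation]


def graph_to_prompt(graph: SceneGraph) -> str:
    """Flatten the graph into a text prompt for a text-to-image model."""
    clauses = [f"{r.subject} {r.predicate} {r.obj}" for r in graph.relations]
    return "A photorealistic industrial scene: " + "; ".join(clauses) + "."


def graph_to_questions(graph: SceneGraph) -> List[str]:
    """Derive one yes/no question per node and per edge (assumed templates)."""
    questions = [f"Is there a {o} in the image?" for o in graph.objects]
    questions += [
        f"Is the {r.subject} {r.predicate} the {r.obj}?" for r in graph.relations
    ]
    return questions


def vqa_graph_score(graph: SceneGraph, answer_fn: Callable[[str], bool]) -> float:
    """Assumed score: share of graph-derived questions answered 'yes'."""
    questions = graph_to_questions(graph)
    if not questions:
        return 0.0
    return sum(1 for q in questions if answer_fn(q)) / len(questions)


# Toy hazard graph; the objects and relations are invented for this example.
graph = SceneGraph(
    objects=["worker", "ladder", "power line"],
    relations=[
        Relation("worker", "standing on", "ladder"),
        Relation("ladder", "touching", "power line"),
    ],
)


def stub_vqa(question: str) -> bool:
    """Stand-in for a VQA model: pretends the 'touching' relation is missing."""
    return "touching" not in question


prompt = graph_to_prompt(graph)
score = vqa_graph_score(graph, stub_vqa)
print(prompt)
print(f"VQA graph score: {score:.2f}")
```

In this toy run the stub confirms the three objects and one of the two relations, so four of five questions pass. A score built this way penalizes a generated image for each missing object or relation separately, which is one plausible reason a graph-aligned metric could be more sensitive to compositional errors than a single image-text similarity value.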
