ContRail: A Framework for Realistic Railway Image Synthesis using ControlNet
Andrei-Robert Alexandrescu, Razvan-Gabriel Petec, Alexandru Manole, Laura-Silvia Diosan
TL;DR
ContRail tackles data scarcity in railway scene understanding by generating realistic ego-view railway images with a ControlNet-based, multi-modal conditioning framework. It combines segmentation masks and Canny edges as input conditions and uses BLIP-2-derived prompts to guide Stable Diffusion, so that a single model can handle multiple conditioning signals while still producing realistic images. Quantitative results show that synthetic data can improve rail segmentation performance, with the best gains when synthetic images accompany real data on the same ground-truth masks, and a low Fréchet Inception Distance (FID) indicates that the generated images are realistic. The approach has practical value because it enables targeted data augmentation for domain-specific vision tasks, potentially improving robustness in rare or hard-to-capture railway scenarios.
Abstract
Deep Learning has become a ubiquitous paradigm due to its extraordinary effectiveness and applicability in numerous domains. However, the approach suffers from the large amount of data that such models require to reach their potential. Image Synthesis, an ever-growing sub-field of Artificial Intelligence, aims to address this limitation by designing intelligent models capable of creating original and realistic images, an endeavour that could drastically reduce the need for real data. The Stable Diffusion generation paradigm has recently propelled state-of-the-art approaches to exceed all previous benchmarks. In this work, we propose the ContRail framework, based on ControlNet, a novel Stable Diffusion model, which we empower through a multi-modal conditioning method. We experiment with the task of synthetic railway image generation, where we improve performance on rail-specific tasks, such as rail semantic segmentation, by enriching the dataset with realistic synthetic images.
