ContRail: A Framework for Realistic Railway Image Synthesis using ControlNet

Andrei-Robert Alexandrescu, Razvan-Gabriel Petec, Alexandru Manole, Laura-Silvia Diosan

TL;DR

ContRail tackles data scarcity in railway scene understanding by generating realistic ego-view railway images with a ControlNet-based, multi-modal conditioning framework. It combines segmentation masks and Canny edges as input conditions and uses BLIP-2-derived prompts to guide Stable Diffusion, so that a single model can handle multiple conditioning signals while still producing realistic images. Quantitative results show that synthetic data can improve rail segmentation performance, with the best gains when synthetic images generated from the same ground-truth masks are added to the real data, and a low Fréchet Inception Distance indicates realism. The approach enables targeted data augmentation for domain-specific vision tasks, potentially improving robustness in rare or hard-to-capture railway scenarios.

Abstract

Deep Learning has become a ubiquitous paradigm due to its extraordinary effectiveness and applicability in numerous domains. However, the approach suffers from the large amount of data these models require to reach their potential. An ever-growing sub-field of Artificial Intelligence, Image Synthesis, aims to address this limitation through the design of intelligent models capable of creating original and realistic images, an endeavour which could drastically reduce the need for real data. The Stable Diffusion generation paradigm recently propelled state-of-the-art approaches to exceed all previous benchmarks. In this work, we propose the ContRail framework based on the novel Stable Diffusion model ControlNet, which we empower through a multi-modal conditioning method. We experiment with the task of synthetic railway image generation, where we improve performance in rail-specific tasks, such as rail semantic segmentation, by enriching the dataset with realistic synthetic images.

Paper Structure

This paper contains 19 sections, 2 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of the proposed railway scene generation pipeline using ControlNet [zhang2023adding] and BLIP-2 [li2023blip]. RailSem19 provides real images and their semantic segmentation masks. Starting from the real image, we obtain the edges using the Canny algorithm. The segmentation mask is combined with the edge image as shown in Fig. 2. The original image is also used as input to the BLIP-2 model, which produces a textual description used for prompting. The text and the combined conditional representation are fed into the ControlNet architecture, resulting in a realistic synthetic image.
  • Figure 2: Conditional representations used as input.
  • Figure 3: First row: Segmentation masks; Second row: results obtained using combined masks Cmb111 without training prompts.
  • Figure 4: First row: Segmentation masks; Second row: results obtained using original masks with BLIP-2 prompts.
  • Figure 5: First row: Segmentation masks; Second row: results obtained using original masks with BLIP-2+Negative prompts.
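
The conditioning step described in the Figure 1 caption, where a segmentation mask is merged with a Canny edge image into one conditional representation, can be illustrated with a small sketch. This is not the authors' implementation: the function name `combine_conditions` and the overlay rule (painting edge pixels white on top of the colour-coded mask) are assumptions made for illustration, and the exact scheme behind variants such as Cmb111 is not specified in this section. The edge map is taken as an input here; in practice it could come from any Canny implementation.

```python
import numpy as np


def combine_conditions(mask_rgb: np.ndarray, edges: np.ndarray) -> np.ndarray:
    """Overlay a binary edge map on a colour-coded segmentation mask.

    Hypothetical helper: the overlay rule is an assumption, not the
    combination scheme used in the paper.
    """
    if mask_rgb.ndim != 3 or mask_rgb.shape[2] != 3:
        raise ValueError("mask_rgb must have shape (H, W, 3)")
    if edges.shape != mask_rgb.shape[:2]:
        raise ValueError("edges must have shape (H, W) matching the mask")
    combined = mask_rgb.copy()       # leave the input mask untouched
    combined[edges > 0] = 255        # paint edge pixels white on all channels
    return combined


# Toy 4x4 example: the right half belongs to one class (green),
# and a vertical edge runs along the class boundary.
mask = np.zeros((4, 4, 3), dtype=np.uint8)
mask[:, 2:] = (0, 128, 0)
edges = np.zeros((4, 4), dtype=np.uint8)
edges[:, 2] = 255

cond = combine_conditions(mask, edges)
```

The resulting array keeps the class colours everywhere except on edge pixels, so a single image carries both the semantic layout and the fine structure. Other merging rules (for example, channel-wise weighting) would fit the same interface.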