SVS-GAN: Leveraging GANs for Semantic Video Synthesis

Khaled M. Seyam; Julian Wiederer; Markus Braun; Bin Yang

TL;DR

This work formalizes Semantic Video Synthesis (SVS) and introduces SVS-GAN, a dedicated SVS framework that autoregressively generates a video sequence from a semantic-map stream and a reference frame $I_1$. The architecture features a triple-pyramid generator that fuses past-frame appearance with current semantic guidance via SPADE blocks and an occlusion-modulated warping mechanism, together with an encoder–decoder image discriminator and a video discriminator that operate at multiple temporal scales. A novel OASIS-based supervision is integrated to improve semantic alignment, supplemented by perceptual, feature-matching, and flow-based losses to ensure both frame realism and temporal coherence. Empirically, SVS-GAN achieves state-of-the-art performance on Cityscapes Sequence and KITTI-360, running efficiently on a single GPU at $1024\times512$ with around $40\ \mathrm{ms}$ per frame, and demonstrating strong generalization to long sequences and rare object classes. Limitations include dependence on the previous frame for appearance consistency, potential world-consistency issues over long horizons, and challenges in optical-flow estimation around maneuvering edges, suggesting future work on integrating 3D world representations and improved motion modeling.
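The autoregressive generation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `autoregressive_rollout` and the generator signature `generator(prev_frame, semantic_map)` are assumptions made for illustration, and the real generator also estimates optical flow and an occlusion map internally.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")
SemMap = TypeVar("SemMap")


def autoregressive_rollout(
    generator: Callable[[Frame, SemMap], Frame],
    reference_frame: Frame,
    semantic_maps: Sequence[SemMap],
) -> List[Frame]:
    """Roll out a video one frame at a time (hypothetical interface).

    semantic_maps[0] is assumed to belong to the reference frame I_1,
    so generation starts at the second map. Each new frame is
    conditioned on the previously produced frame and the current map.
    """
    frames: List[Frame] = [reference_frame]
    for sem in semantic_maps[1:]:
        prev_frame = frames[-1]  # appearance comes from the last output
        frames.append(generator(prev_frame, sem))
    return frames


# Toy usage: integers stand in for images and semantic maps.
toy_generator = lambda prev, sem: prev + sem
print(autoregressive_rollout(toy_generator, 10, [0, 1, 2, 3]))  # [10, 11, 13, 16]
```

Because every frame is built from the previous output, errors can accumulate over long horizons, which matches the dependence on the previous frame listed among the limitations.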

Abstract

In recent years, there has been a growing interest in Semantic Image Synthesis (SIS) through the use of Generative Adversarial Networks (GANs) and diffusion models. This field has seen innovations such as specialized loss functions tailored to the task, diverging from the more general approaches used in Image-to-Image (I2I) translation. While Semantic Video Synthesis (SVS), the task of generating temporally coherent, realistic image sequences from semantic maps, is newly formalized in this paper, some existing methods have already explored aspects of this field. Most of these approaches rely on generic loss functions designed for video-to-video translation or require additional data to achieve temporal coherence. In this paper, we introduce SVS-GAN, a framework specifically designed for SVS, featuring a custom architecture and loss functions. Our approach includes a triple-pyramid generator that utilizes SPADE blocks. Additionally, we employ a U-Net-based network for the image discriminator, which performs semantic segmentation for the OASIS loss. Through this combination of tailored architecture and objective engineering, our framework aims to bridge the existing gap between SIS and SVS, outperforming current state-of-the-art models on datasets like Cityscapes and KITTI-360.
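To make the idea of a segmentation-based discriminator concrete, here is a rough sketch of an OASIS-style objective in NumPy. It assumes the usual OASIS formulation, in which the discriminator predicts N semantic classes plus one extra "fake" class per pixel. The function names are hypothetical, class balancing and any other regularization are omitted, and the exact way SVS-GAN weights or combines this term is not stated in this summary.

```python
import numpy as np


def log_softmax(logits: np.ndarray, axis: int = 0) -> np.ndarray:
    """Numerically stable log-softmax along the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def pixelwise_ce(logits: np.ndarray, target: np.ndarray) -> float:
    """Mean cross-entropy; logits is (K, H, W), target is (H, W) integer ids."""
    logp = log_softmax(logits, axis=0)
    picked = np.take_along_axis(logp, target[None, ...], axis=0)[0]
    return float(-picked.mean())


def oasis_style_losses(logits_real, logits_fake, label_map):
    """Simplified (N + 1)-class losses; channel index N is the 'fake' class.

    Discriminator: segment real pixels into their true classes and
    label every generated pixel as 'fake'.
    Generator: make generated pixels be segmented as the classes
    requested by the input semantic map.
    """
    fake_class = logits_real.shape[0] - 1
    fake_target = np.full_like(label_map, fake_class)
    d_loss = pixelwise_ce(logits_real, label_map) + pixelwise_ce(logits_fake, fake_target)
    g_loss = pixelwise_ce(logits_fake, label_map)
    return d_loss, g_loss


# Toy usage: 2 semantic classes plus the fake class on a 2x2 image.
labels = np.array([[0, 1], [1, 0]])
uniform = np.zeros((3, 2, 2))
d_loss, g_loss = oasis_style_losses(uniform, uniform, labels)
```

With uniform logits, every pixel contributes `log(3)` to each cross-entropy term, which is a convenient sanity check for the implementation.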

Paper Structure (15 sections, 4 figures, 4 tables)

Introduction
Related Work
Semantic Image Synthesis
Video-to-Video
Method
Generator
Discriminators
Losses
Experiments
Implementation Details
Datasets
Metrics
Experimental Results
Ablation Study
Conclusion

Figures (4)

Figure 1: Illustration of our SVS-GAN’s capabilities in generating high-fidelity videos that closely mimic the details of the initial reference image while following the given sequence of semantic maps, demonstrating the model’s effectiveness in accurately synthesizing the 30th frame within the sequence.
Figure 2: Architectural illustration highlighting the generator $G$ and the discriminators $D_I$ and $D_V$. The generator features an optical flow and occlusion map network illustrated at the top, followed by a triple-pyramid structure comprising a warped image encoder, a semantic map encoder, and an image generation decoder. The decoder employs SPADE blocks to incorporate the semantic map and utilizes $\widehat{OC}_{i-1\rightarrow i}$ to modulate features passed through the skip connections using the occlusion mask fusion module. On the right, the discriminator $D_I$ employs an encoder/decoder setup for image segmentation using the OASIS loss, while above, $D_V$ functions as a video patch discriminator, aiming for temporal consistency.
Figure 3: Comparison of model performance on the Cityscapes Sequence dataset. In the two displayed scenes, our model captures finer detail than the V2V and WC-V2V models, particularly in road surfaces and vehicles.
Figure 4: Our model demonstrates the ability to produce accurate results on long sequence lengths using the KITTI-360 dataset. The figure shows that even with rare classes like the truck, SVS-GAN maintains temporal consistency.
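
The warping and occlusion gating mentioned in the Figure 2 caption can be sketched roughly as follows. This is only an illustration under stated assumptions: the flow convention (each target pixel stores the offset to its source pixel in the previous frame), nearest-neighbour sampling instead of bilinear sampling, and the multiplicative gate `1 - occlusion` are all choices made here for brevity, and the names `warp_nearest` and `gate_by_occlusion` are hypothetical. The paper's occlusion mask fusion module acts on encoder features and may be more elaborate.

```python
import numpy as np


def warp_nearest(prev: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Backward-warp prev (C, H, W) using flow (2, H, W) = (dx, dy).

    For each target pixel (y, x), sample prev at (y + dy, x + dx),
    rounded to the nearest pixel and clipped to the image border.
    """
    _, h, w = prev.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs + flow[0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[1]).astype(int), 0, h - 1)
    return prev[:, src_y, src_x]


def gate_by_occlusion(features: np.ndarray, occlusion: np.ndarray) -> np.ndarray:
    """Suppress features where occlusion (H, W) is near 1 (assumed: 1 = occluded)."""
    occlusion = np.clip(occlusion, 0.0, 1.0)
    return features * (1.0 - occlusion)[None, :, :]


# Toy usage: content shifts one pixel to the right, so the leftmost
# column has no valid source and is marked as occluded.
prev = np.arange(12, dtype=float).reshape(1, 3, 4)
flow = np.zeros((2, 3, 4))
flow[0] = -1.0  # each target pixel samples its left neighbour
occ = np.zeros((3, 4))
occ[:, 0] = 1.0
gated = gate_by_occlusion(warp_nearest(prev, flow), occ)
```

In occluded regions the gated signal carries no information from the previous frame, so the decoder has to rely on the semantic map there, which is the role the SPADE blocks play in the described design.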
