SVS-GAN: Leveraging GANs for Semantic Video Synthesis
Khaled M. Seyam, Julian Wiederer, Markus Braun, Bin Yang
TL;DR
This work formalizes Semantic Video Synthesis (SVS) and introduces SVS-GAN, a dedicated SVS framework that autoregressively generates a video sequence from a semantic-map stream and a reference frame $I_1$. The architecture features a triple-pyramid generator that fuses past-frame appearance with current semantic guidance via SPADE blocks and an occlusion-modulated warping mechanism, together with an encoder–decoder image discriminator and a video discriminator that operate at multiple temporal scales. A novel OASIS-based supervision is integrated to improve semantic alignment, supplemented by perceptual, feature-matching, and flow-based losses to ensure both frame realism and temporal coherence. Empirically, SVS-GAN achieves state-of-the-art performance on Cityscapes Sequence and KITTI-360, running efficiently on a single GPU at $1024\times512$ with around $40\ \mathrm{ms}$ per frame, and demonstrating strong generalization to long sequences and rare object classes. Limitations include dependence on the previous frame for appearance consistency, potential world-consistency issues over long horizons, and challenges in optical-flow estimation around maneuvering edges, suggesting future work on integrating 3D world representations and improved motion modeling.
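To make the occlusion-modulated warping idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the common video-synthesis formulation in which the previous frame is warped by an estimated flow and then blended with newly synthesized pixels through a soft occlusion mask. The function names `warp_nearest` and `blend_with_occlusion`, the nearest-neighbour sampling, and the mask convention (1 = use synthesized pixels) are all illustrative assumptions.

```python
import numpy as np


def warp_nearest(prev_frame, flow):
    """Backward-warp prev_frame (H, W, C) using flow (H, W, 2) storing (dx, dy).

    Nearest-neighbour sampling with border clamping keeps the sketch short;
    a real model would typically use differentiable bilinear sampling.
    """
    h, w = prev_frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return prev_frame[src_y, src_x]


def blend_with_occlusion(prev_frame, flow, occlusion, synthesized):
    """Blend the warped previous frame with freshly synthesized pixels.

    occlusion is (H, W) in [0, 1]. Assumed convention: 1 means the pixel is
    occluded or newly visible, so the synthesized value is used there.
    """
    warped = warp_nearest(prev_frame, flow)
    mask = occlusion[..., None]  # broadcast over the channel axis
    return (1.0 - mask) * warped + mask * synthesized
```

With a zero mask the output is the warped previous frame, and with a mask of ones it is the synthesized frame; intermediate values interpolate between the two.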
Abstract
In recent years, there has been growing interest in Semantic Image Synthesis (SIS) using Generative Adversarial Networks (GANs) and diffusion models. This field has seen innovations such as specialized loss functions tailored to the task, diverging from the more general approaches of Image-to-Image (I2I) translation. While Semantic Video Synthesis (SVS), i.e., the generation of temporally coherent, realistic image sequences from semantic maps, is newly formalized in this paper, some existing methods have already explored aspects of this field. Most of these approaches rely on generic loss functions designed for video-to-video translation or require additional data to achieve temporal coherence. In this paper, we introduce SVS-GAN, a framework specifically designed for SVS, featuring a custom architecture and loss functions. Our approach includes a triple-pyramid generator that utilizes SPADE blocks. Additionally, we employ a U-Net-based network for the image discriminator, which performs semantic segmentation for the OASIS loss. Through this combination of tailored architecture and objective engineering, our framework aims to bridge the existing gap between SIS and SVS, outperforming current state-of-the-art models on datasets such as Cityscapes and KITTI-360.
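The idea of a discriminator that "performs semantic segmentation for the OASIS loss" can be illustrated with a small NumPy sketch. It follows the general OASIS formulation: the discriminator outputs per-pixel logits over $N$ semantic classes plus one extra "fake" class. This is a simplified illustration rather than the paper's exact objective: class-balancing weights and any SVS-specific terms are omitted, and the function names are hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def pixel_cross_entropy(logits, target):
    """Mean per-pixel cross-entropy; logits (H, W, K), target (H, W) ints."""
    logp = log_softmax(logits)
    picked = np.take_along_axis(logp, target[..., None], axis=-1)
    return -picked.mean()


def oasis_losses(logits_real, logits_fake, labels):
    """Return (discriminator_loss, generator_loss) for one image pair.

    Logits have N + 1 channels; the last channel is the assumed 'fake' class.
    labels holds the ground-truth semantic class index (0..N-1) per pixel.
    """
    fake_class = logits_real.shape[-1] - 1
    fake_target = np.full_like(labels, fake_class)
    # Discriminator: segment real pixels correctly, mark generated pixels as fake.
    d_loss = (pixel_cross_entropy(logits_real, labels)
              + pixel_cross_entropy(logits_fake, fake_target))
    # Generator: make generated pixels be segmented as their intended class.
    g_loss = pixel_cross_entropy(logits_fake, labels)
    return d_loss, g_loss
```

Because the generator is rewarded only when each generated pixel is classified as its intended semantic class, the supervision ties realism directly to alignment with the input semantic map.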
