LSA: Localized Semantic Alignment for Enhancing Temporal Consistency in Traffic Video Generation

Mirlan Karimov, Teodora Spasojevic, Markus Braun, Julian Wiederer, Vasileios Belagiannis, Marc Pollefeys

TL;DR

Diffusion-based traffic video generation often exhibits temporal inconsistencies that hinder its utility as a scalable data engine for autonomous systems. This paper introduces Localized Semantic Alignment (LSA), a training-time regularizer that fine-tunes pre-trained video diffusion models by enforcing semantic feature consistency between ground-truth and generated frames, with emphasis on dynamic-object regions. The objective combines a localized semantic feature loss, computed from DINOv2 embeddings, with the standard diffusion loss: $\mathcal{L} = 0.9\mathcal{L}_{\text{diff}} + \lambda_{\text{feat}}\mathcal{L}_{\text{feat}}$, where $\lambda_{\text{feat}}$ is set per dataset (e.g., 100 for nuScenes, 60 for KITTI). Experiments on nuScenes and KITTI show consistent improvements in FVD, FID, and detection-based metrics (mAP and mIoU) over vanilla SVD and Ctrl-V 1-to-0, without adding inference-time overhead, indicating that a training-time semantic regularizer can yield robust temporal coherence and transferable benefits to controllable video generation pipelines.
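The combined objective above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes patch-level features (such as those a DINOv2 backbone would produce) are already available as arrays of shape `(T, H, W, C)`, that ground-truth boxes are given in pixel coordinates, and that the feature distance is cosine distance, which the text does not specify. The helper names `boxes_to_patch_mask`, `localized_feature_loss`, and `lsa_objective` are hypothetical.

```python
import numpy as np

def boxes_to_patch_mask(boxes, grid_h, grid_w, img_h, img_w):
    """Rasterize pixel-space boxes (x1, y1, x2, y2) onto a patch grid."""
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        c0 = int(np.floor(x1 / img_w * grid_w))
        c1 = int(np.ceil(x2 / img_w * grid_w))
        r0 = int(np.floor(y1 / img_h * grid_h))
        r1 = int(np.ceil(y2 / img_h * grid_h))
        mask[max(r0, 0):min(r1, grid_h), max(c0, 0):min(c1, grid_w)] = True
    return mask

def localized_feature_loss(feat_gt, feat_gen, mask, eps=1e-8):
    """Mean cosine distance over masked patches (distance choice is an assumption).

    feat_gt, feat_gen: (T, H, W, C) patch features; mask: (T, H, W) booleans.
    """
    dot = (feat_gt * feat_gen).sum(axis=-1)
    norms = np.linalg.norm(feat_gt, axis=-1) * np.linalg.norm(feat_gen, axis=-1)
    dist = 1.0 - dot / (norms + eps)
    n = mask.sum()
    # With no dynamic-object patches there is nothing to align.
    return float((dist * mask).sum() / n) if n > 0 else 0.0

def lsa_objective(loss_diff, loss_feat, lambda_feat, diff_weight=0.9):
    """L = 0.9 * L_diff + lambda_feat * L_feat, as stated in the TL;DR."""
    return diff_weight * loss_diff + lambda_feat * loss_feat
```

With `lambda_feat=100` (the nuScenes value quoted above), a diffusion loss of 1.0 and a feature loss of 0.5 give a total of 50.9, which shows how strongly the feature term can dominate depending on its scale.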

Abstract

Controllable video generation has emerged as a versatile tool for autonomous driving, enabling realistic synthesis of traffic scenarios. However, existing methods depend on control signals at inference time to guide the generative model towards temporally consistent generation of dynamic objects, limiting their utility as scalable and generalizable data engines. In this work, we propose Localized Semantic Alignment (LSA), a simple yet effective framework for fine-tuning pre-trained video generation models. LSA enhances temporal consistency by aligning semantic features between ground-truth and generated video clips. Specifically, we compare the outputs of an off-the-shelf feature extraction model on the ground-truth and generated video clips, localized around dynamic objects, which induces a semantic feature consistency loss. We fine-tune the base model by combining this loss with the standard diffusion loss. The model fine-tuned for a single epoch with our novel loss outperforms the baselines on common video generation evaluation metrics. To further test the temporal consistency of generated videos, we adapt two additional metrics from the object detection task, namely mAP and mIoU. Extensive experiments on the nuScenes and KITTI datasets show the effectiveness of our approach in enhancing temporal consistency in video generation without requiring external control signals during inference or incurring additional computational overhead.
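To make the detection-based metrics concrete, the sketch below computes box IoU and a simple mean IoU between ground-truth boxes and boxes detected in a generated frame. It is only an illustration under stated assumptions: the abstract does not describe the exact evaluation protocol, so the best-overlap matching rule used here (not a one-to-one assignment) and the function names `box_iou` and `mean_iou` are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(gt_boxes, det_boxes):
    """Average, over ground-truth boxes, of the best IoU with any detection.

    A ground-truth box with no overlapping detection contributes 0.
    """
    if not gt_boxes:
        return 0.0
    best = [max((box_iou(g, d) for d in det_boxes), default=0.0) for g in gt_boxes]
    return sum(best) / len(best)
```

Under this reading, an object that drifts or vanishes in the generated clip lowers the score, because detections on generated frames no longer overlap the ground-truth boxes.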

Paper Structure (14 sections, 5 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Improved ego-motion by fine-tuning SVD with our LSA framework. Left: Ego trajectories estimated by VGGT from the generated videos show that semantic feature alignment via the LSA framework greatly improves temporal consistency and yields more accurate ego-motion that closely follows the ground-truth trajectory. Right: The improvement is also evident in the generated videos from nuScenes, where the SVD that is fine-tuned without LSA exhibits drift into an unnatural path and degraded frame quality over time.
  • Figure 2: Overview of LSA, our proposed framework for improving temporal consistency in video generation. LSA introduces a semantic feature consistency loss that enforces alignment between the semantic representations of SVD-generated frames $\hat{\mathbf{x}}$ and their corresponding ground-truth frames $\mathbf{x}_{0}$, specifically within dynamic-object regions defined by ground-truth bounding boxes $\mathbf{bb}_{\text{gt}}$, promoting appearance consistency and temporally stable localization. Semantic features are extracted with DINOv2, while ground-truth bounding boxes provide spatial supervision. The inference stage of SVD fine-tuned with LSA is identical to that of the original SVD and therefore does not require bounding boxes. Example input frames are from nuScenes.
  • Figure 3: Visual comparison of our method with SVD (top) and Ctrl-V 1-to-0 (bottom) on nuScenes. Our method yields more temporally consistent motion dynamics of surrounding traffic agents (top) and ego-motion (bottom). Red circles mark inconsistent motions, while green circles mark correct ones.