STREAM: Spatio-TempoRal Evaluation and Analysis Metric for Video Generative Models

Pum Jun Kim, Seojun Kim, Jaejun Yoo

TL;DR

STREAM addresses the inadequacy of existing video evaluation metrics by decoupling spatial realism from temporal naturalness. It leverages per-frame image embeddings and FFT-based analysis to compute STREAM-T for temporal flow and STREAM-S (with STREAM-F and STREAM-D) for spatial fidelity and diversity, enabling length-agnostic assessments. The approach yields bounded, interpretable scores and demonstrates strong correlation with human judgments, while revealing weaknesses in current video generative models, especially for longer sequences. The work offers a practical, implementable tool with code available at https://github.com/pro2nit/STREAM, to guide development of more realistic and temporally coherent video generation systems.
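The split between spatial and temporal information described above can be illustrated with a minimal NumPy sketch. It assumes a video is already represented as a `(T, D)` array of per-frame image embeddings; the function name `split_spatial_temporal` and the random stand-in features are hypothetical, and the sketch stops at the FFT decomposition. It does not reproduce how STREAM-T, STREAM-F, or STREAM-D are actually scored.

```python
import numpy as np


def split_spatial_temporal(frame_features: np.ndarray):
    """Split per-frame features of shape (T, D) into a spatial and a temporal part.

    Illustrative helper only (not the authors' implementation).
    """
    num_frames = frame_features.shape[0]
    # FFT along the temporal axis; rfft is enough because the features are real.
    spectrum = np.fft.rfft(frame_features, axis=0)
    # The zero-frequency bin is the sum over frames, so dividing by T gives
    # the temporal mean of each feature dimension (the "spatial" summary).
    spatial_summary = spectrum[0].real / num_frames
    # Amplitudes of the nonzero frequencies describe how features vary over time.
    temporal_amplitudes = np.abs(spectrum[1:])
    return spatial_summary, temporal_amplitudes


# Stand-in for real embeddings: 32 frames, 8 feature dimensions (assumed sizes).
rng = np.random.default_rng(0)
features = rng.normal(size=(32, 8))
spatial, temporal = split_spatial_temporal(features)
```

Because the input can have any number of frames `T`, nothing in this decomposition ties the analysis to a fixed clip length, which is consistent with the length-agnostic claim above.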

Abstract

Image generative models have made significant progress in generating realistic and diverse images, supported by comprehensive guidance from various evaluation metrics. However, current video generative models struggle to generate even short video clips, and few tools provide insight into how to improve them. Current video evaluation metrics are simple adaptations of image metrics that swap the image embedding network for a video embedding network, which may underestimate the unique characteristics of video. Our analysis reveals that the widely used Fréchet Video Distance (FVD) places a stronger emphasis on the spatial aspect than on the temporal naturalness of video and is inherently constrained by the input size of the embedding networks used, limiting it to 16 frames. Additionally, it demonstrates considerable instability and diverges from human evaluations. To address these limitations, we propose STREAM, a new video evaluation metric uniquely designed to independently evaluate spatial and temporal aspects. This feature allows comprehensive analysis and evaluation of video generative models from various perspectives, unconstrained by video length. We provide analytical and experimental evidence demonstrating that STREAM provides an effective evaluation tool for both visual and temporal quality of videos, offering insights into areas for improvement for video generative models. To the best of our knowledge, STREAM is the first evaluation metric that can separately assess the temporal and spatial aspects of videos. Our code is available at https://github.com/pro2nit/STREAM.

Paper Structure

This paper contains 42 sections, 6 equations, 27 figures, 8 tables, 1 algorithm.

Figures (27)

  • Figure 1: An illustration of the proposed evaluation pipeline. We use image embedding space to evaluate video regardless of its length and to consider the spatial and temporal aspects of video independently (Section \ref{sec:embedding}). Then, we use the Fast Fourier Transform (FFT) along the temporal axis of frame features to capture the variation over time for evaluation and to utilize the average at frequency zero for spatial evaluation (Section \ref{sec:notation}). Finally, we calculate STREAM-S and STREAM-T to evaluate the video generative models (Sections \ref{sec:STREAM-S} and \ref{sec:STREAM-T}).
  • Figure 2: Behavior of STREAM regarding noise affecting the "visual quality". All noise used in the experiment is applied equally to every frame of a video. Luminance shift decreases the contrast of all frames as the intensity increases. Color jitter is applied by randomly sampling colors for each video sample, thereby affecting the overall color tone of the video.
  • Figure 3: Comparison of the behaviors of STREAM and FVD when changes are introduced to the "temporal flow" of video data. As shown in the example, local swap involves swapping the order of two randomly selected frames within the video, while global swap entails exchanging a randomly chosen frame with a frame from another video.
  • Figure 4: Behaviors of STREAM and FVD in response to various temporal flow modifications. Random translation applies random directional shifts to each video frame. The translation intensity indicates the number of pixels to be shifted in a random direction. Replacement of video with stop scenes replaces a certain proportion of videos in the dataset with videos containing only still frames.
  • Figure 5: Behavior of STREAM when noise affecting the "visual quality" is applied to the real-world data (UCF-101). All noise used in the experiment is applied equally to every frame of a video. Color jitter is applied by randomly sampling color filters for each video, which alters the overall color tone of the video.
  • ...and 22 more figures
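
The two temporal perturbations named in the Figure 3 caption, local swap and global swap, can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' experimental code: videos are assumed to be NumPy arrays of shape `(T, H, W, C)`, the function names are hypothetical, and `global_swap` is simplified to a one-way replacement that modifies only the first video.

```python
import numpy as np


def local_swap(video: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Swap two distinct, randomly chosen frames within a single video."""
    out = video.copy()
    i, j = rng.choice(video.shape[0], size=2, replace=False)
    # The right-hand side is copied before assignment, so the swap is safe.
    out[[i, j]] = out[[j, i]]
    return out


def global_swap(video: np.ndarray, other: np.ndarray,
                rng: np.random.Generator) -> np.ndarray:
    """Replace one random frame of `video` with a random frame of `other`.

    Assumption: only `video` is modified; `other` is left untouched.
    """
    out = video.copy()
    i = rng.integers(video.shape[0])
    j = rng.integers(other.shape[0])
    out[i] = other[j]
    return out


# Toy videos whose pixel values encode the frame index (8 frames of 4x4 RGB).
rng = np.random.default_rng(0)
video_a = np.arange(8, dtype=float).reshape(8, 1, 1, 1) * np.ones((8, 4, 4, 3))
video_b = video_a + 100.0  # a second video, distinguishable by its values
swapped_local = local_swap(video_a, rng)
swapped_global = global_swap(video_a, video_b, rng)
```

Local swap keeps the set of frames unchanged and only disturbs their order, whereas global swap introduces a frame with foreign content, so the two perturbations probe different aspects of temporal flow.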