GenRec: Unifying Video Generation and Recognition with Diffusion Models

Zejia Weng; Xitong Yang; Zhen Xing; Zuxuan Wu; Yu-Gang Jiang

TL;DR

GenRec presents a unified diffusion-based framework that jointly learns video generation and recognition by applying random-frame conditioning and latent masking on a Stable Video Diffusion backbone. It combines generative denoising with a recognition head, balancing losses to produce high-quality video generation while maintaining competitive recognition performance, even with partially observed frames. The approach achieves strong results on standard benchmarks and demonstrates robustness in low-information scenarios, while enabling class-guided generation. This unified model showcases the potential of integrating generation and understanding in a single diffusion-based architecture for versatile video analysis.
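The loss balancing mentioned above can be pictured as a weighted sum of a denoising term and a classification term. The following is a minimal sketch under stated assumptions, not the authors' implementation: the mean-squared-error form of the denoising term, the function names, and the `gen_weight` parameter are all hypothetical choices made for illustration.

```python
import numpy as np

def denoising_loss(pred_latent, clean_latent):
    """Mean squared error between predicted and clean latents (assumed form)."""
    return float(np.mean((pred_latent - clean_latent) ** 2))

def cross_entropy(logits, label):
    """Cross-entropy for a single example, using a numerically stable log-softmax."""
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[label])

def joint_loss(pred_latent, clean_latent, logits, label, gen_weight=1.0):
    """Recognition loss plus a weighted generation loss.

    gen_weight is a hypothetical balancing knob, not a value from the paper.
    """
    rec = cross_entropy(logits, label)
    gen = denoising_loss(pred_latent, clean_latent)
    return rec + gen_weight * gen

# Toy data: a "latent" of 4 frames, each 8x8, and 3 action classes.
rng = np.random.default_rng(0)
clean = rng.normal(size=(4, 8, 8))
pred = clean + 0.1 * rng.normal(size=clean.shape)
logits = np.array([2.0, 0.5, -1.0])

total = joint_loss(pred, clean, logits, label=0, gen_weight=0.5)
```

Raising `gen_weight` in this sketch shifts the optimization toward reconstruction quality, while lowering it favours the recognition head; where the right balance lies is an empirical question.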

Abstract

Video diffusion models are able to generate high-quality videos by learning strong spatial-temporal priors on large-scale datasets. In this paper, we investigate whether such priors derived from a generative process are suitable for video recognition, and ultimately for the joint optimization of generation and recognition. Building upon Stable Video Diffusion, we introduce GenRec, the first unified framework trained with a random-frame conditioning process so as to learn generalized spatial-temporal representations. The resulting framework naturally supports both generation and recognition and, more importantly, is robust even when visual inputs contain limited information. Extensive experiments demonstrate the efficacy of GenRec for both recognition and generation. In particular, GenRec achieves competitive recognition performance, offering 75.8% and 87.2% accuracy on SSV2 and K400, respectively. GenRec also performs the best on class-conditioned image-to-video generation, achieving 46.5 and 49.3 FVD scores on the SSV2 and EK-100 datasets. Furthermore, GenRec demonstrates extraordinary robustness in scenarios where only a limited number of frames can be observed. Code will be available at https://github.com/wengzejia1/GenRec.

Paper Structure (38 sections, 15 equations, 6 figures, 12 tables)

Introduction
Preliminary
GenRec
Pipeline Overview
Latent diffusion and latent masking.
Unifying generation and understanding.
Optimization
Inference for Different Downstream Tasks
Video generation conditioned on frames.
Video generation conditioned on classes.
Standard video recognition.
Video recognition with partially observed frames.
Experiments
Experimental Setup
Datasets.
...and 23 more sections

Figures (6)

Figure 1: Comparison of classical pipelines for video classification and generation tasks with our proposed GenRec method. (a) Classification: Typical video classification focuses on understanding complete videos. (b) Diffusion Generation: Diffusion models learn the noise reduction trajectory from videos with varying levels of noise. These two distinct training paradigms present challenges for task unification. To bridge this gap, we propose (c) GenRec: a learning framework that processes masked frames $V_M$ produced by a masking function $M(\cdot)$ and noisy videos $V_\sigma$ produced by noise sampling $\mathcal{N}(\cdot, \sigma)$, aiming to simultaneously learn video understanding and content completion from the same partially observed visual content.
Figure 2: The pipeline of our proposed video processing method. The input video is first processed by a pretrained encoder $E$ to produce a latent representation $\mathbf{z}_0$, then undergoes diffusion to generate a noisy latent $\tilde{\mathbf{z}}_t$. The random mask $\mathbf{m}$ is used to create the masked latent $\overline{\mathbf{z}_0}$. During training, the noisy latent is concatenated with the masked latent as condition and fed into a Spatial-Temporal UNet, resulting in both reconstruction and recognition outputs. The reconstructed latent can be decoded by the pretrained decoder $D$ to produce the final generated video.
Figure 3: Early action prediction on EK-100 and UCF-100 datasets, with one temporal crop.
Figure 4: Video generation case study. We generate videos given the first frame together with the classifier guidance for various categories.
Figure 5: Video generation case study. We generate videos given the first frame and the last frame.
...and 1 more figures
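
The input construction described in the Figure 2 caption (a noisy latent concatenated with a randomly masked latent) can be sketched as below. This is an illustration under stated assumptions rather than the paper's code: the additive noise model `z0 + sigma * eps`, the per-frame Bernoulli mask with a hypothetical `keep_prob`, the `(T, C, H, W)` layout, and the function name `build_unet_input` are all choices made here for clarity.

```python
import numpy as np

def build_unet_input(z0, sigma, keep_prob, rng):
    """Build a conditioned input from a clean latent z0 of shape (T, C, H, W).

    Returns the channel-wise concatenation of a noisy latent and a
    frame-masked latent, together with the frame mask that was sampled.
    """
    num_frames = z0.shape[0]
    # Per-frame binary mask m: 1 keeps a frame as a condition, 0 hides it.
    m = (rng.random(num_frames) < keep_prob).astype(z0.dtype)
    # Masked latent: hidden frames are zeroed out.
    masked = z0 * m[:, None, None, None]
    # Noisy latent: additive Gaussian noise at level sigma (assumed noise model).
    noisy = z0 + sigma * rng.standard_normal(z0.shape)
    # Concatenate along the channel axis: (T, 2C, H, W).
    unet_in = np.concatenate([noisy, masked], axis=1)
    return unet_in, m

rng = np.random.default_rng(0)
z0 = rng.standard_normal((8, 4, 16, 16))  # 8 frames, 4 latent channels
unet_in, m = build_unet_input(z0, sigma=0.5, keep_prob=0.5, rng=rng)
```

Under the same assumptions, a partially observed setting such as early action prediction would correspond to a mask that keeps only the first few frames, and first-frame conditioning to a mask that keeps only frame 0.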
