UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models

Lan Chen, Yuchao Gu, Qi Mao

TL;DR

UniVid investigates using a single pre-trained video generation model as a unified backbone for diverse vision tasks, addressing scalability and data-collection bottlenecks in prior sequential-vision approaches. It fine-tunes a video diffusion transformer with LoRA-based supervised fine-tuning, framing all tasks as visual sentences A,A',B,B' where the context defines the task and output modality. The method demonstrates strong cross-modal and cross-source generalization, performing both understanding and generation by simply rearranging the visual sentence order, despite training only on natural videos. Mixed-context and multi-task fine-tuning enable robust performance with minimal data (as few as 20 samples per task). This work suggests a scalable pathway for general-purpose vision models based on pre-trained video-generation backbones.

Abstract

Large language models, trained on extensive corpora, successfully unify diverse linguistic tasks within a single generative framework. Inspired by this, recent works like Large Vision Model (LVM) extend this paradigm to vision by organizing tasks into sequential visual sentences, where visual prompts serve as the context to guide outputs. However, such modeling requires task-specific pre-training across modalities and sources, which is costly and limits scalability to unseen tasks. Given that pre-trained video generation models inherently capture temporal sequence dependencies, we explore a more unified and scalable alternative: can a pre-trained video generation model adapt to diverse image and video tasks? To answer this, we propose UniVid, a framework that fine-tunes a video diffusion transformer to handle various vision tasks without task-specific modifications. Tasks are represented as visual sentences, where the context sequence defines both the task and the expected output modality. We evaluate the generalization of UniVid from two perspectives: (1) cross-modal inference with contexts composed of both images and videos, extending beyond LVM's uni-modal setting; (2) cross-source tasks from natural to annotated data, without multi-source pre-training. Despite being trained solely on natural video data, UniVid generalizes well in both settings. Notably, understanding and generation tasks can easily switch by simply reversing the visual sentence order in this paradigm. These findings highlight the potential of pre-trained video generation models to serve as a scalable and unified foundation for vision modeling. Our code will be released at https://github.com/CUC-MIPG/UniVid.
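The visual-sentence framing described above can be illustrated with a small sketch. This is not the authors' code: the `Clip` container, the `build_sentence` and `swap_direction` helpers, and the choice to switch between understanding and generation by swapping the elements within each pair are all assumptions made for illustration. The source states only that a sentence has the form A, A', B, B' and that reordering it switches the task direction.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Clip:
    """One element of a visual sentence (hypothetical container)."""
    name: str
    modality: str  # "image" or "video"
    source: str    # "natural" or "annotated"


def build_sentence(a: Clip, a_prime: Clip, b: Clip, b_prime: Clip) -> Tuple[List[Clip], Clip]:
    """Return (context, target): the context is [A, A', B], the target is B'."""
    return [a, a_prime, b], b_prime


def swap_direction(context: List[Clip], target: Clip) -> Tuple[List[Clip], Clip]:
    """Assumed reordering: (A, A', B) -> B' becomes (A', A, B') -> B."""
    a, a_prime, b = context
    return [a_prime, a, target], b


# Understanding direction: natural clips first, annotated clips second.
A = Clip("A", "video", "natural")
A_PRIME = Clip("A'", "video", "annotated")
B = Clip("B", "video", "natural")
B_PRIME = Clip("B'", "video", "annotated")

understanding = build_sentence(A, A_PRIME, B, B_PRIME)
# Generation direction: the same four clips, reordered.
generation = swap_direction(*understanding)
```

In this sketch the example pair (A, A') carries the task definition and B is the query. Applying `swap_direction` twice returns the original sentence, which reflects the idea that both directions share one representation.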

Paper Structure

This paper contains 22 sections, 17 figures, 7 tables.

Figures (17)

  • Figure 1: LVM [bai2024sequential] vs. UniVid. (a) LVM [bai2024sequential] requires large-scale, modality- and source-specific paired data for pre-training to support diverse vision tasks. In contrast, UniVid explores whether a pre-trained video generation model can be efficiently adapted to a broad range of vision tasks via lightweight SFT with minimal paired data. (b) At inference, LVM [bai2024sequential] is limited to uni-modal visual contexts, whereas UniVid enables a unified framework that accommodates both cross-modal and cross-source vision tasks. Stacked blocks represent videos; a single block represents an image.
  • Figure 2: The framework of UniVid.
  • Figure 3: Main observations. The top colored row serves as a legend indicating the modality and role of each clip shown below. The following figure follows the same format. (a) The model infers the correct output modality from cross-modal contexts. (b) Despite being pre-trained solely on natural video data, it generalizes to cross-source understanding tasks. (c) Under the UniVid framework, understanding and generation tasks are unified and can be converted by reordering the visual sentence.
  • Figure 4: Performance across diverse vision tasks and context formats. We show results for scribble map transfer, motion transfer, and salient object tracking under various visual contexts. Each task is fine-tuned independently within each context configuration, demonstrating that the pre-trained video generation model adapts well across all applicable settings listed in Table \ref{tab:patterns}. With a fixed example pair, outputs change with the query, reflecting context-based inference.
  • Figure 5: Unified understanding and generation tasks. Our proposed UniVid allows flexible switching between understanding and generation tasks by simply reordering visual sentences.
  • ...and 12 more figures