Unison: A Fully Automatic, Task-Universal, and Low-Cost Framework for Unified Understanding and Generation

Shihao Zhao, Yitong Chen, Zeyinzi Jiang, Bojia Zi, Shaozhe Hao, Yu Liu, Chaojie Mao, Kwan-Yee K. Wong

TL;DR

Unison tackles unified multimodal understanding and generation with a low-cost, two-stage framework that preserves pre-trained capabilities while enabling automatic task planning. It uses a planning dataset to train a stage-one understanding model (Qwen2.5-VL) with LoRA fine-tuning to identify task types and hyper-parameters, and a stage-two generator (VACE) guided by a trainable projector for cross-stage alignment. With only 500k training samples and 50 GPU hours, Unison covers 12 tasks across text, image, and video modalities, including generation tasks like text-to-video, editing, controllable generation, and IP-based reference generation. Experiments show competitive performance on standard benchmarks, high automation of task planning, and markedly lower training costs, making unified multimodal understanding and generation more accessible to researchers with limited resources.

Abstract

Unified understanding and generation is a highly appealing research direction in multimodal learning. There exist two approaches: one trains a transformer via an auto-regressive paradigm, and the other adopts a two-stage scheme connecting pre-trained understanding and generative models for alignment fine-tuning. The former demands massive data and computing resources unaffordable for ordinary researchers. Though the latter requires a lower training cost, existing works often suffer from limited task coverage or poor generation quality. Both approaches lack the ability to parse input meta-information (such as task type, image resolution, video duration, etc.) and require manual parameter configuration that is tedious and non-intelligent. In this paper, we propose Unison which adopts the two-stage scheme while preserving the capabilities of the pre-trained models well. With an extremely low training cost, we cover a variety of multimodal understanding tasks, including text, image, and video understanding, as well as diverse generation tasks, such as text-to-visual content generation, editing, controllable generation, and IP-based reference generation. We also equip our model with the ability to automatically parse user intentions, determine the target task type, and accurately extract the meta-information required for the corresponding task. This enables full automation of various multimodal tasks without human intervention. Experiments demonstrate that, under a low-cost setting of only 500k training samples and 50 GPU hours, our model can accurately and automatically identify tasks and extract relevant parameters, and achieve superior performance across a variety of understanding and generation tasks.

Paper Structure

This paper contains 14 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Overview of Unison. The left side shows the stage-one understanding model, and the right side shows the stage-two generation model. The two components are connected via a projector. For understanding tasks, the stage-one model directly outputs the results. For generation tasks, the stage-one model extracts the task type and the corresponding hyper-parameters from the user's input, then passes this information to the generation model to produce visual content. Training has two parts: the understanding model is fine-tuned with LoRA so that it can comprehend user intent, and the projector is trained for alignment while the models in both stages stay frozen.
  • Figure 2: Visualizations of Unison’s multimodal understanding capabilities. From left to right, the results correspond to understanding of text, images, and videos, respectively. In the figures, the green dialogue boxes represent user inputs, and the blue dialogue boxes represent the model’s responses.
  • Figure 3: Visualizations of Unison’s multimodal generation capabilities. The left column shows the user input: the green box contains the input prompt, and the content to its left is the image, video, or mask condition. The middle blue box shows the output of the stage-one model, which mainly consists of signal tokens that guide the generation task of the stage-two model. The right column shows the final generated results. The first three rows correspond to the text-to-image, image editing, and image reference generation tasks, respectively. The following five rows illustrate the text-to-video, image-to-video, video reference generation, video editing, and video controllable generation tasks, respectively. For visualization, three uniformly sampled frames are shown for each generated video.
  • Figure 4: Ablation study on whether alignment is performed between the understanding and generation models in stage two. The left example illustrates a video generation task, and the right example shows an image generation task. In each case, the green box represents the user input, where the orange part indicates the task type and hyper-parameters that are unrelated to the visual content. The first row of results corresponds to using the projector for alignment, and the second row corresponds to no training or alignment being applied.
  • Figure 5: The system prompt used for combining templates and raw user instructions.
  • ...and 2 more figures
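
The captions for Figures 1 and 3 describe a routing step: the stage-one model either answers directly (understanding) or emits a task type plus hyper-parameters that are handed to the stage-two generator. The sketch below illustrates only that routing logic. The tag syntax (`<task=...>`, `<width=...>`), the `Plan` class, and the functions `parse_plan` and `route` are all hypothetical and chosen for illustration; the paper's actual signal-token format is not given in this summary.

```python
import re
from dataclasses import dataclass, field

# Hypothetical label for outputs that carry no generation signal.
UNDERSTANDING = "understanding"

# Assumed tag syntax such as "<task=text-to-video><frames=81>".
# This is an illustrative format, not the paper's signal tokens.
_TAG = re.compile(r"<(\w+)=([^<>]+)>")


@dataclass
class Plan:
    task: str                                   # task type chosen by stage one
    params: dict = field(default_factory=dict)  # meta-information (resolution, frames, ...)
    prompt: str = ""                            # remaining text content


def parse_plan(stage_one_output: str) -> Plan:
    """Split a stage-one output string into task type, hyper-parameters, and prompt."""
    tags = dict(_TAG.findall(stage_one_output))
    task = tags.pop("task", None)
    if task is None:
        # No task tag: treat the text as a direct understanding answer.
        return Plan(task=UNDERSTANDING, prompt=stage_one_output.strip())
    # Convert purely numeric values to int; keep everything else as a string.
    params = {k: int(v) if v.isdigit() else v for k, v in tags.items()}
    prompt = _TAG.sub("", stage_one_output).strip()
    return Plan(task=task, params=params, prompt=prompt)


def route(plan: Plan) -> str:
    """Return which stage produces the final result for this plan."""
    return "answer" if plan.task == UNDERSTANDING else "generate"


if __name__ == "__main__":
    out = "<task=text-to-video><width=832><height=480><frames=81>A corgi surfing at sunset"
    plan = parse_plan(out)
    print(plan.task, plan.params, route(plan))
    print(route(parse_plan("The image shows two dogs on a beach.")))
```

In a real system the generation branch would pass the parsed parameters and the projected stage-one features to the generator; here `route` only returns a label so that the example stays self-contained.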