Unison: A Fully Automatic, Task-Universal, and Low-Cost Framework for Unified Understanding and Generation
Shihao Zhao, Yitong Chen, Zeyinzi Jiang, Bojia Zi, Shaozhe Hao, Yu Liu, Chaojie Mao, Kwan-Yee K. Wong
TL;DR
Unison tackles unified multimodal understanding and generation with a low-cost, two-stage framework that preserves pre-trained capabilities while enabling automatic task planning. It uses a planning dataset to train a stage-one understanding model (Qwen2.5-VL) with LoRA fine-tuning to identify task types and hyper-parameters, and a stage-two generator (VACE) guided by a trainable projector for cross-stage alignment. With only 500k training samples and 50 GPU hours, Unison covers 12 tasks across text, image, and video modalities, including generation tasks like text-to-video, editing, controllable generation, and IP-based reference generation. Experiments show competitive performance on standard benchmarks, high automation of task planning, and markedly lower training costs, making unified multimodal understanding and generation more accessible to researchers with limited resources.
Abstract
Unified understanding and generation is a highly appealing research direction in multimodal learning. Two approaches exist: one trains a transformer under an auto-regressive paradigm, and the other adopts a two-stage scheme that connects pre-trained understanding and generative models and aligns them through fine-tuning. The former demands data and computing resources on a scale that ordinary researchers cannot afford. The latter costs less to train, but existing works often suffer from limited task coverage or poor generation quality. Both approaches lack the ability to parse input meta-information (such as task type, image resolution, or video duration) and instead require manual parameter configuration, which is tedious and far from intelligent. In this paper, we propose Unison, which adopts the two-stage scheme while keeping the capabilities of the pre-trained models well preserved. At an extremely low training cost, we cover a variety of multimodal understanding tasks, including text, image, and video understanding, as well as diverse generation tasks, such as text-to-visual content generation, editing, controllable generation, and IP-based reference generation. We also equip our model with the ability to automatically parse user intentions, determine the target task type, and accurately extract the meta-information required for the corresponding task. This enables full automation of various multimodal tasks without human intervention. Experiments demonstrate that, under a low-cost setting of only 500k training samples and 50 GPU hours, our model can accurately and automatically identify tasks and extract the relevant parameters, and that it achieves superior performance across a variety of understanding and generation tasks.
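To make the automatic planning step more concrete, the following is a minimal sketch of how a planner's output could be parsed and routed. It is an illustration under stated assumptions, not the Unison implementation: the JSON output format, the task labels, the meta-information keys (`width`, `height`, `duration_s`), and the names `Plan`, `parse_plan`, and `route` are all hypothetical and do not come from the paper.

```python
import json
from dataclasses import dataclass, field

# Hypothetical task labels for illustration only; the actual 12-task
# taxonomy used by Unison is not listed in this section.
GENERATION_TASKS = {"text_to_image", "text_to_video", "image_editing"}
UNDERSTANDING_TASKS = {"text_understanding", "image_understanding", "video_understanding"}


@dataclass
class Plan:
    """Task type plus the meta-information extracted from the user request."""
    task: str
    meta: dict = field(default_factory=dict)

    @property
    def needs_generator(self) -> bool:
        # Generation tasks are handed to the second stage.
        return self.task in GENERATION_TASKS


def parse_plan(raw: str) -> Plan:
    """Parse an assumed JSON planner output into a validated Plan."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("planner output must be a JSON object")
    task = data.get("task")
    if task not in GENERATION_TASKS | UNDERSTANDING_TASKS:
        raise ValueError(f"unknown task type: {task!r}")
    # Everything except the task label is treated as meta-information.
    meta = {key: value for key, value in data.items() if key != "task"}
    return Plan(task=task, meta=meta)


def route(plan: Plan) -> str:
    """Decide which stage handles the request (names are placeholders)."""
    return "stage_two_generator" if plan.needs_generator else "stage_one_answer"


if __name__ == "__main__":
    raw = '{"task": "text_to_video", "width": 832, "height": 480, "duration_s": 5}'
    plan = parse_plan(raw)
    print(plan.task, plan.meta, route(plan))
```

In this sketch, validation happens before routing so that a malformed or unrecognized plan fails early instead of reaching the generator with missing parameters.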
