Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation

Jihai Zhang, Tianle Li, Linjie Li, Zhengyuan Yang, Yu Cheng

TL;DR

The paper tackles whether unified vision-language models (VLMs) truly generalize across understanding and generation tasks. It uses a carefully controlled synthetic dataset and multiple architecture configurations to study cross-task generalization, alignment of vision spaces, and transfer of generation-derived knowledge to understanding tasks. Key findings show mutual benefits between tasks that scale with data, a critical role for aligning vision input and output spaces, and knowledge transfer occurring within the base language model. A real-world validation with LLaVA demonstrates that mixed training improves performance across benchmarks without task interference, underscoring the practical value of unified VLMs for scalable vision-language systems.

Abstract

Recent advancements in unified vision-language models (VLMs), which integrate both visual understanding and generation capabilities, have attracted significant attention. The underlying hypothesis is that a unified architecture with mixed training on both understanding and generation tasks can enable mutual enhancement between understanding and generation. However, this hypothesis remains underexplored in prior works on unified VLMs. To address this gap, this paper systematically investigates the generalization across understanding and generation tasks in unified VLMs. Specifically, we design a dataset closely aligned with real-world scenarios to facilitate extensive experiments and quantitative evaluations. We evaluate multiple unified VLM architectures to validate our findings. Our key findings are as follows. First, unified VLMs trained with mixed data exhibit mutual benefits in understanding and generation tasks across various architectures, and these mutual benefits can scale up with increased data. Second, better alignment between multimodal input and output spaces leads to better generalization. Third, the knowledge acquired during generation tasks can transfer to understanding tasks, and this cross-task generalization occurs within the base language model, beyond modality adapters. Our findings underscore the critical necessity of unifying understanding and generation in VLMs, offering valuable insights for the design and optimization of unified VLMs.

Paper Structure

This paper contains 17 sections, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Unified VLMs surpass understanding-only and generation-only models. Alignment of vision input-output spaces further boosts performance. Results are from the experiments section (label sec:exp).
  • Figure 2: Samples from the Smart Watch UI Dataset with different ground truth attributes.
  • Figure 3: Image understanding and generation performance of VLMs during training. "_g" denotes generation-only training, and "_u" denotes understanding-only training. Unified VLMs trained with a mixture of understanding and generation data outperform task-specific models trained with understanding-only or generation-only data.
  • Figure 4: Comparison between VLMs with and without vision input space distortion. Vision input space distortion has little effect on understanding-only VLMs, but significantly decreases the performance of unified VLMs.
  • Figure 5: Performance of SigLIP-VQ and VQ-VQ under varying data scales. Increasing only the amount of generation data can boost performance in understanding tasks, and vice versa.
  • ...and 3 more figures