HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation

Ling Yang, Xinchen Zhang, Ye Tian, Chenming Shang, Minghao Xu, Wentao Zhang, Bin Cui

TL;DR

This work identifies a consistent gap in multimodal LLMs where understanding exceeds generation. It introduces HermesFlow, a general framework that uses homologous input data to curate paired understanding and generation preferences and aligns them with Pair-DPO under a self-play paradigm. Across experiments, HermesFlow narrows the gap and achieves competitive performance on both understanding and image generation benchmarks with relatively small backbones. The approach suggests a practical, data-efficient pathway for aligning future multimodal foundation models.

Abstract

The remarkable success of the autoregressive paradigm has driven significant advances in Multimodal Large Language Models (MLLMs), with powerful models like Show-o, Transfusion, and Emu3 achieving notable progress in unified image understanding and generation. For the first time, we uncover a common phenomenon: the understanding capabilities of MLLMs are typically stronger than their generative capabilities, with a significant gap between the two. Building on this insight, we propose HermesFlow, a simple yet general framework designed to seamlessly bridge the gap between understanding and generation in MLLMs. Specifically, we take homologous data as input to curate homologous preference data for both understanding and generation. Through Pair-DPO and self-play iterative optimization, HermesFlow effectively aligns multimodal understanding and generation using this homologous preference data. Extensive experiments demonstrate the significant superiority of our approach over prior methods, particularly in narrowing the gap between multimodal understanding and generation. These findings highlight the potential of HermesFlow as a general alignment framework for next-generation multimodal foundation models. Code: https://github.com/Gen-Verse/HermesFlow

Paper Structure

This paper contains 30 sections, 15 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: Architecture comparison between (a) DPO training to improve multimodal understanding [zhou2024calibrated, he2024self], (b) DPO training to improve multimodal generation [wang2024emu3], and (c) our HermesFlow.
  • Figure 2: Motivation of HermesFlow. (a) A general pipeline to quantitatively assess an MLLM's multimodal understanding and generation performance. (b) The imbalance between understanding and generation capabilities is a common phenomenon in MLLMs, and our method significantly narrows this disparity. For detailed descriptions, please refer to \ref{gapscore}.
  • Figure 3: Pipeline of HermesFlow. We begin by curating paired data that captures both understanding and generation preferences from homologous input data. Leveraging this homologous preference data, we design Pair-DPO and employ self-play iterative optimization to seamlessly bridge the gap between multimodal understanding and generation.
  • Figure 4: Qualitative comparison between our HermesFlow and three outstanding multimodal LLMs: VILA-U [wu2024vila], Janus [wu2024janus], and Show-o [xie2024show]. Colored text denotes the advantages of HermesFlow in generated images.
  • Figure 5: Results of user study.
  • ...and 2 more figures