Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation

Yanzuo Lu, Xin Xia, Manlin Zhang, Huafeng Kuang, Jianbin Zheng, Yuxi Ren, Xuefeng Xiao

TL;DR

Hyper-Bagel reduces the computational cost of unified multimodal models by pairing speculative decoding for next-token prediction with a multi-stage distillation pipeline for diffusion denoising. The resulting lossless 6-NFE model (NFE: number of function evaluations) speeds up text-to-image generation by about 16.7x and image editing by about 22x, and multimodal understanding runs more than 2x faster; a 1-NFE variant enables near real-time interactive editing and generation. Training and distillation use open-source data covering image-text, generation, and editing tasks, and evaluation on GenEval and GEdit-Bench shows that output quality is preserved. Overall, Hyper-Bagel offers a practical way to deploy unified multimodal models with near real-time responsiveness without sacrificing quality.
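As a rough sanity check on the numbers above, the sketch below back-computes the baseline step budget that the reported speedups would imply. It assumes the speedup is simply the ratio of NFE counts with a constant cost per evaluation; the summary does not say how the speedups were measured, so both this reading and the helper name `implied_baseline_nfe` are illustrative assumptions.

```python
def implied_baseline_nfe(speedup: float, accelerated_nfe: int) -> float:
    """Baseline NFE implied by a speedup, ASSUMING speedup = baseline_nfe / accelerated_nfe.

    This ignores any per-step overhead and is only a back-of-the-envelope reading
    of the reported figures, not a number stated in the source.
    """
    return speedup * accelerated_nfe


if __name__ == "__main__":
    # Reported figures for the 6-NFE model: 16.67x (text-to-image), 22x (image editing).
    for task, speedup in [("text-to-image", 16.67), ("image editing", 22.0)]:
        baseline = implied_baseline_nfe(speedup, accelerated_nfe=6)
        print(f"{task}: implied baseline of about {baseline:.0f} NFE")
```

Under this assumption the two tasks start from different baseline budgets, which would explain why the same 6-NFE model yields different speedup factors.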

Abstract

Unified multimodal models have recently attracted considerable attention for their remarkable abilities in jointly understanding and generating diverse content. However, as contexts integrate increasingly numerous interleaved multimodal tokens, the iterative processes of diffusion denoising and autoregressive decoding impose significant computational overhead. To address this, we propose Hyper-Bagel, a unified acceleration framework designed to simultaneously speed up both multimodal understanding and generation tasks. Our approach uses a divide-and-conquer strategy, employing speculative decoding for next-token prediction and a multi-stage distillation process for diffusion denoising. The framework delivers substantial performance gains, achieving over a 2x speedup in multimodal understanding. For generative tasks, our resulting lossless 6-NFE model yields a 16.67x speedup in text-to-image generation and a 22x speedup in image editing, all while preserving the high-quality output of the original model. We further develop a highly efficient 1-NFE model that enables near real-time interactive editing and generation. By combining advanced adversarial distillation with human feedback learning, this model achieves ultimate cost-effectiveness and responsiveness, making complex multimodal interactions seamless and instantaneous.
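For readers unfamiliar with speculative decoding, the following is a minimal sketch of the generic greedy draft-and-verify loop, not the Hyper-Bagel training pipeline or its draft model. The toy models `toy_target` and `toy_draft`, the function names, and the draft length `k` are all hypothetical; in a real system the verification calls would be a single batched forward pass of the large model, which is where the speedup comes from.

```python
from typing import Callable, List

# A "model" here is any function mapping a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], num_new: int) -> List[int]:
    """Reference: plain autoregressive greedy decoding with the target model."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], num_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: a cheap draft proposes, the target verifies."""
    tokens = list(prompt)
    goal = len(prompt) + num_new
    while len(tokens) < goal:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each proposed position in order. The first mismatch
        #    is replaced by the target's own token and the rest is discarded.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)
            if expected != tok:
                break
        tokens.extend(accepted)
    return tokens


def toy_target(ctx: List[int]) -> int:
    # Hypothetical "large" model: a fixed deterministic rule on the last token.
    return (ctx[-1] * 3 + 1) % 7


def toy_draft(ctx: List[int]) -> int:
    # Hypothetical "small" model: agrees with the target except after token 5.
    return 0 if ctx[-1] == 5 else toy_target(ctx)


if __name__ == "__main__":
    out = speculative_decode(toy_target, toy_draft, prompt=[1], num_new=10)
    print(out == greedy_decode(toy_target, [1], 10))
```

Because every kept token is the target model's own greedy choice, the output matches plain greedy decoding exactly; the draft only determines how many tokens are confirmed per verification round. Full implementations also append one extra target token when the whole proposal is accepted, which this sketch omits for brevity.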

Paper Structure

This paper contains 18 sections, 2 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Image generation samples produced by our 6-NFE accelerated BAGEL model.
  • Figure 2: Image editing samples produced by our 6-NFE accelerated BAGEL model.
  • Figure 3: Training pipeline for our proposed speculative decoding approach in Hyper-Bagel.
  • Figure 4: Training pipeline for our proposed Distribution Matching Distillation via ODE (DMDO).
  • Figure 5: Qualitative comparison of different accelerated models against the baseline on image generation.
  • ...and 1 more figures