Importance-Based Token Merging for Efficient Image and Video Generation

Haoyu Wu, Jingyi Xu, Hieu Le, Dimitris Samaras

TL;DR

This work addresses the heavy computational burden of diffusion-based image and video generation by introducing an importance-based token merging framework. It uses classifier-free guidance (CFG) to derive per-token importance, builds a pool of important tokens, and applies a bipartite soft-matching strategy to merge tokens while preserving crucial content; this yields higher fidelity and finer details at reduced compute. The method demonstrates state-of-the-art performance across text-to-image, multi-view, and video generation on models like Stable Diffusion 2, Zero123++, AnimateDiff, and PixArt-$\alpha$, with substantial gains in image quality metrics (e.g., FID, PSNR/SSIM, LPIPS) and competitive inference costs. Importantly, CFG provides a low-cost, broadly applicable importance signal, and the approach remains compatible with orthogonal acceleration techniques, enabling flexible, scalable speedups for diffusion-based generation.

Abstract

Token merging can effectively accelerate various vision systems by processing groups of similar tokens only once and sharing the results across them. However, existing token grouping methods are often ad hoc and random, disregarding the actual content of the samples. We show that preserving high-information tokens during merging (those essential for semantic fidelity and structural details) significantly improves sample quality, producing finer details and more coherent, realistic generations. Despite being simple and intuitive, this approach remains underexplored. To this end, we propose an importance-based token merging method that prioritizes the most critical tokens in computational resource allocation, leveraging readily available importance scores, such as those from classifier-free guidance in diffusion models. Experiments show that our approach significantly outperforms baseline methods across multiple applications, including text-to-image synthesis, multi-view image generation, and video generation, with various model architectures such as Stable Diffusion, Zero123++, AnimateDiff, and PixArt-$\alpha$.
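
As a rough illustration of how a classifier-free guidance signal could be turned into per-token importance, the sketch below scores each token by the mean absolute difference between the conditional and unconditional predictions and keeps the top-scoring fraction as the pool of important tokens. The exact scoring formula, the function names `cfg_importance` and `important_pool`, and the `keep_fraction` parameter are assumptions made for illustration; this page does not specify them.

```python
from typing import List, Sequence


def cfg_importance(cond: Sequence[Sequence[float]],
                   uncond: Sequence[Sequence[float]]) -> List[float]:
    """Score each token by the mean |conditional - unconditional| prediction.

    Assumption: importance is the magnitude of the CFG difference; the
    source only states that CFG is used to derive the scores.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must contain the same number of tokens")
    scores = []
    for c_tok, u_tok in zip(cond, uncond):
        diffs = [abs(c - u) for c, u in zip(c_tok, u_tok)]
        scores.append(sum(diffs) / len(diffs))
    return scores


def important_pool(scores: Sequence[float], keep_fraction: float) -> List[int]:
    """Return the sorted indices of the highest-scoring tokens.

    `keep_fraction` is a hypothetical knob, not a parameter from the paper.
    """
    k = max(1, int(keep_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Because both predictions are already computed at every denoising step when CFG is enabled, a score of this kind adds only an elementwise subtraction and a reduction per token.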

Paper Structure

This paper contains 34 sections, 2 equations, 15 figures, 15 tables.

Figures (15)

  • Figure 1: Importance-based Token Merging. Our method prioritizes important tokens during token merging, resulting in images with greater detail in essential areas compared to ToMeSD (Bolya and Hoffman, 2023). In the second row, we show the regions (in white) where computation (e.g., attention) will take place after token merging.
  • Figure 2: Overview. We propose an importance-based token merging method. The importance of each token can be determined using classifier-free guidance. These scores, visualized with colors ranging from light to dark (indicating less to more important tokens), are used to construct a pool of important tokens. We randomly select a set of destination (dst) tokens from this pool and the remaining important tokens become source (src) tokens. Bipartite soft matching is then performed between the dst tokens and src tokens. src tokens without a suitable match are considered independent tokens (ind.). All other src tokens and unimportant tokens are merged with the destination tokens for subsequent computational steps.
  • Figure 3: Importance Maps. We present token importance maps derived from classifier-free guidance (CFG) across diffusion inference timesteps. These maps highlight areas that align strongly with the user prompt. In the early steps, they capture the semantics and structure of the image relevant to the prompt, while in later steps, they focus on finer details of the objects the user intends to generate. The generated image is shown on the left for reference.
  • Figure 4: We compare our method with an approach that uses the top-k important tokens as destination tokens (dst) for token merging. The computation locations after token merging are illustrated as non-black pixels in the bottom-left windows. They include the locations of dst tokens, shown in white, and of independent tokens (other tokens that lack a similar dst token for merging), shown in orange. Our method produces a more structured and detailed image, as highlighted in the red box.
  • Figure 5: Qualitative comparison of text-to-image generation. The first column shows results from Stable Diffusion (SD) (Rombach et al., 2022), while the subsequent columns show SD combined with various token merging methods. As highlighted in the red boxes, our approach consistently produces finer details with coherent structures. Note that ATC requires minutes to generate an image, whereas other methods, including ours, complete the task in seconds. The token merging ratio is 0.7. Please see the supplementary material for prompts. Best viewed with zoom-in.
  • ...and 10 more figures
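
The pipeline described in the Figure 2 caption can be sketched in a few lines: destination (dst) tokens are sampled at random from the important pool, the remaining important tokens act as source (src) tokens, src tokens without a sufficiently similar dst token stay independent, and every other token shares the computation of its closest dst token. This is a minimal sketch under stated assumptions, not the authors' implementation: the cosine similarity measure, the threshold `tau` used to decide what counts as a suitable match, the averaging in `merge`, and all function names are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence

Vec = Sequence[float]


def cosine(a: Vec, b: Vec) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def assign_tokens(tokens: Sequence[Vec], pool: Sequence[int], num_dst: int,
                  tau: float, rng: random.Random) -> Dict[int, int]:
    """Map every token index to the index whose computation it will use.

    dst and independent tokens map to themselves; merged tokens map to a dst.
    `tau` is an assumed similarity threshold for a "suitable match".
    """
    dst = rng.sample(sorted(pool), num_dst)  # random dst tokens from the pool
    dst_set, pool_set = set(dst), set(pool)
    assignment: Dict[int, int] = {}
    for i, tok in enumerate(tokens):
        if i in dst_set:
            assignment[i] = i
            continue
        best = max(dst, key=lambda d: cosine(tok, tokens[d]))
        if i in pool_set and cosine(tok, tokens[best]) < tau:
            assignment[i] = i      # important src token with no match: independent
        else:
            assignment[i] = best   # matched src or unimportant token: merged
    return assignment


def merge(tokens: Sequence[Vec], assignment: Dict[int, int]) -> Dict[int, List[float]]:
    """Average each group of tokens into its target (averaging is an assumption)."""
    groups: Dict[int, List[int]] = {}
    for i, target in assignment.items():
        groups.setdefault(target, []).append(i)
    merged: Dict[int, List[float]] = {}
    for target, members in groups.items():
        dim = len(tokens[target])
        merged[target] = [sum(tokens[m][j] for m in members) / len(members)
                          for j in range(dim)]
    return merged


def unmerge(processed: Dict[int, List[float]],
            assignment: Dict[int, int]) -> List[List[float]]:
    """Copy each processed result back to every token that shared it."""
    return [list(processed[assignment[i]]) for i in range(len(assignment))]
```

In use, an expensive block such as attention would run only on the values returned by `merge`, and `unmerge` would restore the full token grid afterwards, so the saving grows with the number of tokens that are not dst or independent.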