Importance-Based Token Merging for Efficient Image and Video Generation
Haoyu Wu, Jingyi Xu, Hieu Le, Dimitris Samaras
TL;DR
This work addresses the heavy computational burden of diffusion-based image and video generation by introducing an importance-based token merging framework. It uses classifier-free guidance (CFG) to derive per-token importance, builds a pool of important tokens, and applies a bipartite soft-matching strategy to merge tokens while preserving crucial content; this yields higher fidelity and finer details at reduced compute. The method demonstrates state-of-the-art performance across text-to-image, multi-view, and video generation on models like Stable Diffusion 2, Zero123++, AnimateDiff, and PixArt-$\alpha$, with substantial gains in image quality metrics (e.g., FID, PSNR/SSIM, LPIPS) and competitive inference costs. Importantly, CFG provides a low-cost, broadly applicable importance signal, and the approach remains compatible with orthogonal acceleration techniques, enabling flexible, scalable speedups for diffusion-based generation.
Abstract
Token merging can effectively accelerate various vision systems by processing groups of similar tokens only once and sharing the results across them. However, existing token grouping methods are often ad hoc and random, disregarding the actual content of the samples. We show that preserving high-information tokens during merging (those essential for semantic fidelity and structural details) significantly improves sample quality, producing finer details and more coherent, realistic generations. Despite being simple and intuitive, this approach remains underexplored. To address this gap, we propose an importance-based token merging method that prioritizes the most critical tokens when allocating computation, leveraging readily available importance scores, such as those from classifier-free guidance in diffusion models. Experiments show that our approach significantly outperforms baseline methods across multiple applications, including text-to-image synthesis, multi-view image generation, and video generation, with various model architectures such as Stable Diffusion, Zero123++, AnimateDiff, and PixArt-$\alpha$.
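
As an illustration of the pipeline summarized above, the following NumPy sketch scores tokens by the gap between conditional and unconditional noise predictions, keeps the highest-scoring tokens as merge destinations, and lets the most similar remaining tokens reuse their destination's output. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `cfg_importance` and `build_importance_merge`, the mean-absolute-difference score, the deterministic top-k choice of destinations, and the drop-and-copy merge rule are all illustrative choices that do not come from the text.

```python
import numpy as np

def cfg_importance(eps_cond, eps_uncond):
    """Per-token importance from classifier-free guidance (assumed form).

    eps_cond, eps_uncond: (N, C) conditional / unconditional noise predictions.
    Returns an (N,) score: mean absolute difference over channels.
    """
    return np.abs(eps_cond - eps_uncond).mean(axis=-1)

def build_importance_merge(x, importance, num_dst, r):
    """Build merge/unmerge functions for tokens x of shape (N, C).

    Assumption: the num_dst most important tokens act as destinations; the
    r remaining (source) tokens most similar to a destination are merged.
    """
    n = x.shape[0]
    if not 0 <= r <= n - num_dst:
        raise ValueError("r must not exceed the number of source tokens")

    order = np.argsort(-importance, kind="stable")
    dst_idx = np.sort(order[:num_dst])   # important tokens, always kept
    src_idx = np.sort(order[num_dst:])   # candidates for merging

    # Cosine similarity between every source and every destination.
    xn = x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    sim = xn[src_idx] @ xn[dst_idx].T    # (num_src, num_dst)
    best_dst = sim.argmax(axis=1)        # best destination per source
    best_sim = sim.max(axis=1)

    # Merge the r sources that match a destination most closely.
    merged = np.argsort(-best_sim, kind="stable")[:r]
    keep = np.ones(len(src_idx), dtype=bool)
    keep[merged] = False
    kept_src, merged_src = src_idx[keep], src_idx[merged]
    merged_to = best_dst[merged]         # positions within dst_idx

    def merge(t):
        # Simplification: merged sources are dropped rather than averaged.
        return np.concatenate([t[kept_src], t[dst_idx]], axis=0)

    def unmerge(y):
        out = np.empty((n, y.shape[-1]), dtype=y.dtype)
        kept_y, dst_y = y[:len(kept_src)], y[len(kept_src):]
        out[kept_src] = kept_y
        out[dst_idx] = dst_y
        out[merged_src] = dst_y[merged_to]  # share the destination's result
        return out

    return merge, unmerge

# Usage on random data: 8 tokens, 3 destinations, 2 merged sources.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))
scores = cfg_importance(rng.normal(size=(8, 4)), rng.normal(size=(8, 4)))
merge, unmerge = build_importance_merge(tokens, scores, num_dst=3, r=2)
reduced = merge(tokens)      # shape (6, 4): fewer tokens enter the layer
restored = unmerge(reduced)  # shape (8, 4): full token grid is recovered
```

In a diffusion model, `merge` would be applied before an expensive block such as attention and `unmerge` after it, so the block processes `N - r` tokens while the important tokens are guaranteed to be computed individually.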
