BiGain: Unified Token Compression for Joint Generation and Classification

Jiacheng Liu, Shengkun Tang, Jiacheng Cui, Dongkuan Xu, Zhiqiang Shen

Abstract

Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity. We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves generation quality while improving classification in accelerated diffusion models. Our key insight is frequency separation: mapping feature-space signals into a frequency-aware representation disentangles fine detail from global semantics, enabling compression that respects both generative fidelity and discriminative utility. BiGain reflects this principle with two frequency-aware operators: (1) Laplacian-gated token merging, which encourages merges among spectrally smooth tokens while discouraging merges of high-contrast tokens, thereby retaining edges and textures; and (2) Interpolate-Extrapolate KV Downsampling, which downsamples keys/values via a controllable interextrapolation between nearest and average pooling while keeping queries intact, thereby conserving attention precision. Across DiT- and U-Net-based backbones and ImageNet-1K, ImageNet-100, Oxford-IIIT Pets, and COCO-2017, our operators consistently improve the speed-accuracy trade-off for diffusion-based classification, while maintaining or enhancing generation quality under comparable acceleration. For instance, on ImageNet-1K, with 70% token merging on Stable Diffusion 2.0, BiGain increases classification accuracy by 7.15% while improving FID by 0.34 (1.85%). Our analyses indicate that balanced spectral retention, preserving high-frequency detail and low/mid-frequency semantics, is a reliable design rule for token compression in diffusion models. To our knowledge, BiGain is the first framework to jointly study and advance both generation and classification under accelerated diffusion, supporting lower-cost deployment.
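The second operator in the abstract, downsampling keys and values by blending nearest and average pooling while leaving queries at full resolution, can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the blend parameter `alpha`, and its convention (0 gives average pooling, 1 gives nearest pooling, values above 1 extrapolate past nearest) are assumptions made for the example.

```python
import numpy as np

def interp_extrap_downsample(x, factor, alpha):
    """Downsample an (H, W, C) token grid by `factor` along each spatial axis.

    Assumed convention (hypothetical, not taken from the paper):
      alpha = 0 -> average pooling
      alpha = 1 -> nearest pooling (top-left token of each block)
      alpha > 1 -> extrapolation beyond nearest, away from the average
    """
    H, W, C = x.shape
    assert H % factor == 0 and W % factor == 0
    # blocks[i, a, j, b] == x[i * factor + a, j * factor + b]
    blocks = x.reshape(H // factor, factor, W // factor, factor, C)
    avg = blocks.mean(axis=(1, 3))
    nearest = blocks[:, 0, :, 0, :]
    return avg + alpha * (nearest - avg)

def attention(q, k, v):
    """Plain softmax attention; q is (Nq, C), k and v are (Nk, C)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Queries keep all 64 tokens; keys/values shrink to 16 tokens.
rng = np.random.default_rng(0)
grid = rng.standard_normal((8, 8, 16))
q = grid.reshape(-1, 16)
k = interp_extrap_downsample(grid, factor=2, alpha=1.5).reshape(-1, 16)
v = interp_extrap_downsample(grid, factor=2, alpha=1.5).reshape(-1, 16)
out = attention(q, k, v)  # shape (64, 16)
```

Because only the key/value side is shortened, the output keeps one row per query token, and the attention matrix shrinks from 64×64 to 64×16 in this toy setting.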

Paper Structure (49 sections, 1 theorem, 26 equations, 6 figures, 16 tables, 4 algorithms)

This paper contains 49 sections, 1 theorem, 26 equations, 6 figures, 16 tables, 4 algorithms.

Key Result

Theorem 1

Assume $\mu>0$, $\mu'>0$, $\sigma^2>0$, and that $\mu'$ and $\sigma'^2$ are defined as above. Then the post-reduction ratio $r'$ is smaller than the original ratio $r$, i.e., $r' < r$, if and only if […]. Moreover, when $|\Delta\mu| \ll \mu$, the first-order sufficient condition guarantees $r'<r$ and hence a strictly improved Cantelli bound.
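To see why a smaller ratio would tighten a Cantelli bound, a small numeric illustration may help. Cantelli's inequality states that for a random variable $X$ with mean $\mu$ and variance $\sigma^2$, $P(X - \mu \le -\lambda) \le \sigma^2/(\sigma^2+\lambda^2)$ for $\lambda>0$. The sketch below assumes, since the definition is not reproduced in this excerpt, that $r = \sigma^2/\mu^2$ and that $X$ is a classification margin, so that setting $\lambda=\mu$ gives $P(X \le 0) \le r/(1+r)$. The function name and the example numbers are hypothetical.

```python
def cantelli_error_bound(mu, var):
    """Upper bound on P(X <= 0) from Cantelli's inequality with lambda = mu.

    Assumes mu > 0 and the ratio r = var / mu**2 (an assumption of this
    sketch; the excerpt does not state the definition of r).
    """
    if mu <= 0 or var <= 0:
        raise ValueError("mu and var must be positive")
    r = var / mu ** 2
    return r / (1.0 + r)

# Hypothetical numbers: compression lowers the mean margin slightly
# but lowers the variance more, so the ratio and the bound both shrink.
before = cantelli_error_bound(mu=1.0, var=1.0)   # r = 1
after = cantelli_error_bound(mu=0.9, var=0.5)    # r = 0.5 / 0.81
```

Since $r/(1+r)$ is strictly increasing in $r$, any reduction with $r' < r$ yields a strictly smaller bound under this reading.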

Figures (6)

  • Figure 1: Framework of our BiGain$_\texttt{TM}$ method. A Laplacian filter is applied to hidden-state tokens to compute local frequency scores. In each spatial stride, the lowest-scoring token is selected as a destination token, while the others form the source set. Destination and source tokens are gathered globally, and a bipartite matching selects top source-destination pairs.
  • Figure 2: Motivation: impact of token compression on diffusion models, evaluated on COCO-2017 and ImageNet-100. Left: ToMe [bolya2023token] (baseline) vs. Laplacian-Gated Merge (ours) as the merge ratio increases. Right: ToDo [smith2024todo] (baseline) vs. Interpolate-Extrapolate KV-Downsampling (ours) as the downsample factor grows. Curves report percent change relative to the uncompressed model ($\uparrow$ better; for FID we plot $-\Delta$FID%). Blue: classification accuracy. Orange: generation quality (FID).
  • Figure 3: Qualitative comparison of BiGain$_\texttt{TM}$ and ToMe on SD-2.0 backbone. From left to right: BiGain$_\texttt{TM}$ with 70%, 50%, and 30% merge ratios, no acceleration, then ToMe with 30%, 50%, and 70% merge ratios.
  • Figure 4: Qualitative comparison of BiGain$_\texttt{TD}$ and ToDo on SD-2.0 backbone. From left to right: BiGain$_\texttt{TD}$ with downsampling factors $4\times$, $3\times$, and $2\times$, no acceleration, then ToDo with factors $2\times$, $3\times$, and $4\times$.
  • Figure 5: Visualization of our Laplacian-based frequency heuristic on hidden representations from Stable Diffusion 2.0. We probe the U-Net at the highest-resolution upsampling stage. The visualization is computed from a noised image without a text prompt, showing the model's intrinsic frequency-aware reconstruction dynamics. To reduce variance, we randomly sample 100 independent noise realizations and visualize the averaged token salience map.
  • ...and 1 more figure
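The pipeline described in the Figure 1 caption (Laplacian scores, per-stride destination selection, global bipartite matching) can be sketched as below. This is a rough NumPy illustration under several assumptions that the caption does not specify: a 4-neighbour Laplacian with edge replication, the per-token score taken as the channel-wise L2 norm of the Laplacian response, cosine similarity for matching, and averaging as the merge rule. All function names are hypothetical.

```python
import numpy as np

def laplacian_scores(x):
    """Per-token frequency score for an (H, W, C) grid (assumed 4-neighbour kernel)."""
    p = np.pad(x, ((1, 1), (1, 1), (0, 0)), mode="edge")
    lap = p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4 * x
    return np.linalg.norm(lap, axis=-1)

def split_dst_src(scores, stride):
    """In each stride x stride block, the lowest-scoring token becomes a destination."""
    H, W = scores.shape
    hb, wb = H // stride, W // stride
    blocks = scores.reshape(hb, stride, wb, stride).transpose(0, 2, 1, 3)
    local = blocks.reshape(hb, wb, stride * stride).argmin(axis=-1)
    di, dj = np.divmod(local, stride)  # offset of the argmin inside each block
    rows = np.arange(hb)[:, None] * stride + di
    cols = np.arange(wb)[None, :] * stride + dj
    dst = (rows * W + cols).ravel()
    is_dst = np.zeros(H * W, dtype=bool)
    is_dst[dst] = True
    return dst, np.flatnonzero(~is_dst)

def laplacian_gated_merge(x, stride, ratio):
    """Merge roughly `ratio` of all tokens into their best-matching destinations."""
    H, W, C = x.shape
    tokens = x.reshape(-1, C)
    dst, src = split_dst_src(laplacian_scores(x), stride)
    unit = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + 1e-8)
    sim = unit[src] @ unit[dst].T            # (n_src, n_dst) cosine similarity
    best_dst, best_sim = sim.argmax(axis=1), sim.max(axis=1)
    r = min(int(ratio * H * W), len(src))    # number of sources to merge
    order = np.argsort(-best_sim)            # most similar pairs first
    merged, kept = order[:r], order[r:]
    sums = tokens[dst].copy()
    counts = np.ones(len(dst))
    np.add.at(sums, best_dst[merged], tokens[src[merged]])
    np.add.at(counts, best_dst[merged], 1)
    return np.concatenate([tokens[src[kept]], sums / counts[:, None]], axis=0)

rng = np.random.default_rng(0)
hidden = rng.standard_normal((8, 8, 4))
reduced = laplacian_gated_merge(hidden, stride=2, ratio=0.5)  # 64 -> 32 tokens
```

Choosing the lowest-scoring token per block as the destination means merges are absorbed in locally smooth regions, so tokens with a strong Laplacian response are less likely to serve as merge targets, in line with the caption's description.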

Theorems & Definitions (1)

  • Theorem 1: Spectral margin–variance improvement