Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient

Zigeng Chen, Xinyin Ma, Gongfan Fang, Xinchao Wang

TL;DR

CoDe partitions the multi-scale inference process into a seamless collaboration between a large model and a small model: the large model serves as the 'drafter', specializing in generating low-frequency content at smaller scales, while the smaller model serves as the 'refiner', solely focusing on predicting high-frequency details at larger scales.

Abstract

In the rapidly advancing field of image generation, Visual Auto-Regressive (VAR) modeling has garnered considerable attention for its innovative next-scale prediction approach. This paradigm offers substantial improvements in efficiency, scalability, and zero-shot generalization. Yet, the inherently coarse-to-fine nature of VAR introduces a prolonged token sequence, leading to prohibitive memory consumption and computational redundancies. To address these bottlenecks, we propose Collaborative Decoding (CoDe), a novel efficient decoding strategy tailored for the VAR framework. CoDe capitalizes on two critical observations: the substantially reduced parameter demands at larger scales and the exclusive generation patterns across different scales. Based on these insights, we partition the multi-scale inference process into a seamless collaboration between a large model and a small model. The large model serves as the 'drafter', specializing in generating low-frequency content at smaller scales, while the smaller model serves as the 'refiner', solely focusing on predicting high-frequency details at larger scales. This collaboration yields remarkable efficiency with minimal impact on quality: CoDe achieves a 1.7x speedup, slashes memory usage by around 50%, and preserves image quality with only a negligible FID increase from 1.95 to 1.98. When drafting steps are further decreased, CoDe can achieve an impressive 2.9x acceleration ratio, reaching 41 images/s at 256x256 resolution on a single NVIDIA 4090 GPU, while preserving a commendable FID of 2.27. The code is available at https://github.com/czg1225/CoDe
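
The drafter/refiner hand-off described in the abstract can be sketched as a short control-flow example. This is a minimal illustration, not the authors' implementation: the two "models" are hypothetical toy stand-ins that return constant token maps, and the scale schedule `SCALES` is an assumption (token-map side lengths whose squares sum to the 680 tokens mentioned in the paper's Figure 3). For the actual code, see the repository linked above.

```python
from typing import Callable, List, Sequence

TokenMap = List[List[int]]
# Hypothetical interface: a model maps (prefix of token maps, side length) to the next token map.
Model = Callable[[List[TokenMap], int], TokenMap]

# Assumed scale schedule (side length of each token map); 10 scales, 680 tokens in total.
SCALES = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)


def make_toy_model(tag: int) -> Model:
    """Build a toy stand-in that fills every token map with `tag` (no real prediction)."""
    def predict(prefix: List[TokenMap], side: int) -> TokenMap:
        return [[tag] * side for _ in range(side)]
    return predict


def collaborative_decode(drafter: Model, refiner: Model,
                         scales: Sequence[int] = SCALES,
                         n_draft: int = 6) -> List[TokenMap]:
    """Generate the first `n_draft` scales with the drafter, the rest with the refiner."""
    if not 0 <= n_draft <= len(scales):
        raise ValueError("n_draft must lie between 0 and the number of scales")
    token_maps: List[TokenMap] = []
    for k, side in enumerate(scales):
        # The refiner continues from the drafter's token maps, which serve as its prefix.
        model = drafter if k < n_draft else refiner
        token_maps.append(model(token_maps, side))
    return token_maps


maps = collaborative_decode(make_toy_model(1), make_toy_model(0))
print([m[0][0] for m in maps])  # which model produced each scale (1 = drafter, 0 = refiner)
```

Lowering `n_draft` moves more scales to the small refiner, which corresponds to the faster, slightly lower-quality setting the abstract reports.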

Paper Structure

This paper contains 14 sections, 6 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: We partition the next-scale prediction process into the efficient collaboration between large and small VAR models.
  • Figure 2: Comparison of generation results between the original VAR-d30 (top) and our VAR-CoDe (bottom) for ImageNet 256$\times$256. Our method achieves a 1.7x speedup (3.62s to 2.11s) and needs only 0.5x the memory (40GB to 20GB), with negligible quality degradation.
  • Figure 3: (a) Effectiveness of increasing parameters at the $k$-th scale is evaluated by predicting token map $r_k$ using four VAR models with different parameter sizes (2B, 1B, 0.6B, and 0.3B), while other scales $(r_1, r_2, \dots, r_{k-1}, r_{k+1}, \dots, r_{10})$ are generated using the largest VAR-d30 model. (b) Fourier spectrum analysis is conducted on generated content at the first 3 scales and the last 3 scales. (c) Training-free performance comparison of model collaboration decoding across various settings of draft tokens $M$ and refiner tokens $680-M$.
  • Figure 4: Overview of the collaborative decoding process, illustrated with $N = 6$ drafting steps. CoDe uses a large VAR model as the drafter $\epsilon_{\theta_d}$ to generate the token maps $R_L = (r_1, r_2, \ldots, r_N)$ at smaller scales. The small refiner model $\epsilon_{\theta_r}$ then uses $R_L$ as an initial prefix to efficiently predict the remaining token maps $R_H = (r_{N+1}, r_{N+2}, \ldots, r_K)$ at larger scales. Both models are fine-tuned on their designated predictive scales using ground truth labels $r_k^*$ and teacher logits $p_{\text{teacher}}(r_k)$, respectively.
  • Figure 5: (a) Our CoDe demonstrates the optimal efficiency-quality trade-off among all evaluated methods. (b) Inference latency is measured across varying batch sizes for the original VAR-d30, our CoDe (N=6), and the VQVAE decoder. (c) We analyze the time cost associated with parallel decoding at each scale, showing that the refiner model is significantly more efficient than the drafter at larger scales.
  • ...and 5 more figures
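
The token split between drafter and refiner (the $M$ and $680-M$ of Figure 3) and the ratios quoted in Figure 2 can be checked with a few lines of arithmetic. This is a rough sketch under one assumption: the schedule `SCALES` of token-map side lengths is not stated on this page and is chosen here because its squares sum to 680 over 10 scales. The helper `token_split` is a hypothetical name.

```python
# Assumed token-map side lengths for the 10 scales; squares sum to 680.
SCALES = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)


def token_split(n_draft, scales=SCALES):
    """Return (draft tokens M, refiner tokens total - M) for a given number of drafting steps."""
    tokens = [side * side for side in scales]
    draft = sum(tokens[:n_draft])
    return draft, sum(tokens) - draft


draft, refine = token_split(6)
print(draft, refine)          # 91 589

# Ratios implied by the Figure 2 numbers (3.62s to 2.11s, 40GB to 20GB).
print(round(3.62 / 2.11, 2))  # 1.72
print(20 / 40)                # 0.5
```

Under this assumed schedule, the drafter with $N = 6$ handles only 91 of the 680 tokens, so the small refiner predicts most of the sequence.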