ActVAR: Activating Mixtures of Weights and Tokens for Efficient Visual Autoregressive Generation

Kaixin Zhang, Ruiqing Yang, Yuan Zhang, Shan You, Tao Huang

TL;DR

ActVAR addresses the computational bottleneck of Visual Autoregressive (VAR) models with dynamic dual sparsity: at each layer it activates only a subset of FFN experts and processes only the most informative tokens. It implements a Mixture-of-Experts FFN with a learnable router and a gated token selector, plus a reconstruction path that restores unselected tokens to preserve global context. A two-stage knowledge-distillation regime, supervised by a pretrained VAR teacher, aligns routing and gating with pretrained dependencies and achieves up to $21.2\%$ FLOPs savings on ImageNet $256\times256$ with minimal quality loss. The approach preserves full model capacity while adaptively allocating computation, offering practical acceleration for scalable autoregressive image generation.

Abstract

Visual Autoregressive (VAR) models enable efficient image generation via next-scale prediction but face escalating computational costs as sequence length grows. Existing static pruning methods degrade performance by permanently removing weights or tokens, disrupting pretrained dependencies. To address this, we propose ActVAR, a dynamic activation framework that introduces dual sparsity across model weights and token sequences to enhance efficiency without sacrificing capacity. ActVAR decomposes feedforward networks (FFNs) into lightweight expert sub-networks and employs a learnable router to dynamically select token-specific expert subsets based on content. Simultaneously, a gated token selector identifies high-update-potential tokens for computation while reconstructing unselected tokens to preserve global context and sequence alignment. Training employs a two-stage knowledge distillation strategy, where the original VAR model supervises the learning of routing and gating policies to align with pretrained knowledge. Experiments on the ImageNet $256\times 256$ benchmark demonstrate that ActVAR achieves up to $21.2\%$ FLOPs reduction with minimal performance degradation.
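
The expert decomposition and routing described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the FFN is split by partitioning its hidden units into equal groups, that the FFN has no bias and uses ReLU, and that the outputs of the selected experts are summed without weighting by the router probabilities. The names `split_ffn`, `moe_ffn`, and `router_W` are hypothetical.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def matvec(W, x):
    # W is a list of rows; each row has the same length as x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def split_ffn(W1, W2, num_experts):
    """Partition the hidden units of an FFN (W1: hidden x d, W2: d x hidden)
    into equally sized experts. Assumption: the split is along the hidden axis."""
    hidden = len(W1)
    assert hidden % num_experts == 0, "hidden size must divide evenly"
    size = hidden // num_experts
    experts = []
    for e in range(num_experts):
        idx = range(e * size, (e + 1) * size)
        W1_e = [W1[i] for i in idx]                   # rows of W1 for this group
        W2_e = [[row[i] for i in idx] for row in W2]  # matching columns of W2
        experts.append((W1_e, W2_e))
    return experts

def moe_ffn(x, experts, router_W, top_k):
    """Route one token to its top_k experts and sum their outputs."""
    probs = softmax(matvec(router_W, x))  # expert activation probabilities
    order = sorted(range(len(experts)), key=lambda e: probs[e], reverse=True)
    active = order[:top_k]
    out = [0.0] * len(x)
    for e in active:
        W1_e, W2_e = experts[e]
        y = matvec(W2_e, relu(matvec(W1_e, x)))
        out = [o + yi for o, yi in zip(out, y)]
    return out, active

# Toy usage with random weights (illustrative sizes only).
random.seed(0)
d_model, hidden, num_experts = 4, 8, 4
W1 = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(hidden)]
W2 = [[random.uniform(-1, 1) for _ in range(hidden)] for _ in range(d_model)]
router_W = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(num_experts)]
x = [random.uniform(-1, 1) for _ in range(d_model)]

experts = split_ffn(W1, W2, num_experts)
dense_out = matvec(W2, relu(matvec(W1, x)))
full_out, _ = moe_ffn(x, experts, router_W, top_k=num_experts)
sparse_out, active = moe_ffn(x, experts, router_W, top_k=2)
```

Under these assumptions, activating every expert reproduces the dense FFN output, which is one way to read the claim that capacity is retained rather than pruned away; choosing a smaller `top_k` only skips computation for that token.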

Paper Structure

This paper contains 24 sections, 22 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Comparison of conventional weight and token pruning methods with our proposed ActVAR. Conventional methods permanently remove weights and tokens, disrupting pretrained dependencies. ActVAR achieves the same efficiency without sacrificing capacity.
  • Figure 2: The pipeline of our ActVAR. For the input sequence $q_{m-1}$, the selector generates a binary indicator vector $I$. Based on $I$, a filtered input $\hat{q}_{m-1}$ is constructed and passed through the attention and FFN layers. In the FFN, the router assigns each token to a subset of experts by predicting expert activation probabilities. Finally, the output $\hat{q}_m$ is reconstructed into the complete sequence $q_m$ using the indicator vector $I$, maintaining alignment with the original scale.
  • Figure 3: The number of tokens generated by VAR models increases rapidly with each step.
  • Figure 4: Distribution of computation per transformer block in VAR models.
  • Figure 5: Visualization of dynamic weight activations. The FFN is partitioned into 16 expert networks. At the $16\times 16$ scale, the four images show the top-3 experts activated for each token at different steps. The activation patterns vary significantly across steps, as tokens at different steps prefer different experts.
  • ...and 4 more figures
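
The select, compute, and reconstruct flow described in the Figure 2 caption can be sketched as below. This is a hedged illustration, not the paper's code: the per-token gate scores are taken as given, the selector keeps a fixed fraction of tokens, and unselected tokens are assumed to be carried over unchanged during reconstruction (the caption does not say how they are restored). The names `select_tokens`, `sparse_layer`, `keep_ratio`, and `block` are hypothetical.

```python
def select_tokens(scores, keep_ratio):
    """Build the binary indicator I: 1 for the highest-scoring tokens, else 0."""
    n_keep = max(1, int(keep_ratio * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:n_keep])
    return [1 if i in keep else 0 for i in range(len(scores))]

def sparse_layer(q_prev, indicator, block):
    """Run `block` on selected tokens only, then rebuild the full sequence."""
    # Filtered input: only tokens with indicator 1, in their original order.
    q_hat_prev = [tok for tok, keep in zip(q_prev, indicator) if keep]
    # `block` stands in for the attention + FFN computation of one layer.
    q_hat = block(q_hat_prev)
    # Reconstruction: processed tokens return to their original positions;
    # unselected tokens are copied through unchanged (an assumption).
    processed = iter(q_hat)
    return [next(processed) if keep else tok for tok, keep in zip(q_prev, indicator)]

# Toy usage: four 2-dimensional tokens and a stand-in block that doubles values.
q_prev = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
gate_scores = [0.9, 0.1, 0.8, 0.2]
indicator = select_tokens(gate_scores, keep_ratio=0.5)
q_m = sparse_layer(q_prev, indicator, lambda toks: [[2 * v for v in t] for t in toks])
```

Because the output has the same length and ordering as the input, the next layer or scale sees an aligned sequence even though only half of the tokens were processed here.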