Progressive Supernet Training for Efficient Visual Autoregressive Modeling

Xiaoyue Chen, Yuling Shi, Kaiyuan Li, Huandong Wang, Yong Li, Xiaodong Gu, Xinlei Chen, Mingbao Lin

TL;DR

This work tackles the memory bottleneck in visual autoregressive modeling by introducing VARiant, a single unified supernet that supports runtime depth switching across scales without duplicating models. It exploits a scale-depth asymmetric dependency: early scales are processed at full depth, while later scales run on lightweight subnets that share weights with the full network. A progressive three-phase training strategy dynamically allocates gradient flow to avoid optimization conflicts, breaking the Pareto frontier observed in fixed-ratio training. On ImageNet (256×256), VARiant-d16 and VARiant-d8 stay close to the pretrained VAR-d30 (FID 2.05 and 2.12 versus 1.95) while cutting memory by 40–65%, and VARiant-d2 reaches a 3.5× speedup and 80% memory reduction at a moderate quality cost (FID 2.97). All of these configurations live in a single deployable model with zero-cost depth switching, giving deployment options that range from high quality to extreme efficiency.

Abstract

Visual Auto-Regressive (VAR) models significantly reduce inference steps through the "next-scale" prediction paradigm. However, progressive multi-scale generation incurs substantial memory overhead due to cumulative KV caching, limiting practical deployment. We observe a scale-depth asymmetric dependency in VAR: early scales are extremely sensitive to network depth, while later scales remain robust to depth reduction. Inspired by this, we propose VARiant: by equidistant sampling, we select multiple subnets ranging from 16 to 2 layers from the original 30-layer VAR-d30 network. Early scales are processed by the full network, while later scales use a subnet. The subnets and the full network share weights, enabling flexible depth adjustment within a single model. However, weight sharing between the subnets and the full network can lead to optimization conflicts. To address this, we propose a progressive training strategy that breaks through the Pareto frontier of generation quality for both the subnets and the full network under fixed-ratio training, achieving joint optimality. Experiments on ImageNet demonstrate that, compared to the pretrained VAR-d30 (FID 1.95), VARiant-d16 and VARiant-d8 achieve nearly equivalent quality (FID 2.05/2.12) while reducing memory consumption by 40–65%. VARiant-d2 achieves a 3.5× speedup and 80% memory reduction at a moderate quality cost (FID 2.97). In terms of deployment, VARiant's single-model architecture supports zero-cost runtime depth switching and provides flexible deployment options from high quality to extreme efficiency, catering to diverse application scenarios.
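
The depth selection described in the abstract can be illustrated with a small Python sketch. It is not the authors' code: the function names `sample_subnet_layers` and `layers_for_scale` are hypothetical, and the exact sampling rule (evenly spaced indices that always keep the first and last layer) is an assumption, since the abstract only says "equidistant sampling".

```python
def sample_subnet_layers(total_layers: int, subnet_depth: int) -> list[int]:
    """Pick `subnet_depth` evenly spaced layer indices out of `total_layers`.

    Assumption: the first and last layers are always kept. The paper may
    use a different equidistant rule.
    """
    if not 2 <= subnet_depth <= total_layers:
        raise ValueError("subnet_depth must be in [2, total_layers]")
    step = (total_layers - 1) / (subnet_depth - 1)
    return [round(i * step) for i in range(subnet_depth)]


def layers_for_scale(
    scale_idx: int,
    num_early_scales: int,
    total_layers: int,
    subnet_layers: list[int],
) -> list[int]:
    """Return the layer indices to run at a given scale.

    Early scales (index < num_early_scales) use every layer of the full
    network; later scales reuse only the shared subnet layers.
    """
    if scale_idx < num_early_scales:
        return list(range(total_layers))
    return subnet_layers


# A 16-layer subnet drawn from the 30-layer network (VAR-d30).
subnet_d16 = sample_subnet_layers(total_layers=30, subnet_depth=16)
```

Because the subnet is only a list of indices into the same layer stack, switching depth at runtime amounts to choosing a different index list, which is one way to read the "zero-cost depth switching" claim.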

Paper Structure

This paper contains 31 sections, 5 equations, 7 figures, 8 tables, 2 algorithms.

Figures (7)

  • Figure 1: VARiant inference and training framework.
  • Figure 2: Fixed-ratio training exhibits (a) Pareto trade-offs, (b) optimization conflicts at extreme ratios, and (c) time-varying optimal ratios, motivating our progressive training strategy.
  • Figure 3: Progressive training strategy. (a) Dynamic sampling ratio schedule across three training phases. (b) Gradient source analysis showing the transition from joint optimization to subnet-focused refinement through a stable gradient bridge.
  • Figure 4: Visual quality comparison across different depth configurations. All configurations maintain high visual quality with significant memory reduction and inference speedup.
  • Figure 5: Configuration parameter analysis. (a) Impact of subnet depth $D$ (fixed $N$=6): quality vs. memory trade-off. (b) Impact of early-scale count $N$ (fixed $D$=16): diminishing returns with increasing $N$. (c) Configuration space: colored trajectories for different $D$ values, marker size indicates $N$. Red star: recommended configuration ($D$=16, $N$=7) with FID 2.00 and 36% memory reduction.
  • ...and 2 more figures
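
The memory trade-off between subnet depth $D$ and early-scale count $N$ in Figure 5 can be approximated with a back-of-envelope KV-cache count. The sketch below rests on assumptions that do not come from this page: the ten scale side lengths are those commonly used for VAR at 256×256, layers outside the subnet are assumed to cache keys and values only for early-scale tokens, and activations and weights are ignored. The function names are hypothetical.

```python
# Assumed scale side lengths for VAR at 256x256 (not stated on this page).
PATCH_NUMS = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)


def kv_cache_units(total_layers, subnet_depth, num_early_scales,
                   patch_nums=PATCH_NUMS):
    """Count cached (layer, token) pairs, a proxy for KV-cache memory."""
    tokens = [p * p for p in patch_nums]
    all_tokens = sum(tokens)
    early_tokens = sum(tokens[:num_early_scales])
    # Subnet layers run at every scale, so they cache every token.
    shared = subnet_depth * all_tokens
    # Remaining layers run only at early scales (assumption).
    full_only = (total_layers - subnet_depth) * early_tokens
    return shared + full_only


def kv_reduction(subnet_depth, num_early_scales, total_layers=30):
    """Fractional KV-cache saving relative to running all layers everywhere."""
    baseline = kv_cache_units(total_layers, total_layers, num_early_scales)
    variant = kv_cache_units(total_layers, subnet_depth, num_early_scales)
    return 1 - variant / baseline


# Recommended configuration from the Figure 5 caption: D=16, N=7.
print(f"{kv_reduction(16, 7):.1%}")  # prints 36.0%
```

Under these assumptions the estimate for $D$=16, $N$=7 lands near the 36% memory reduction quoted in the Figure 5 caption, but the paper's own accounting may differ.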