
AlignVAR: Towards Globally Consistent Visual Autoregression for Image Super-Resolution

Cencen Liu, Dongyang Zhang, Wen Yin, Jielei Wang, Tianyu Li, Ji Guo, Wenbo Jiang, Guoqing Wang, Guoming Lu

TL;DR

AlignVAR is a globally consistent visual autoregressive framework tailored for image super-resolution (ISR). It has two key components: Spatial Consistency Autoregression (SCA), which applies an adaptive mask to reweight attention toward structurally correlated regions, mitigating excessive locality and enhancing long-range dependencies; and Hierarchical Consistency Constraint (HCC), which augments residual learning with full reconstruction supervision at each scale.

Abstract

Visual autoregressive (VAR) models have recently emerged as a promising alternative for image generation, offering stable training, non-iterative inference, and high-fidelity synthesis through next-scale prediction. This encourages the exploration of VAR for image super-resolution (ISR), yet its application remains underexplored and faces two critical challenges: locality-biased attention, which fragments spatial structures, and residual-only supervision, which accumulates errors across scales. Both severely compromise the global consistency of reconstructed images. To address these issues, we propose AlignVAR, a globally consistent visual autoregressive framework tailored for ISR, featuring two key components: (1) Spatial Consistency Autoregression (SCA), which applies an adaptive mask to reweight attention toward structurally correlated regions, thereby mitigating excessive locality and enhancing long-range dependencies; and (2) Hierarchical Consistency Constraint (HCC), which augments residual learning with full reconstruction supervision at each scale, exposing accumulated deviations early and stabilizing the coarse-to-fine refinement process. Extensive experiments demonstrate that AlignVAR consistently enhances structural coherence and perceptual fidelity over existing generative methods, while delivering over 10x faster inference with nearly 50% fewer parameters than leading diffusion-based approaches, establishing a new paradigm for efficient ISR.
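
The contrast between residual-only supervision and full reconstruction supervision can be illustrated with a toy calculation. The sketch below is not the paper's loss: the function name `hcc_loss`, the weight `full_weight`, the use of mean squared error, and the assumption that every scale's residual has already been resampled to a common resolution are all hypothetical choices made for illustration. It only shows why a small per-scale bias that looks harmless under a residual loss becomes visible once the running reconstruction is supervised at each scale.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def hcc_loss(pred_residuals, true_residuals, full_weight=1.0):
    """Residual loss plus a full-reconstruction loss at every scale.

    Both arguments are lists of per-scale residuals, assumed (for this
    sketch only) to be already resampled to a common resolution.
    Returns (total, per-scale residual losses, per-scale full losses).
    """
    n = len(true_residuals[0])
    pred_full = [0.0] * n  # running reconstruction from predicted residuals
    true_full = [0.0] * n  # running reconstruction from target residuals
    res_losses, full_losses = [], []
    for pred_r, true_r in zip(pred_residuals, true_residuals):
        pred_full = [f + r for f, r in zip(pred_full, pred_r)]
        true_full = [f + r for f, r in zip(true_full, true_r)]
        res_losses.append(mse(pred_r, true_r))
        full_losses.append(mse(pred_full, true_full))
    total = sum(res_losses) + full_weight * sum(full_losses)
    return total, res_losses, full_losses


# Toy data: every scale's predicted residual is off by the same small bias.
true_residuals = [[1.0, 2.0], [0.5, -0.5], [0.25, 0.25]]
pred_residuals = [[x + 0.1 for x in r] for r in true_residuals]
total, res_losses, full_losses = hcc_loss(pred_residuals, true_residuals)
```

In this toy setup the residual loss is the same at every scale, while the full-reconstruction loss grows from scale to scale because the biases add up. That growth is the kind of accumulated deviation the abstract says HCC exposes early.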

Paper Structure (32 sections, 13 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Comparison between VARSR and AlignVAR. AlignVAR enhances VAR by introducing an adaptive consistency mask for intra-scale modeling and full reconstruction supervision for inter-scale alignment.
  • Figure 2: Comparison of attention distribution. Visualization of attention maps for VARSR and AlignVAR shows that VARSR exhibits highly localized attention concentrated in nearby regions, whereas AlignVAR captures broader contextual dependencies through the proposed Spatial Consistency Autoregression (SCA), thereby enhancing spatial coherence within each scale.
  • Figure 3: Spatial inconsistency results in texture discontinuities and structural distortions.
  • Figure 4: Hierarchical inconsistency results in color shifts and structural misalignment.
  • Figure 5: Overall architecture of the proposed AlignVAR. AlignVAR comprises two complementary components: a Spatial Consistency Autoregression (SCA) that performs scale-wise prediction and reweights intra-scale features using adaptive masks, and a Hierarchical Consistency Constraint (HCC) that jointly supervises residual and full representations to recalibrate inter-scale dependencies.
  • ...and 5 more figures
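
The captions for Figures 2 and 5 describe SCA as reweighting intra-scale attention with adaptive masks. The summary above does not say how the mask is computed, so the following is only a sketch under stated assumptions: the mask is modeled as an additive bias on the attention logits, proportional to the cosine similarity between hypothetical per-token "structure features". The names `masked_attention_weights`, `query_feat`, `key_feats`, and `strength` are illustrative and do not come from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def masked_attention_weights(query, keys, query_feat, key_feats, strength=2.0):
    """Attention weights for one query, without and with a similarity bias.

    The bias stands in for an adaptive mask (an assumption of this sketch):
    keys whose structure feature resembles the query's get a larger logit.
    """
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    bias = [strength * cosine(query_feat, f) for f in key_feats]
    plain = softmax(logits)
    reweighted = softmax([l + b for l, b in zip(logits, bias)])
    return plain, reweighted


# Toy data: key 0 plays a nearby token with a high logit but unrelated
# structure; key 2 plays a distant token that shares the query's structure.
query = [1.0, 0.0]
keys = [[2.0, 0.0], [0.5, 0.0], [1.0, 0.0]]
query_feat = [1.0, 0.0]
key_feats = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
plain, reweighted = masked_attention_weights(query, keys, query_feat, key_feats)
```

Without the bias, most of the weight goes to key 0. With it, key 2 receives the largest weight, which mirrors the shift from localized attention to structurally correlated regions that Figure 2 describes.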