FVAR: Visual Autoregressive Modeling via Next Focus Prediction

Xiaofan Li, Chenming Wu, Yanpeng Sun, Jiaming Zhou, Delin Qu, Yansong Qu, Weihao Bo, Haibao Yu, Dingkang Liang

TL;DR

FVAR tackles aliasing in visual autoregressive generation by replacing next-scale prediction with a next-focus paradigm that mirrors camera focusing. It introduces a physics-based progressive refocusing pyramid built from defocus PSF kernels, plus dual-path tokenization and a High-Frequency Residual Teacher that is distilled into a vanilla VAR deployment network. The pyramid supplies alias-free low-frequency content and informative high-frequency residuals, and Alias-Gate Cross-Attention (AG-XAttn) selectively transfers alias information during training. On ImageNet at multiple resolutions, FVAR reduces jaggies and moiré while preserving fine details and text readability, without adding inference cost.

Abstract

Visual autoregressive models achieve remarkable generation quality through next-scale prediction across multi-scale token pyramids. However, the conventional method builds these pyramids with uniform-scale downsampling, which causes aliasing artifacts that compromise fine details and introduce unwanted jaggies and moiré patterns. To tackle this issue, we present FVAR, which reframes the paradigm from next-scale prediction to next-focus prediction, mimicking the natural process of camera focusing from blur to clarity. Our approach introduces three key innovations: 1) a Next-Focus Prediction Paradigm that transforms multi-scale autoregression by progressively reducing blur rather than simply downsampling; 2) Progressive Refocusing Pyramid Construction that uses physics-consistent defocus kernels to build clean, alias-free multi-scale representations; and 3) High-Frequency Residual Learning that employs a specialized residual teacher network to incorporate alias information during training while keeping deployment simple. Specifically, we construct optical low-pass views using defocus point spread function (PSF) kernels with decreasing radius, creating smooth blur-to-clarity transitions that eliminate aliasing at its source. To further enhance detail generation, we introduce a High-Frequency Residual Teacher that learns from both clean structure and alias residuals, and we distill this knowledge into a vanilla VAR deployment network for seamless inference. Extensive experiments on ImageNet demonstrate that FVAR substantially reduces aliasing artifacts, improves fine detail preservation, and enhances text readability, achieving superior performance while remaining fully compatible with existing VAR frameworks.
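
The idea of building low-pass views with defocus PSF kernels of decreasing radius can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the authors' implementation: the pillbox (uniform disk) kernel as the defocus PSF, the helper names disk_psf, blur, and refocus_pyramid, the reflect padding, and the radius schedule. The sketch also omits the resampling and tokenization that the abstract does not specify.

```python
import numpy as np


def disk_psf(radius: float) -> np.ndarray:
    """Normalized pillbox kernel, used here as an idealized defocus PSF (assumption)."""
    if radius <= 0:
        return np.ones((1, 1))  # zero radius: identity kernel, i.e. fully in focus
    half = int(np.ceil(radius))
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    kernel = (x**2 + y**2 <= radius**2).astype(np.float64)
    return kernel / kernel.sum()  # weights sum to 1, so overall brightness is kept


def blur(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Same-size 2D filtering of a grayscale image with reflect padding."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="reflect")
    out = np.zeros_like(image, dtype=np.float64)
    # Accumulate shifted copies of the image, one per kernel tap.
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return out


def refocus_pyramid(image: np.ndarray, radii: list[float]) -> list[np.ndarray]:
    """One low-pass view per radius; decreasing radii go from blurry to sharp."""
    return [blur(image, disk_psf(r)) for r in radii]


# Hypothetical radius schedule: strongest blur first, sharp image last.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
views = refocus_pyramid(image, [4.0, 2.0, 1.0, 0.0])
```

In this sketch the last view equals the input image, and each earlier view is a progressively stronger low-pass version of it, which is the blur-to-clarity ordering the abstract describes.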

Paper Structure

This paper contains 27 sections, 11 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: FVAR achieves superior image generation quality. Our method generates images with significantly reduced aliasing artifacts (jaggies, moiré patterns) while preserving fine details and text readability compared to standard VAR. The progressive refocusing paradigm enables clean multi-scale representations that lead to sharper, more realistic results. Results shown here are from models trained on additional large-scale datasets (see supplementary material for details).
  • Figure 2: Progressive Refocusing vs. Uniform Downsampling. Our method shifts the paradigm from "next-scale prediction" to "next-focus prediction." (Left) Standard VAR uses uniform downsampling, introducing aliasing artifacts from coarse to fine scales. (Right) Our proposed FVAR employs progressive refocusing with decreasing PSF radius, mimicking camera focusing from blur to clarity. This physics-consistent approach eliminates aliasing at the source while preserving fine details through dual-path tokenization.
  • Figure 3: High-Frequency Residual Teacher Training Architecture. Our approach employs dual networks during training: the High-Frequency Residual Teacher (top) processes both structure tokens $r_k$ and alias tokens $a_k$ through Alias-Gate Cross-Attention, while the Deployment Network (bottom) only uses structure tokens to maintain vanilla VAR compatibility. Residual knowledge transfer enables the deployment network to benefit from high-frequency information during training while ensuring zero inference overhead.
  • Figure 4: Visual quality comparison between VAR and FVAR. Results are shown at 1024$\times$1024 resolution to compare spatial hierarchy and high-frequency detail. The first row shows image generation, and the second row shows inpainting and outpainting (solid red boxes indicate input regions). In each group, VAR is on the left and FVAR is on the right. Dashed red boxes highlight key regions of interest.