
FARMER: Flow AutoRegressive Transformer over Pixels

Guangting Zheng, Qinyu Zhao, Tao Yang, Fei Xiao, Zhijie Lin, Jie Wu, Jiajun Deng, Yanyong Zhang, Rui Zhu

TL;DR

FARMER addresses pixel-space image generation by unifying Normalizing Flows and Autoregressive models into an end-to-end framework that preserves exact likelihoods $p_{data}(x)$ while leveraging AR expressivity. It maps images to latent sequences through an Autoregressive Flow and models their distribution with a Gaussian Mixture autoregressor, augmented by a self-supervised dimension reduction that separates informative $Z^I$ from redundant $Z^R$ channels. A one-step distillation method accelerates inference and a resampling-based classifier-free guidance (CFG) enhances generation quality, achieving competitive results on ImageNet-256 without quantizing pixels. This approach advances pixel-space generation by combining tractable likelihoods with powerful autoregressive modeling and scalable training, offering practical benefits for high-fidelity image synthesis and exact density estimation.

Abstract

Directly modeling the explicit likelihood of the raw data distribution is a key topic in machine learning, and autoregressive modeling of this kind underlies the scaling successes of Large Language Models. However, continuous AR modeling over visual pixel data suffers from extremely long sequences and high-dimensional spaces. In this paper, we present FARMER, a novel end-to-end generative framework that unifies Normalizing Flows (NF) and Autoregressive (AR) models for tractable likelihood estimation and high-quality image synthesis directly from raw pixels. FARMER employs an invertible autoregressive flow to transform images into latent sequences, whose distribution is modeled implicitly by an autoregressive model. To address the redundancy and complexity in pixel-level modeling, we propose a self-supervised dimension reduction scheme that partitions NF latent channels into informative and redundant groups, enabling more effective and efficient AR modeling. Furthermore, we design a one-step distillation scheme to significantly accelerate inference speed and introduce a resampling-based classifier-free guidance algorithm to boost image generation quality. Extensive experiments demonstrate that FARMER achieves competitive performance compared to existing pixel-based generative models while providing exact likelihoods and scalable training.
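
As a reading aid, the exact-likelihood claim can be written with the standard change-of-variables identity for normalizing flows. The notation here is an assumption, not taken from the paper: $f$ denotes the invertible autoregressive flow, $z = f(x)$ the latent sequence of $N$ tokens, $p_\theta$ the model density, and $K$ the number of mixture components. The isotropic component covariance is likewise only an illustrative choice.

$$\log p_\theta(x) = \log p_{AR}(z) + \log\left|\det \frac{\partial f(x)}{\partial x}\right|, \qquad z = f(x)$$

$$\log p_{AR}(z) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_{i,k}(z_{<i})\, \mathcal{N}\!\left(z_i;\ \mu_{i,k}(z_{<i}),\ \sigma_{i,k}^{2}(z_{<i})\, I\right)$$

Under this reading, both terms are computable in closed form, so the flow and the Gaussian Mixture autoregressor can be trained jointly by maximizing $\log p_\theta(x)$.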

Paper Structure

This paper contains 21 sections, 25 equations, 8 figures, 6 tables, 2 algorithms.

Figures (8)

  • Figure 1: Autoregressive (AR) models offer strong expressivity but struggle with pixel modeling and sampling due to the long sequences required for high-resolution images. Normalizing flows (NFs) employ invertible mappings to transform complex image distributions to a standard Gaussian, but the substantial gap between the two distributions leads to degraded sampling quality. FARMER unifies NF and AR within a single framework, using the NF component to transform images into latent sequences, whose distribution is implicitly modeled by the AR component for easier modeling and controllable sampling. Furthermore, FARMER adopts a self-supervised dimension reduction method to partition NF latent channels into distinct groups, making AR modeling feasible and scalable.
  • Figure 2: Overview of FARMER. Left: FARMER consists of an autoregressive flow (AF) and an autoregressive (AR) model. The AF maps image patches to latent sequences, while the AR model predicts Gaussian Mixture Models (GMMs) conditioned on these latents, and the likelihood is optimized end-to-end. Middle: each AF block performs an invertible next-token transformation of the input sequence to obtain a new sequence. Right: the AR model splits latent channels into informative and redundant groups, modeling each informative token’s likelihood via a GMM conditioned on its previous tokens, and the redundant tokens jointly via a shared GMM conditioned on all informative tokens. This separation enables disentangling structural and detailed information.
  • Figure 3: One-Step Distillation. (a) The autoregressive flow (AF) reverse process reconstructs tokens sequentially, conditioning each token on previous ones, which leads to slow inference. (b) Our method distills a one-step student reverse path from the frozen teacher forward path in an end-to-end manner, approximating the reverse process of each AF block by the corresponding student AF block’s forward process, thereby enabling a $22\times$ faster AF reverse process and a $4\times$ overall inference speed-up.
  • Figure 4: Qualitative Results. Images generated by FARMER on ImageNet 256x256.
  • Figure 5: Qualitative Comparison. Images of class 0 in ImageNet generated by FARMER, MAR, and DiT.
  • ...and 3 more figures
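
The invertible next-token transformation described in Figure 2 and the sequential reverse pass described in Figure 3(a) can be illustrated with a toy affine autoregressive flow block. This is a minimal sketch under stated assumptions, not FARMER's implementation: the affine form of the transformation, the mean-of-previous-tokens conditioner standing in for a causal Transformer, and the names `causal_context`, `conditioner`, `af_forward`, `af_reverse`, `W_mu`, and `W_alpha` are all hypothetical.

```python
import numpy as np


def causal_context(x):
    """Mean of strictly previous tokens; row 0 is all zeros (no context)."""
    T, _ = x.shape
    csum = np.cumsum(x, axis=0)
    ctx = np.zeros_like(x)
    ctx[1:] = csum[:-1] / np.arange(1, T)[:, None]
    return ctx


def conditioner(ctx, W_mu, W_alpha):
    """Toy stand-in for a causal network: shift and log-scale per token."""
    mu = np.tanh(ctx @ W_mu)
    alpha = np.tanh(ctx @ W_alpha)
    return mu, alpha


def af_forward(x, W_mu, W_alpha):
    """Forward pass: every token is transformed in parallel.

    z_t = (x_t - mu_t(x_<t)) * exp(-alpha_t(x_<t))
    The Jacobian is lower triangular, so log|det| = -sum(alpha).
    """
    mu, alpha = conditioner(causal_context(x), W_mu, W_alpha)
    z = (x - mu) * np.exp(-alpha)
    logdet = -alpha.sum()
    return z, logdet


def af_reverse(z, W_mu, W_alpha):
    """Reverse pass: token t needs the already-reconstructed x_<t,
    so the loop runs for T sequential steps."""
    T, D = z.shape
    x = np.zeros_like(z)
    running_sum = np.zeros(D)
    for t in range(T):
        ctx = running_sum / t if t > 0 else np.zeros(D)
        mu, alpha = conditioner(ctx[None, :], W_mu, W_alpha)
        x[t] = z[t] * np.exp(alpha[0]) + mu[0]
        running_sum += x[t]
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(6, 4))            # 6 tokens, 4 channels each
    W_mu = 0.5 * rng.normal(size=(4, 4))
    W_alpha = 0.5 * rng.normal(size=(4, 4))
    z, logdet = af_forward(x, W_mu, W_alpha)
    x_rec = af_reverse(z, W_mu, W_alpha)
    print("max reconstruction error:", np.abs(x - x_rec).max())
```

In this sketch the forward direction needs a single parallel pass while the reverse direction needs one step per token, which is the asymmetry that the one-step distillation in Figure 3(b) is described as removing.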