HOFAR: High-Order Augmentation of Flow Autoregressive Transformers

Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Mingda Wan

TL;DR

Problem: improve fidelity and long-range coherence in flow-based autoregressive image generation. Approach: introduce HOFAR to incorporate high-order trajectory supervision into FlowAR, accompanied by theoretical efficiency guarantees and empirical validation. Contributions: a formal framework for high-order dynamics, complexity bound $O(k m n^4 d^2)$, and CIFAR-10 experiments showing improved realism and coherence over FlowAR baselines. Significance: enables more realistic and coherent generation at scale and paves the way for multi-modal extensions and broader applicability of high-order trajectory modeling in generative systems.

Abstract

Flow Matching and Transformer architectures have demonstrated remarkable performance in image generation tasks, with recent work FlowAR [Ren et al., 2024] synergistically integrating both paradigms to advance synthesis fidelity. However, current FlowAR implementations remain constrained by first-order trajectory modeling during the generation process. This paper introduces a novel framework that systematically enhances flow autoregressive transformers through high-order supervision. We provide theoretical analysis and empirical evaluation showing that our High-Order FlowAR (HOFAR) demonstrates measurable improvements in generation quality compared to baseline models. The proposed approach advances the understanding of flow-based autoregressive modeling by introducing a systematic framework for analyzing trajectory dynamics through high-order expansion.
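
To make "high-order supervision" concrete, the following is a minimal sketch, not the authors' implementation. It assumes a trigonometric interpolation path $x_t = \sin(\pi t/2)\, x_1 + \cos(\pi t/2)\, x_0$, chosen here only because its second time derivative is nonzero. The function names `trig_path_targets` and `high_order_loss` and the `weight` parameter are hypothetical. The sketch computes the first-order (velocity) and second-order (acceleration) targets along the path and combines their regression errors into one loss.

```python
import numpy as np


def trig_path_targets(x0, x1, t):
    """Return (x_t, velocity, acceleration) on an assumed trigonometric path.

    Assumption (not from the paper): x_t = sin(pi*t/2) * x1 + cos(pi*t/2) * x0,
    so x_t equals x0 (noise) at t = 0 and x1 (data) at t = 1.
    """
    w = np.pi / 2.0
    a, b = np.sin(w * t), np.cos(w * t)      # path coefficients
    da, db = w * np.cos(w * t), -w * np.sin(w * t)  # first derivatives in t
    dda, ddb = -(w ** 2) * a, -(w ** 2) * b  # second derivatives in t
    x_t = a * x1 + b * x0
    velocity = da * x1 + db * x0             # first-order target
    acceleration = dda * x1 + ddb * x0       # second-order target
    return x_t, velocity, acceleration


def high_order_loss(pred_v, pred_a, velocity, acceleration, weight=1.0):
    """Mean squared error on velocity plus weighted MSE on acceleration.

    `weight` is a hypothetical knob balancing the two terms.
    """
    first_order = np.mean((pred_v - velocity) ** 2)
    second_order = np.mean((pred_a - acceleration) ** 2)
    return first_order + weight * second_order


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.standard_normal((4, 4))  # stand-in for a noise sample
    x1 = rng.standard_normal((4, 4))  # stand-in for a data sample
    x_t, v, acc = trig_path_targets(x0, x1, t=0.3)
    # Placeholder predictions (all zeros), standing in for a network output.
    print(high_order_loss(np.zeros_like(v), np.zeros_like(acc), v, acc))
```

A first-order method would regress only onto `velocity`. The second term is what a high-order variant adds, under the assumptions above. In a real model, the two predictions would come from the flow-matching network conditioned on the autoregressive Transformer's output.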

Paper Structure

This paper contains 21 sections, 5 theorems, 2 equations, 5 figures, 2 algorithms.

Key Result

Theorem 4.1

In accordance with Definition 3.6, the autoregressive Transformer architecture incorporates $m$ attention layers. The image input $x_{\mathrm{img}} \in \mathbb{R}^{n \times n \times c}$ is encoded with $n^2$ spatial units, $c$ channels, and a $d$-dimensional latent representation. Th

Figures (5)

  • Figure 1: Loss curve of FlowAR-small (Left), loss curve of FlowAR-large (Right) and loss curve of HOFAR (Bottom).
  • Figure 2: Comparison of $32 \times 32$ CIFAR-10 images generated by FlowAR-small (first four rows), FlowAR-large (middle four rows), and HOFAR (last four rows). For readability, higher-resolution versions of Figure 3, Figure 4, and Figure 5 are shown here.
  • Figure 3: 64 images of size $32 \times 32$ generated by FlowAR-small.
  • Figure 4: 64 images of size $32 \times 32$ generated by FlowAR-large.
  • Figure 5: 64 images of size $32 \times 32$ generated by HOFAR.

Theorems & Definitions (20)

  • Definition 3.1: Linear Downsampling Function
  • Definition 3.2: Multi-Scale Downsampling Tokenizer
  • Definition 3.3: Upsampling Function
  • Definition 3.4: Attention Layer
  • Definition 3.5: Feed Forward Layer
  • Definition 3.6: Autoregressive Transformer
  • Definition 3.7: Flow
  • Definition 3.8: MLP Layer
  • Definition 3.9: Layer Normalization Layer
  • Definition 3.10: Flow Matching Architecture
  • ...and 10 more