Neon: Negative Extrapolation From Self-Training Improves Image Generation

Sina Alemohammad, Zhangyang Wang, Richard G. Baraniuk

TL;DR

Neon addresses data scarcity in generative modeling by treating the degradation caused by self-training as a diagnostic signal and extrapolating away from it. After a brief self-training step on synthetic data, Neon forms a corrected model via $\theta_{\text{Neon}}=(1+w)\theta_r-w\theta_s$, where $\theta_r$ denotes the base model weights, $\theta_s$ the self-trained weights, and $w$ the extrapolation strength, so that the merge moves away from the degraded weights. The authors prove that mode-seeking samplers induce anti-alignment between synthetic and population gradients, which lets Neon reduce the true data risk under mild conditions, and they show that the method applies across diffusion, flow matching, autoregressive, and few-step models. Empirically, Neon delivers state-of-the-art or competitive Fréchet Inception Distance (FID) improvements on CIFAR-10, FFHQ, and ImageNet across multiple architectures, typically with under 1% and no more than roughly 3% extra training compute, and achieves a record FID of 1.02 on ImageNet-256 with xAR-L. The work demonstrates a simple, data-efficient post-processing technique that uses the degradation signal to improve sample quality and diversity in data-scarce regimes, with broad practical impact for large-scale image generation.
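
The merge formula above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: it assumes each checkpoint has been flattened into a plain dictionary mapping parameter names to lists of floats, and the names `neon_merge`, `theta_r` (base weights), `theta_s` (self-trained weights), and `Weights` are hypothetical choices made for this example.

```python
from typing import Dict, List

# Hypothetical flat checkpoint format: parameter name -> list of values.
Weights = Dict[str, List[float]]


def neon_merge(theta_r: Weights, theta_s: Weights, w: float) -> Weights:
    """Return (1 + w) * theta_r - w * theta_s, parameter by parameter."""
    if theta_r.keys() != theta_s.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged: Weights = {}
    for name, base in theta_r.items():
        tuned = theta_s[name]
        if len(base) != len(tuned):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Extrapolate away from the self-trained (degraded) values.
        merged[name] = [(1.0 + w) * b - w * t for b, t in zip(base, tuned)]
    return merged


if __name__ == "__main__":
    base = {"layer.weight": [1.0, 2.0], "layer.bias": [0.5]}
    self_trained = {"layer.weight": [1.5, 2.0], "layer.bias": [0.0]}
    print(neon_merge(base, self_trained, w=1.0))
```

Setting `w = 0` returns the base weights unchanged. Larger `w` pushes each parameter further along the direction from the self-trained weights back through the base weights, which is why the merge needs no additional training beyond the self-training step itself.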

Abstract

Scaling generative AI models is bottlenecked by the scarcity of high-quality training data. The ease of synthesizing from a generative model suggests using (unverified) synthetic data to augment a limited corpus of real data for the purpose of fine-tuning in the hope of improving performance. Unfortunately, however, the resulting positive feedback loop leads to model autophagy disorder (MAD, aka model collapse) that results in a rapid degradation in sample quality and/or diversity. In this paper, we introduce Neon (for Negative Extrapolation frOm self-traiNing), a new learning method that turns the degradation from self-training into a powerful signal for self-improvement. Given a base model, Neon first fine-tunes it on its own self-synthesized data but then, counterintuitively, reverses its gradient updates to extrapolate away from the degraded weights. We prove that Neon works because typical inference samplers that favor high-probability regions create a predictable anti-alignment between the synthetic and real data population gradients, which negative extrapolation corrects to better align the model with the true data distribution. Neon is remarkably easy to implement via a simple post-hoc merge that requires no new real data, works effectively with as few as 1k synthetic samples, and typically uses less than 1% additional training compute. We demonstrate Neon's universality across a range of architectures (diffusion, flow matching, autoregressive, and inductive moment matching models) and datasets (ImageNet, CIFAR-10, and FFHQ). In particular, on ImageNet 256x256, Neon elevates the xAR-L model to a new state-of-the-art FID of 1.02 with only 0.36% additional training compute. Code is available at https://github.com/VITA-Group/Neon
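
A short first-order calculation illustrates why reversing the self-training update can help. It is a simplified sketch and not the paper's theorem: it assumes self-training is a single plain gradient step of size $\eta$, and the symbols $R_d$ (risk on real data), $g_d=\nabla R_d(\theta_r)$, and $g_s$ (gradient of the synthetic-data loss at the base weights $\theta_r$) are notation introduced here for illustration only.

$$
\begin{aligned}
\theta_s &= \theta_r - \eta\, g_s, \\
\theta_{\text{Neon}} &= (1+w)\,\theta_r - w\,\theta_s = \theta_r + w\,(\theta_r-\theta_s) = \theta_r + w\eta\, g_s, \\
R_d(\theta_s) &\approx R_d(\theta_r) - \eta\,\langle g_d,\, g_s\rangle, \\
R_d(\theta_{\text{Neon}}) &\approx R_d(\theta_r) + w\eta\,\langle g_d,\, g_s\rangle .
\end{aligned}
$$

Under this approximation, when the two gradients are anti-aligned, $\langle g_d, g_s\rangle<0$, self-training raises the real-data risk while the Neon merge lowers it for small $w\eta>0$. Higher-order terms limit how large $w$ can usefully be.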

Paper Structure

This paper contains 74 sections, 15 theorems, 101 equations, 17 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Let $K:=H_d^{1/2}PH_d^{1/2}$ with spectral bounds $mI\preceq K\preceq MI$. Then the alignment $s=\langle r_d,\,P r_s\rangle$ obeys [...]. Consequently, a sufficient condition for $s<0$ is that the leading two terms on the right-hand side be negative. In particular, for $\cos\varphi<0$ and sufficiently small $\|\varepsilon\|_{H_d}$, [...]

Figures (17)

  • Figure 1: Good to great: Neon's state-of-the-art performance on ImageNet-256. Neon elevates a powerful baseline generative model (xAR-L, top row) to a new level of sharpness and realism (bottom row) with a simple post-hoc merge. This leap in quality, improving the Fréchet Inception Distance (FID) from 1.28 to a record-breaking 1.02, is accomplished with only 0.36% extra training compute.
  • Figure 2: Neon consistently improves FID with minimal self-training overhead. Minimum FID (optimized over extrapolation strength $w$) vs. self-training budget $\mathcal{B}$ (millions of images seen during fine-tuning on $\mathcal{S}$) for varying synthetic dataset sizes $|\mathcal{S}|$, on EDM-VP (CIFAR-10/FFHQ-64) and flow matching (CIFAR-10). Optimal gains use $\mathcal{B} \le 3$Mi ($<2\%$ of base model training compute for EDM; $<3\%$ for flow), confirming Neon's efficiency. At $\mathcal{B}=0$, FID reflects the base model (no Neon).
  • Figure 3: Neon trades precision for recall, yielding net FID improvement. For the EDM-VP model trained on CIFAR-10, we plot the FID, precision, and recall vs. negative extrapolation strength $w$ for various training budgets $\mathcal{B}$. In each case, $|\mathcal{S}| = 6$k.
  • Figure 4: Neon consistently improves autoregressive models across architectures and resolutions. We plot the minimum FID (optimized over merge weight $w$ and CFG scale $\gamma$) versus the fine-tuning budget $\mathcal{B}$ for various synthetic dataset sizes $|\mathcal{S}|$. From left: xAR-B and xAR-L on ImageNet-256 (with xAR-L achieving a state-of-the-art 1.02 FID), VAR-d16 on ImageNet-256, and VAR-d30 on ImageNet-512.
  • Figure 5: Optimal precision-recall trade-offs for VAR-d16 as a function of $w$ and $\gamma$. Left: Heatmaps for FID, precision, and recall on ImageNet-256 ($|\mathcal{S}|{=}750$k, $\mathcal{B}{=}1.25$Mi) from a grid search over $w$ and $\gamma$. The star marks the best FID $(w^*{\approx}1.0, \gamma^*{\approx}2.7)$ achieving FID 2.01, unreachable by either parameter alone. Right: Asymptotic precision-recall curves showing expanded behavioral range through joint tuning.
  • ...and 12 more figures

Theorems & Definitions (28)

  • Theorem 1: Anti-alignment under inference mismatch
  • Theorem B.1: One-step Neon improvement
  • proof
  • Remark B.2: No convexity needed: directional smoothness
  • Lemma B.3: First-order expansions of real and synthetic gradients
  • proof
  • Theorem B.4: Directional upper bound for $s$
  • proof
  • Corollary B.5: Natural-gradient geometry
  • Lemma B.6: Mode-seeking $\Rightarrow$ $\cos\varphi<0$ (first order)
  • ...and 18 more