Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment

Huayu Chen, Hang Su, Peize Sun, Jun Zhu

TL;DR

With just one epoch of fine-tuning on the pretraining dataset, CCA significantly enhances the guidance-free performance of all tested models, bringing it on par with guided sampling methods. This largely removes the need for guided sampling in AR visual generation and cuts the sampling cost by half.

Abstract

Classifier-Free Guidance (CFG) is a critical technique for enhancing the sample quality of visual generative models. However, in autoregressive (AR) multi-modal generation, CFG introduces design inconsistencies between language and visual content, contradicting the design philosophy of unifying different modalities for visual AR. Motivated by language model alignment methods, we propose Condition Contrastive Alignment (CCA) to facilitate guidance-free AR visual generation with high performance and analyze its theoretical connection with guided sampling methods. Unlike guidance methods that alter the sampling process to achieve the ideal sampling distribution, CCA directly fine-tunes pretrained models to fit the same distribution target. Experimental results show that CCA can significantly enhance the guidance-free performance of all tested models with just one epoch of fine-tuning (about 1% of pretraining epochs) on the pretraining dataset, on par with guided sampling methods. This largely removes the need for guided sampling in AR visual generation and cuts the sampling cost by half. Moreover, by adjusting training parameters, CCA can achieve trade-offs between sample diversity and fidelity similar to CFG. This experimentally confirms the strong theoretical connection between language-targeted alignment and visual-targeted guidance methods, unifying two previously independent research fields. Code and model weights: https://github.com/thu-ml/CCA.
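To make the "cuts the sampling cost by half" claim concrete, here is a minimal sketch of why guided sampling is roughly twice as expensive per token as guidance-free sampling. It assumes one common CFG convention for AR models, in which the sampling logits are the conditional logits extrapolated away from the unconditional ones by a scale `s` (so `s = 0` recovers plain conditional sampling). `CountingModel`, `guided_step`, and `guidance_free_step` are hypothetical names used only for illustration, not part of the released code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cfg_logits(cond_logits, uncond_logits, s):
    """Assumed CFG rule: l_c + s * (l_c - l_u); s = 0 means no guidance."""
    return [lc + s * (lc - lu) for lc, lu in zip(cond_logits, uncond_logits)]


class CountingModel:
    """Hypothetical stand-in for an AR model that counts forward passes."""

    def __init__(self):
        self.calls = 0

    def forward(self, condition):
        self.calls += 1
        # Toy logits over a 3-token vocabulary (made-up numbers).
        if condition is None:
            return [1.0, 1.0, 1.0]  # unconditional prediction
        return [2.0, 1.0, 0.0]      # conditional prediction


def guided_step(model, condition, s):
    """One CFG decoding step: two forward passes per generated token."""
    lc = model.forward(condition)
    lu = model.forward(None)
    return softmax(cfg_logits(lc, lu, s))


def guidance_free_step(model, condition):
    """One guidance-free decoding step: a single forward pass per token."""
    return softmax(model.forward(condition))
```

In this toy setting, a positive `s` sharpens the distribution toward the token the condition favors, at the price of the second forward pass. A model fine-tuned to match the guided distribution directly would only need `guidance_free_step`.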

Paper Structure

This paper contains 32 sections, 3 theorems, 28 equations, 9 figures, 4 tables.

Key Result

Theorem 3.1

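Let $r_\theta$ be a parameterized model which takes in an image-condition pair $({\bm{x}}, {\bm{c}})$ and outputs a scalar value $r_\theta({\bm{x}}, {\bm{c}})$. Consider the loss function $\mathcal{L}^{\text{NCE}}_\theta({\bm{x}}, {\bm{c}}) = -\mathbb{E}_{p({\bm{x}}, {\bm{c}})} \log \sigma(r_\theta({\bm{x}}, {\bm{c}})) - \mathbb{E}_{p({\bm{x}}) p({\bm{c}})} \log \sigma(-r_\theta({\bm{x}}, {\bm{c}}))$. Given unlimited model expressivity for $r_\theta$, the optimal solution for minimizing $\mathcal{L}^{\text{NCE}}_\theta$ satisfies $r_\theta^*({\bm{x}}, {\bm{c}}) = \log \frac{p({\bm{x}} | {\bm{c}})}{p({\bm{x}})}$.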
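The theorem can be checked numerically on a toy problem. The sketch below assumes the standard noise contrastive estimation objective, $-\mathbb{E}_{p(x,c)} \log \sigma(r) - \mathbb{E}_{p(x)p(c)} \log \sigma(-r)$, and a free table of values `r[i][j]` standing in for a model with unlimited expressivity. The 2x2 joint distribution is made up for illustration, and plain gradient descent is used only because it is easy to read.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Toy joint p(x, c) over 2 images and 2 conditions (made-up numbers).
joint = [[0.4, 0.1],
         [0.1, 0.4]]
p_x = [sum(row) for row in joint]                             # marginal p(x)
p_c = [sum(joint[i][j] for i in range(2)) for j in range(2)]  # marginal p(c)


def nce_loss(r):
    """Assumed NCE loss: positives from p(x, c), negatives from p(x) p(c)."""
    loss = 0.0
    for i in range(2):
        for j in range(2):
            loss -= joint[i][j] * math.log(sigmoid(r[i][j]))
            loss -= p_x[i] * p_c[j] * math.log(sigmoid(-r[i][j]))
    return loss


def fit(steps=5000, lr=1.0):
    """Minimize the loss over a free table r[i][j] by gradient descent."""
    r = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(steps):
        for i in range(2):
            for j in range(2):
                s = sigmoid(r[i][j])
                # d(loss)/dr = -p(x,c) * (1 - sigma(r)) + p(x) p(c) * sigma(r)
                grad = -joint[i][j] * (1.0 - s) + p_x[i] * p_c[j] * s
                r[i][j] -= lr * grad
    return r


r_fit = fit()
# Predicted optimum: log p(x|c) / p(x) = log p(x, c) / (p(x) p(c)).
r_star = [[math.log(joint[i][j] / (p_x[i] * p_c[j])) for j in range(2)]
          for i in range(2)]
```

Under these assumptions, the fitted table `r_fit` should agree with `r_star` up to numerical tolerance: the minimizer is the log-ratio between the conditional and the unconditional likelihood of the image.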

Figures (9)

  • Figure 1: CCA significantly improves guidance-free sample quality for AR visual generative models with just one epoch of fine-tuning on the pretraining dataset.
  • Figure 2: An overview of the CCA method.
  • Figure 3: CCA and CFG can similarly enhance the sample fidelity of AR visual models. The base models are LlamaGen-L (343M) and VAR-d24 (1.0B). We use $s=3.0$ for CFG, and $\beta=0.02, \lambda=10^4$ for CCA. Figure \ref{fig:picllamagen} and Figure \ref{fig:picvar} contain more examples.
  • Figure 4: CCA can achieve similar FID-IS trade-offs to CFG by adjusting training parameter $\lambda$.
  • Figure 5: The impact of training parameter $\lambda$ on the performance of CCA+CFG.
  • ...and 4 more figures

Theorems & Definitions (5)

  • Theorem 3.1: Noise Contrastive Estimation, proof in Appendix \ref{sec:proofs}
  • Theorem A.1: Noise Contrastive Estimation
  • proof
  • Theorem B.1
  • proof