Direct Ascent Synthesis: Revealing Hidden Generative Capabilities in Discriminative Models
Stanislav Fort, Jonathan Whitaker
TL;DR
Direct Ascent Synthesis (DAS) shows that pretrained discriminative models inherently encode rich generative capabilities. By optimizing across a multi-resolution decomposition of images to maximize CLIP-embedding similarity, DAS achieves training-free, high-quality synthesis that preserves natural image statistics (notably a $1/f^2$ spectrum) and avoids degenerate adversarial patterns. The method leverages simple priors, augmentations, and CLIP model ensembles to enable tasks such as text-to-image generation, style transfer, and reconstruction from embeddings, challenging the traditional dichotomy between discriminative and generative modeling. The work offers both practical benefits (reduced training and data needs) and theoretical implications for model interpretability, robustness, and the fundamental nature of visual representations.
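The $1/f^2$ claim refers to how image power falls off with spatial frequency. As a minimal sketch of how one might check this property on a generated image, the NumPy snippet below computes a radially averaged power spectrum and fits its log-log slope. The function names, the integer-radius binning, and the synthetic test image are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np

def radial_power_spectrum(image):
    """Radially averaged power spectrum of a square grayscale image."""
    n = image.shape[0]
    power = np.abs(np.fft.fft2(image)) ** 2
    freqs = np.fft.fftfreq(n) * n                 # integer frequencies (cycles per image)
    radius = np.hypot(freqs[:, None], freqs[None, :])
    bins = np.rint(radius).astype(int)            # nearest integer radial bin
    sums = np.bincount(bins.ravel(), weights=power.ravel())
    counts = np.bincount(bins.ravel())
    radii = np.arange(1, n // 2)                  # skip DC, stay below Nyquist
    return radii, sums[radii] / counts[radii]

def spectral_slope(image):
    """Least-squares slope of log(power) against log(frequency)."""
    radii, spectrum = radial_power_spectrum(image)
    slope, _ = np.polyfit(np.log(radii), np.log(spectrum), 1)
    return slope

def synthetic_power_law_image(n=64):
    """Hypothetical test image whose Fourier amplitude is 1/f (power 1/f^2)."""
    freqs = np.fft.fftfreq(n) * n
    radius = np.hypot(freqs[:, None], freqs[None, :])
    radius[0, 0] = 1.0                            # avoid division by zero at DC
    amplitude = 1.0 / radius
    amplitude[0, 0] = 0.0                         # zero-mean image
    # The amplitude is real and symmetric, so the inverse FFT is real
    # up to floating-point error.
    return np.real(np.fft.ifft2(amplitude))

slope = spectral_slope(synthetic_power_law_image(64))
```

Under these assumptions, a slope close to -2 is what a $1/f^2$ power spectrum would look like, while unstructured white noise would give a slope near 0.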
Abstract
We demonstrate that discriminative models inherently contain powerful generative capabilities, challenging the fundamental distinction between discriminative and generative architectures. Our method, Direct Ascent Synthesis (DAS), reveals these latent capabilities through multi-resolution optimization of CLIP model representations. While traditional inversion attempts produce adversarial patterns, DAS achieves high-quality image synthesis by decomposing optimization across multiple spatial scales ($1 \times 1$ to $224 \times 224$), requiring no additional training. This approach not only enables diverse applications, from text-to-image generation to style transfer, but also maintains natural image statistics ($1/f^2$ spectrum) and guides the generation away from non-robust adversarial patterns. Our results demonstrate that standard discriminative models encode substantially richer generative knowledge than previously recognized, providing new perspectives on model interpretability and the relationship between adversarial examples and natural image synthesis.
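To make the multi-resolution decomposition concrete, here is a minimal NumPy sketch in which an image is represented as a sum of components at several resolutions and all components are updated jointly by gradient ascent. Several parts are assumptions made for illustration: the CLIP similarity objective is replaced by a stand-in quadratic objective so the example runs without model weights, nearest-neighbour upsampling is used for simplicity, the image is 32x32 rather than 224x224, and all function names are hypothetical.

```python
import numpy as np

def upsample(component, size):
    """Nearest-neighbour upsampling of an (r, r, C) array to (size, size, C)."""
    r = component.shape[0]
    idx = (np.arange(size) * r) // size           # source pixel for each output pixel
    return component[idx][:, idx]

def compose(components, size):
    """The image is the sum of all components, each upsampled to full size."""
    return sum(upsample(c, size) for c in components)

def pool_gradient(grad_image, r):
    """Gradient w.r.t. an (r, r, C) component: sum the image gradient over
    the pixels that each coarse cell covers (the transpose of upsample)."""
    size = grad_image.shape[0]
    idx = (np.arange(size) * r) // size
    out = np.zeros((r, r, grad_image.shape[2]))
    np.add.at(out, (idx[:, None], idx[None, :]), grad_image)
    return out

def toy_objective(image, target):
    """Stand-in for embedding similarity (assumption): negative squared error."""
    return -0.5 * np.sum((image - target) ** 2)

def ascent_step(components, target, lr):
    """One gradient-ascent step applied to every resolution at once."""
    size = target.shape[0]
    grad_image = target - compose(components, size)   # d objective / d image
    return [c + lr * pool_gradient(grad_image, c.shape[0]) for c in components]

# Demo with small, hypothetical settings.
size = 32
resolutions = [1, 2, 4, 8, 16, 32]
rng = np.random.default_rng(0)
target = rng.random((size, size, 3))
components = [np.zeros((r, r, 3)) for r in resolutions]
for _ in range(200):
    components = ascent_step(components, target, lr=1e-3)
```

In this sketch a single step moves coarse structure (the low-resolution components) and fine detail (the high-resolution components) together. In the method described above, the objective would instead be the similarity between CLIP embeddings of the composed image and of the target text or image.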
