Direct Ascent Synthesis: Revealing Hidden Generative Capabilities in Discriminative Models

Stanislav Fort, Jonathan Whitaker

TL;DR

Direct Ascent Synthesis (DAS) shows that pretrained discriminative models inherently encode rich generative capabilities. By optimizing across a multi-resolution decomposition of images to maximize CLIP-embedding similarity, DAS achieves training-free, high-quality synthesis that preserves natural image statistics (notably a $1/f^2$ spectrum) and avoids degenerate adversarial patterns. The method leverages simple priors, augmentations, and CLIP model ensembles to enable tasks such as text-to-image generation, style transfer, and reconstruction from embeddings, challenging the traditional dichotomy between discriminative and generative modeling. The work offers both practical benefits (reduced training and data needs) and theoretical implications for model interpretability, robustness, and the fundamental nature of visual representations.

Abstract

We demonstrate that discriminative models inherently contain powerful generative capabilities, challenging the fundamental distinction between discriminative and generative architectures. Our method, Direct Ascent Synthesis (DAS), reveals these latent capabilities through multi-resolution optimization of CLIP model representations. While traditional inversion attempts produce adversarial patterns, DAS achieves high-quality image synthesis by decomposing optimization across multiple spatial scales (1x1 to 224x224), requiring no additional training. This approach not only enables diverse applications -- from text-to-image generation to style transfer -- but also maintains natural image statistics ($1/f^2$ spectrum) and guides the generation away from non-robust adversarial patterns. Our results demonstrate that standard discriminative models encode substantially richer generative knowledge than previously recognized, providing new perspectives on model interpretability and the relationship between adversarial examples and natural image synthesis.
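The multi-resolution decomposition mentioned above can be illustrated with a minimal sketch of the image parameterization alone. It is not the authors' implementation: the resolution ladder, the nearest-neighbor upsampling, the initialization scale, and the helper names `upsample_nearest` and `compose_image` are all assumptions made for illustration. In the method itself, the components would be optimized jointly by gradient ascent on CLIP embedding similarity, which this sketch omits.

```python
import numpy as np

def upsample_nearest(component, size):
    """Resize an (r, r, 3) array to (size, size, 3) by nearest-neighbor lookup."""
    r = component.shape[0]
    idx = (np.arange(size) * r) // size  # source index for each output pixel
    return component[idx][:, idx]

def compose_image(components, size=224):
    """Express an image as the sum of all components at full resolution."""
    total = np.zeros((size, size, 3))
    for c in components:
        total += upsample_nearest(c, size)
    return total

# Hypothetical ladder of resolutions from 1x1 up to 224x224 (an assumption;
# the text above only states the two endpoints).
resolutions = [1, 2, 4, 8, 16, 32, 64, 128, 224]
rng = np.random.default_rng(0)
components = [0.1 * rng.standard_normal((r, r, 3)) for r in resolutions]
image = compose_image(components)
```

Because every pixel receives contributions from all scales, a gradient step on a coarse component moves large image regions together, while the finest component can still adjust individual pixels.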

Paper Structure

This paper contains 35 sections, 5 equations, 12 figures.

Figures (12)

  • Figure 1: Direct Ascent Synthesis generates high-quality images by optimizing multi-resolution components to match CLIP embeddings, without any generative training. Unlike standard adversarial optimization that produces noise-like patterns, our approach reveals that pretrained discriminative models contain rich generative knowledge accessible through careful optimization. It can be used for a variety of image manipulations, such as style transfer and image reconstruction from a low-dimensional embedding.
  • Figure 2: Multi-resolution decomposition enables training-free image synthesis. Left: An image is expressed as a sum of components at increasing resolutions, from $1\times1$ to $224\times224$. Middle: The components are optimized simultaneously to maximize CLIP embedding similarity with a target description, producing coherent images without generative training. Right: The power spectrum of generated images follows a $1/f^2$ distribution (slope $\approx-2$), characteristic of natural images. This demonstrates that our multi-resolution prior effectively guides optimization toward perceptually valid solutions.
  • Figure 3: Diverse generations from Direct Ascent Synthesis across a range of concepts and styles. Results were obtained by optimizing against an ensemble of three CLIP models, with prompt augmentation to control image aesthetics: discouraging text generation (-0.3 × "Optical Character Recognition"), enhancing rendering quality (0.3 × "octane render, unreal engine, ray tracing, volumetric lighting"), and preventing image stacking (-0.3 × "multiple exposure").
  • Figure 4: Ablation study demonstrating how different components of Direct Ascent Synthesis contribute to coherent image generation. Left: Direct pixel optimization yields adversarial patterns typical of model inversion attacks. Middle: Adding augmentations and model ensembling begins to impose structure but still lacks coherence. Right: Our complete approach with multi-resolution prior produces natural, interpretable images. This progression reveals how careful regularization can transform the degenerate solutions of model inversion into meaningful image synthesis.
  • Figure 5: Mapping between images and embeddings. The region of image space that maps to a given {text, image} embedding contains both interpretable images and noise-like adversarial patterns. Reconstructing an image from an embedding by standard optimization typically yields one of the degenerate, noisy images. With Direct Ascent Synthesis, the reconstructed image lands among the interpretable images within the manifold by default.
  • ...and 7 more figures
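The $1/f^2$ power spectrum claimed in the Figure 2 caption can be checked on any image with a radially averaged spectrum and a log-log line fit. The following is a minimal sketch of one common way to do this, not necessarily the procedure used for the figure; the function names, the integer radial binning, and the synthetic test image are assumptions for illustration.

```python
import numpy as np

def radial_power_spectrum(gray):
    """Radially averaged power spectrum of a square 2D array."""
    n = gray.shape[0]
    power = np.abs(np.fft.fftshift(np.fft.fft2(gray))) ** 2
    y, x = np.indices(power.shape)
    # Distance of each frequency bin from the zero frequency, rounded to a ring index
    radius = np.rint(np.hypot(x - n // 2, y - n // 2)).astype(int)
    sums = np.bincount(radius.ravel(), weights=power.ravel())
    counts = np.bincount(radius.ravel())
    freqs = np.arange(1, n // 2)  # skip the DC term and incomplete outer rings
    return freqs, sums[1:n // 2] / counts[1:n // 2]

def spectral_slope(gray):
    """Slope of log(power) against log(frequency)."""
    freqs, power = radial_power_spectrum(gray)
    slope, _ = np.polyfit(np.log(freqs), np.log(power), 1)
    return slope

# Sanity check on a synthetic image whose amplitude spectrum falls off as 1/f,
# so that its power spectrum falls off as 1/f^2.
rng = np.random.default_rng(0)
n = 224
fx = np.fft.fftfreq(n)[:, None]
fy = np.fft.fftfreq(n)[None, :]
f = np.hypot(fx, fy)
f[0, 0] = 1.0  # avoid division by zero at the DC term
synthetic = np.fft.ifft2(np.fft.fft2(rng.standard_normal((n, n))) / f).real
slope = spectral_slope(synthetic)  # expected to be close to -2
```

For a color image, one would typically convert to grayscale or average the channels before calling `spectral_slope`. White noise gives a slope near 0 under the same measurement, so the estimate separates noise-like adversarial patterns from images with natural statistics.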