Learned representation-guided diffusion models for large-image generation

Alexandros Graikos, Srikar Yellapragada, Minh-Quan Le, Saarthak Kapse, Prateek Prasanna, Joel Saltz, Dimitris Samaras

TL;DR

This work introduces learned representation-guided diffusion, which conditions diffusion models on self-supervised learning (SSL) embeddings to synthesize high-quality histopathology and remote-sensing images. It extends patch-level diffusion to large images by arranging SSL-conditioned patches on a grid and merging their denoising updates, and it enables controllable text-to-large-image synthesis via auxiliary mappings $p(z|c)$. The approach achieves strong patch- and large-image fidelity (as measured by FID, CLIP-FID, and embedding similarity), supports effective data augmentation for downstream tasks, and generalizes across datasets and modalities. By reducing annotation costs and enabling domain-specific foundation models, the method has practical potential for scalable synthesis and analysis in specialized imaging domains.

Abstract

To synthesize high-fidelity samples, diffusion models typically require auxiliary data to guide the generation process. However, obtaining the patch-level annotations this requires is impractical in specialized domains like histopathology and satellite imagery: annotation is usually performed by domain experts and involves hundreds of millions of patches. Modern-day self-supervised learning (SSL) representations encode rich semantic and visual information. In this paper, we posit that such representations are expressive enough to act as proxies for fine-grained human labels. We introduce a novel approach that trains diffusion models conditioned on embeddings from SSL. Our diffusion models successfully project these features back to high-quality histopathology and remote sensing images. In addition, we construct larger images by assembling spatially consistent patches inferred from SSL embeddings, preserving long-range dependencies. Augmenting real data by generating variations of real images improves downstream classifier accuracy for patch-level and larger, image-scale classification tasks. Our models are effective even on datasets not encountered during training, demonstrating their robustness and generalizability. Generating images from learned embeddings is agnostic to the source of the embeddings. The SSL embeddings used to generate a large image can either be extracted from a reference image or sampled from an auxiliary model conditioned on any related modality (e.g. class labels, text, genomic data). As a proof of concept, we introduce the text-to-large-image synthesis paradigm, in which we successfully synthesize large pathology and satellite images from text descriptions.
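
The statement that generation is agnostic to the source of the embeddings can be read as a two-stage sampling scheme. The following is a sketch of that reading, not necessarily the paper's exact formulation. Here $z$ is the SSL embedding and $c$ the related modality, as in $p(z|c)$; the image symbol $x$ is introduced for illustration, and the factorization assumes $x$ depends on $c$ only through $z$:

$$
p(x \mid c) = \int p(x \mid z)\, p(z \mid c)\, dz
$$

Under this assumption, drawing $z \sim p(z \mid c)$ from the auxiliary model and then $x \sim p(x \mid z)$ from the SSL-conditioned diffusion model yields a sample from $p(x \mid c)$ without retraining the diffusion model for each new modality.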

Paper Structure (24 sections, 6 equations, 12 figures, 8 tables)

Figures (12)

  • Figure 1: We propose using SSL features to condition diffusion models. This allows us to construct large images by assembling consistent patches inferred from a spatial arrangement of SSL embeddings. The generated image retains the semantics of the embeddings used as a condition, maintaining the forested and open areas from the reference. Best viewed zoomed-in.
  • Figure 2: (a) We train diffusion models on patches $I$ (e.g. the one in the green box) taken from a large image conditioned on SSL embeddings. (b) We present our large image generation framework in 4 steps: (i) We extract a set of spatially arranged embeddings from a reference image or sample them from an auxiliary model. (ii) For every location $(i,j)$, we compute a conditioning vector $\lambda_{i,j}$ by interpolating the spatial grid of embeddings. (iii) At every diffusion step, we denoise the patch $F(i,j)$ using the conditioning $\lambda_{i,j}$. (iv) The next step is computed by averaging the denoising updates of all patches that overlap at $(i,j)$.
  • Figure 3: (Top) Patches (256 $\times$ 256) from our models, and the corresponding reference real patches used to generate them. The SSL-guided LDM replicates the semantics of the reference patch. (Bottom) Large images (1024 $\times$ 1024) from our models, and the corresponding reference real images used to generate them. We preserve the global arrangement of the semantics defined in the reference image.
  • Figure 4: Confusion matrix of zero-shot classification for novel TCGA-CRC and TCGA-BRCA synthetic images.
  • Figure 5: Examples of patches annotated by an expert pathologist. For each image, the pathologist required 5-10s to provide a brief, detailed description of the features visible. Annotating the entirety of TCGA in this manner is a colossal task.
  • ...and 7 more figures
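
Steps (ii) to (iv) of the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_denoise_update` is a hypothetical stand-in for the trained conditional denoiser, the image is a single-channel array rather than a latent, the conditioning vector is bilinearly interpolated at each patch's top-left corner, and the names `interpolate_grid`, `averaged_step`, `emb_stride`, and `step` are invented for this example.

```python
import numpy as np


def interpolate_grid(grid, i, j, stride):
    """Bilinearly interpolate an (H, W, D) grid of embeddings at pixel (i, j).

    Assumption: embedding (a, b) is located at pixel (a * stride, b * stride).
    """
    H, W, _ = grid.shape
    y = min(i / stride, H - 1)
    x = min(j / stride, W - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * grid[y0, x0] + wx * grid[y0, x1]
    bottom = (1 - wx) * grid[y1, x0] + wx * grid[y1, x1]
    return (1 - wy) * top + wy * bottom


def toy_denoise_update(patch, cond, t):
    """Hypothetical denoiser: nudges pixels toward the mean of the condition.

    A real model would be a trained network; t is unused in this toy version.
    """
    return 0.1 * (cond.mean() - patch)


def averaged_step(image, grid, patch, step, emb_stride, t,
                  update_fn=toy_denoise_update):
    """One diffusion step over a large image with overlapping patches."""
    H, W = image.shape
    total = np.zeros_like(image)   # sum of updates covering each pixel
    count = np.zeros_like(image)   # number of patches covering each pixel
    for i in range(0, H - patch + 1, step):
        for j in range(0, W - patch + 1, step):
            cond = interpolate_grid(grid, i, j, emb_stride)       # step (ii)
            window = image[i:i + patch, j:j + patch]
            update = update_fn(window, cond, t)                   # step (iii)
            total[i:i + patch, j:j + patch] += update
            count[i:i + patch, j:j + patch] += 1
    # Step (iv): average the updates of all patches overlapping each pixel.
    return image + total / np.maximum(count, 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(3, 3, 8))   # 3x3 grid of 8-dim embeddings
    x = rng.normal(size=(64, 64))             # noisy "large image"
    for t in range(10):
        x = averaged_step(x, embeddings, patch=32, step=16, emb_stride=32, t=t)
    print(x.shape)
```

Averaging over the overlap is what keeps neighbouring patches consistent: each pixel receives a consensus update from every patch that covers it, so no seam is introduced at patch borders. With `step` smaller than `patch`, every interior pixel is covered by several patches.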