ZoomLDM: Latent Diffusion Model for multi-scale image generation

Srikar Yellapragada, Alexandros Graikos, Kostas Triaridis, Prateek Prasanna, Rajarsi R. Gupta, Joel Saltz, Dimitris Samaras

TL;DR

ZoomLDM tackles the challenge of generating large, gigapixel-scale images by learning a unified, multi-scale diffusion process conditioned on scale and SSL-derived features. It introduces a cross-magnification latent space via a trainable Summarizer and a Conditioning Diffusion Model to enable magnification-aware sampling, along with a joint multi-scale sampling strategy that yields globally coherent images up to $4096 \times 4096$ while preserving local detail. The approach achieves state-of-the-art synthesis quality across scales, enables effective 4× super-resolution, and provides rich multi-scale features that improve multiple instance learning in histopathology; its satellite-data experiments further demonstrate broad applicability. Overall, ZoomLDM provides a practical, data-efficient path toward foundation-like generative capabilities for large-domain imaging, with strong potential for downstream tasks and dataset generation.

Abstract

Diffusion models have revolutionized image generation, yet several challenges restrict their application to large-image domains, such as digital pathology and satellite imagery. Given that it is infeasible to directly train a model on 'whole' images from domains with potential gigapixel sizes, diffusion-based generative methods have focused on synthesizing small, fixed-size patches extracted from these images. However, generating small patches has limited applicability since patch-based models fail to capture the global structures and wider context of large images, which can be crucial for synthesizing (semantically) accurate samples. To overcome this limitation, we present ZoomLDM, a diffusion model tailored for generating images across multiple scales. Central to our approach is a novel magnification-aware conditioning mechanism that utilizes self-supervised learning (SSL) embeddings and allows the diffusion model to synthesize images at different 'zoom' levels, i.e., fixed-size patches extracted from large images at varying scales. ZoomLDM synthesizes coherent histopathology images that remain contextually accurate and detailed at different zoom levels, achieving state-of-the-art image generation quality across all scales and excelling in the data-scarce setting of generating thumbnails of entire large images. The multi-scale nature of ZoomLDM unlocks additional capabilities in large image generation, enabling computationally tractable and globally coherent image synthesis up to $4096 \times 4096$ pixels and $4\times$ super-resolution. Additionally, multi-scale features extracted from ZoomLDM are highly effective in multiple instance learning experiments.

Paper Structure

This paper contains 28 sections, 13 equations, 16 figures, 9 tables, 2 algorithms.

Figures (16)

  • Figure 1: ZoomLDM can generate synthetic image patches at multiple scales (left). It can generate large images that preserve spatial context (center) and perform super-resolution (right), without any additional training. Large images from prior work [le2024inftybrushcontrollablelargeimage, graikos2024learned] suffer from blurriness and lack of global context.
  • Figure 2: Overview of our approach. Left: We extract $256 \times 256$ patches from large images at the initial scale ($20\times$ for pathology) and generate SSL embedding matrices using pretrained encoders. The large image is then progressively downsampled by a factor of 2, with patches at each scale paired with the SSL embeddings of all overlapping initial-scale patches. Right: The SSL embeddings and magnification level are fed to the Summarizer, which projects them into the cross-magnification latent space. The diffusion model is trained to generate $256 \times 256$ patches conditioned on the Summarizer's output.
  • Figure 3: Large Images ($4096 \times 4096$) generated from ZoomLDM. Our large image generation framework is the first to generate 4k pathology images with local details and global consistency, all within reasonable inference time. We provide more 4k examples and comparisons in the supplementary.
  • Figure 4: We showcase $4 \times$ super-resolution results ($256 \times 256 \rightarrow 1024 \times 1024$). Samples generated by other methods [rombach2022high, zhang2023adding] exhibit artifacts, inconsistencies, and blurriness that are not present in our outputs. Specifically, in the blue boxes, we can observe that CompVis [rombach2022high] generates fine-scale artifacts, while ControlNet [zhang2023adding] produces generally blurry outputs. ZoomLDM produces a sharp output, generating details generally consistent with the ground truth image.
  • Figure 5: Overview of the Summarizer and Conditioning Diffusion Model.
  • ...and 11 more figures
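
The pairing described in the Figure 2 caption (each patch at a downsampled scale is matched with the SSL embeddings of all initial-scale patches it overlaps) can be illustrated with a small index calculation. The sketch below is not the authors' code. It assumes a non-overlapping, grid-aligned tiling and that "level k" means the image has been downsampled by 2 a total of k times; the function names `overlapping_base_patches` and `footprint_pixels` are hypothetical.

```python
from typing import List, Tuple

PATCH = 256  # patch side in pixels, as stated in the Figure 2 caption


def overlapping_base_patches(level: int, row: int, col: int) -> List[Tuple[int, int]]:
    """Grid indices of the initial-scale patches covered by one patch at `level`.

    Assumption (not from the source): patches lie on a non-overlapping,
    grid-aligned tiling, and `level` counts how many times the large image
    has been downsampled by a factor of 2.
    """
    factor = 2 ** level  # one patch at this level spans factor x factor base patches
    return [
        (row * factor + dr, col * factor + dc)
        for dr in range(factor)
        for dc in range(factor)
    ]


def footprint_pixels(level: int) -> int:
    """Side length, in initial-scale pixels, of the region one patch covers."""
    return PATCH * 2 ** level


if __name__ == "__main__":
    # At level 0 a patch is paired with exactly one SSL embedding: its own.
    print(overlapping_base_patches(0, 3, 5))
    # At level 4 a single 256 x 256 patch covers a 4096 x 4096 region,
    # so it is paired with a 16 x 16 grid of initial-scale embeddings.
    print(footprint_pixels(4), len(overlapping_base_patches(4, 0, 0)))
```

Under these assumptions the number of paired embeddings grows as $4^k$ with the level $k$, which is one way to see why a component such as the Summarizer is needed to project a variable-size embedding set into a fixed conditioning input.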