ZoomLDM: Latent Diffusion Model for multi-scale image generation

Srikar Yellapragada; Alexandros Graikos; Kostas Triaridis; Prateek Prasanna; Rajarsi R. Gupta; Joel Saltz; Dimitris Samaras

TL;DR

ZoomLDM tackles image generation in large-image domains, where images can reach gigapixel scale, by training a single diffusion model across multiple scales, conditioned on the magnification level and on features from self-supervised learning (SSL). It introduces a cross-magnification latent space via a trainable Summarizer, a Conditioning Diffusion Model that enables magnification-aware sampling, and a joint multi-scale sampling strategy that yields globally coherent images up to $4096 \times 4096$ pixels while preserving local detail. The approach achieves state-of-the-art synthesis quality across scales, enables $4\times$ super-resolution, and provides multi-scale features that improve multiple instance learning in histopathology; experiments on satellite data indicate that it applies beyond pathology. Overall, ZoomLDM offers a practical, data-efficient route toward foundation-model-like generative capabilities in large-image domains, with potential uses in downstream tasks and dataset generation.

Abstract

Diffusion models have revolutionized image generation, yet several challenges restrict their application to large-image domains, such as digital pathology and satellite imagery. Given that it is infeasible to directly train a model on 'whole' images from domains with potential gigapixel sizes, diffusion-based generative methods have focused on synthesizing small, fixed-size patches extracted from these images. However, generating small patches has limited applicability since patch-based models fail to capture the global structures and wider context of large images, which can be crucial for synthesizing (semantically) accurate samples. To overcome this limitation, we present ZoomLDM, a diffusion model tailored for generating images across multiple scales. Central to our approach is a novel magnification-aware conditioning mechanism that utilizes self-supervised learning (SSL) embeddings and allows the diffusion model to synthesize images at different 'zoom' levels, i.e., fixed-size patches extracted from large images at varying scales. ZoomLDM synthesizes coherent histopathology images that remain contextually accurate and detailed at different zoom levels, achieving state-of-the-art image generation quality across all scales and excelling in the data-scarce setting of generating thumbnails of entire large images. The multi-scale nature of ZoomLDM unlocks additional capabilities in large image generation, enabling computationally tractable and globally coherent image synthesis up to $4096 \times 4096$ pixels and $4\times$ super-resolution. Additionally, multi-scale features extracted from ZoomLDM are highly effective in multiple instance learning experiments.
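To make the notion of 'zoom' levels concrete, the sketch below shows one way to obtain fixed-size patches that cover progressively larger regions of a large image. It is an illustration only, not the authors' pipeline: the function name `extract_zoom_patch`, the 256-pixel patch size, the power-of-two downsampling factors, and the use of average pooling are all assumptions made for this example.

```python
import numpy as np


def extract_zoom_patch(image, top, left, level, patch=256):
    """Return a (patch, patch, C) array covering a square region of the image.

    Hypothetical helper: at zoom level `level`, the region spans
    patch * 2**level pixels per side and is average-pooled down to
    patch x patch. The patch size and power-of-two factors are assumptions.
    """
    factor = 2 ** level
    span = patch * factor
    region = image[top:top + span, left:left + span]
    if region.shape[0] != span or region.shape[1] != span:
        raise ValueError("requested region falls outside the image")
    channels = region.shape[2]
    # Group pixels into factor x factor blocks, then average each block.
    blocks = region.reshape(patch, factor, patch, factor, channels)
    return blocks.mean(axis=(1, 3))


# Usage: a synthetic 1024 x 1024 RGB image whose pixel value equals its row index.
rows = np.arange(1024, dtype=np.float64).reshape(-1, 1, 1)
large_image = np.broadcast_to(rows, (1024, 1024, 3)).copy()

detail = extract_zoom_patch(large_image, 0, 0, level=0)   # 256 x 256 pixel region
context = extract_zoom_patch(large_image, 0, 0, level=2)  # 1024 x 1024 pixel region
```

Both outputs have the same shape, but `detail` keeps full resolution over a small region while `context` summarizes a region four times wider per side. A multi-scale model sees both kinds of patch, which is what lets it represent local detail and global structure with a fixed input size.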
