Layout-Guided Controllable Pathology Image Generation with In-Context Diffusion Transformers

Yuntao Shou, Xiangyong Cao, Qian Zhao, Deyu Meng

Abstract

Controllable pathology image synthesis requires reliable regulation of spatial layout, tissue morphology, and semantic detail. However, existing text-guided diffusion models offer only coarse global control and lack the ability to enforce fine-grained structural constraints. Progress is further limited by the absence of large datasets that pair patch-level spatial layouts with detailed diagnostic descriptions, since generating such annotations for gigapixel whole-slide images is prohibitively time-consuming for human experts. To overcome these challenges, we first develop a scalable multi-agent LVLM annotation framework that integrates image description, diagnostic step extraction, and automatic quality judgment into a coordinated pipeline, and we evaluate the reliability of the system through a human verification process. This framework enables efficient construction of fine-grained and clinically aligned supervision at scale. Building on the curated data, we propose In-Context Diffusion Transformer (IC-DiT), a layout-aware generative model that incorporates spatial layouts, textual descriptions, and visual embeddings into a unified diffusion transformer. Through hierarchical multimodal attention, IC-DiT maintains global semantic coherence while accurately preserving structural and morphological details. Extensive experiments on five histopathology datasets show that IC-DiT achieves higher fidelity, stronger spatial controllability, and better diagnostic consistency than existing methods. In addition, the generated images serve as effective data augmentation resources for downstream tasks such as cancer classification and survival analysis.

Paper Structure (17 sections, 9 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Comparison between traditional diffusion-based pathology synthesis and the proposed IC-DiT. (a) Conventional latent diffusion models rely on global cross-attention, offering limited control over spatial structure and semantics. (b) IC-DiT integrates multi-modal in-context conditioning with image, layout, visual, and text tokens, enabling anatomically coherent and spatially controllable pathology generation.
  • Figure 2: Overview of the multi-agent framework for pathological image description generation. It includes: (1) Pathological Feature Extraction, where vision-language models convert images into visual-textual descriptions; (2) Key Step Extraction, where language models break down diagnostic reasoning into structured steps; and (3) Human Verification, which uses automated scores and expert feedback for accuracy assessment. A Judge Agent quantifies reliability through recognition and coherence metrics.
  • Figure 3: Overview of the In-Context Diffusion Transformer (IC-DiT) architecture. The model encodes text, image, layout, and embedding tokens with separate encoders (T5, VAE, iBOT) and fuses them in a shared latent space through multi-modal attention (MM-Attention) applied across modalities, enabling high-fidelity pathology image generation conditioned on semantic descriptions and spatial layouts.
  • Figure 4: Qualitative comparison of layout-guided pathology image generation. Each column shows the input caption, corresponding mask, real image, and generated image.
  • Figure 5: Ablation study on layout control. Comparison between real pathology patches and images generated without layout-mask constraints.
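
The in-context conditioning described in the captions for Figures 1 and 3 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the four token groups have already been projected to a common width, it uses a single flat attention head rather than the hierarchical multimodal attention the abstract describes, and every name, token count, and dimension in it (`joint_attention`, `tokens`, `d = 16`) is hypothetical. The sketch only shows the basic idea of concatenating modality tokens into one sequence so that every token can attend to every other.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(token_groups, w_q, w_k, w_v):
    """Single-head self-attention over the concatenation of all modality tokens.

    token_groups: dict mapping a modality name to an array of shape (n_tokens, d).
    Returns the fused tokens split back per modality, plus the attention matrix.
    """
    names = list(token_groups)
    sizes = [token_groups[n].shape[0] for n in names]
    seq = np.concatenate([token_groups[n] for n in names], axis=0)  # (N_total, d)

    q, k, v = seq @ w_q, seq @ w_k, seq @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N_total, N_total)
    out = attn @ v

    # Split the fused sequence back into its modality groups.
    parts = np.split(out, np.cumsum(sizes)[:-1], axis=0)
    return dict(zip(names, parts)), attn

# Hypothetical token counts and width, chosen only for illustration.
rng = np.random.default_rng(0)
d = 16
tokens = {
    "text": rng.normal(size=(8, d)),     # e.g. caption tokens
    "layout": rng.normal(size=(16, d)),  # e.g. encoded layout-mask patches
    "visual": rng.normal(size=(4, d)),   # e.g. visual embedding tokens
    "image": rng.normal(size=(16, d)),   # noisy image latents being denoised
}
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

fused, attn = joint_attention(tokens, w_q, w_k, w_v)
```

In this sketch the image latents read from the text, layout, and visual tokens in the same attention pass, in contrast to the separate global cross-attention path that Figure 1(a) attributes to conventional latent diffusion models.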