Layout-Guided Controllable Pathology Image Generation with In-Context Diffusion Transformers

Yuntao Shou; Xiangyong Cao; Qian Zhao; Deyu Meng

Abstract

Controllable pathology image synthesis requires reliable regulation of spatial layout, tissue morphology, and semantic detail. However, existing text-guided diffusion models offer only coarse global control and cannot enforce fine-grained structural constraints. Progress is further limited by the absence of large datasets that pair patch-level spatial layouts with detailed diagnostic descriptions, since producing such annotations for gigapixel whole-slide images is prohibitively time-consuming for human experts. To overcome these challenges, we first develop a scalable multi-agent LVLM annotation framework that integrates image description, diagnostic step extraction, and automatic quality judgment into a coordinated pipeline, and we evaluate the reliability of the system through a human verification process. This framework enables efficient construction of fine-grained and clinically aligned supervision at scale. Building on the curated data, we propose the In-Context Diffusion Transformer (IC-DiT), a layout-aware generative model that incorporates spatial layouts, textual descriptions, and visual embeddings into a unified diffusion transformer. Through hierarchical multimodal attention, IC-DiT maintains global semantic coherence while accurately preserving structural and morphological details. Extensive experiments on five histopathology datasets show that IC-DiT achieves higher fidelity, stronger spatial controllability, and better diagnostic consistency than existing methods. In addition, the generated images serve as effective data augmentation for downstream tasks such as cancer classification and survival analysis.

Paper Structure (17 sections, 9 equations, 5 figures, 6 tables)

Introduction
Related Work
Diffusion Models for Digital Pathology
Diffusion Transformer
Layout-to-Image Generation
Proposed Method
Framework Overview
Pathology Image-Text Pair Construction
Layout Representation Generation
Layout-to-Image Generation
Semantic, Spatial, and Visual Integration
Training and Inference
Experiments
Datasets and Evaluation Metrics
Pathological Image Generation
...and 2 more sections

Figures (5)

Figure 1: Comparison between traditional diffusion-based pathology synthesis and the proposed IC-DiT. (a) Conventional latent diffusion models rely on global cross-attention, offering limited control over spatial structure and semantics. (b) IC-DiT integrates multi-modal in-context conditioning with image, layout, visual, and text tokens, enabling anatomically coherent and spatially controllable pathology generation.
Figure 2: Overview of the multi-agent framework for pathological image description generation. It includes: (1) Pathological Feature Extraction, where vision-language models convert images into visual-textual descriptions; (2) Key Step Extraction, where language models break down diagnostic reasoning into structured steps; and (3) Human Verification, which uses automated scores and expert feedback for accuracy assessment. A Judge Agent quantifies reliability through recognition and coherence metrics.
Figure 3: Overview of the In-Context Diffusion Transformer (IC-DiT) architecture. The model obtains text, image, layout, and embedding tokens from separate encoders (T5, VAE, iBOT) and fuses them in a shared latent space with multi-modal attention (MM-Attention) applied across modalities, enabling high-fidelity pathology image generation conditioned on semantic descriptions and spatial layouts.
Figure 4: Qualitative comparison of layout-guided pathology image generation. Each column shows the input caption, corresponding mask, real image, and generated image.
Figure 5: Ablation study on layout control. Comparison between real pathology patches and images generated without layout-mask constraints.
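
The token fusion described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single attention head, one flat attention pass over the concatenated tokens rather than the hierarchical scheme mentioned in the abstract, and random arrays in place of real T5, VAE, and iBOT encoder outputs. The function name `joint_attention`, the modality names, the token counts, and the projection matrices are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention(token_groups, w_q, w_k, w_v):
    """Single-head self-attention over the concatenation of all modality tokens.

    token_groups maps a modality name to an array of shape (n_tokens, d).
    Returns the attended tokens split back per modality, plus the full
    attention-weight matrix of shape (n_total, n_total).
    """
    names = list(token_groups)
    x = np.concatenate([token_groups[n] for n in names], axis=0)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    out = weights @ v
    # Split the fused sequence back into per-modality chunks.
    sizes = [token_groups[n].shape[0] for n in names]
    chunks = np.split(out, np.cumsum(sizes)[:-1], axis=0)
    return dict(zip(names, chunks)), weights


# Hypothetical token counts and width; real encoders would supply these arrays.
rng = np.random.default_rng(0)
d = 16
tokens = {
    "text": rng.normal(size=(4, d)),
    "image": rng.normal(size=(9, d)),
    "layout": rng.normal(size=(9, d)),
    "visual": rng.normal(size=(1, d)),
}
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
fused, weights = joint_attention(tokens, w_q, w_k, w_v)
```

Because every token attends to every other token in the concatenated sequence, an image token can draw on layout, text, and visual tokens in the same pass. That is the sense of "in-context" conditioning this sketch aims to convey.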
