A WDLoRA-Based Multimodal Generative Framework for Clinically Guided Corneal Confocal Microscopy Image Synthesis in Diabetic Neuropathy

Xin Zhang; Liangxiu Han; Yue Shi; Yalin Zheng; Uazman Alam; Maryam Ferdousi; Rayaz Malik

TL;DR

A comprehensive three-pillar evaluation demonstrates that the proposed Weight-Decomposed Low-Rank Adaptation (WDLoRA)-based multimodal generative framework for clinically guided CCM image synthesis achieves state-of-the-art visual fidelity and structural integrity, and has potential to alleviate data bottlenecks in medical AI.

Abstract

Corneal Confocal Microscopy (CCM) is a sensitive tool for assessing small-fiber damage in Diabetic Peripheral Neuropathy (DPN), yet the development of robust, automated deep learning-based diagnostic models is limited by scarce labelled data and fine-grained variability in corneal nerve morphology. Although Artificial Intelligence (AI)-driven foundation generative models excel at natural image synthesis, they often struggle in medical imaging due to limited domain-specific training, compromising the anatomical fidelity required for clinical analysis. To overcome these limitations, we propose a Weight-Decomposed Low-Rank Adaptation (WDLoRA)-based multimodal generative framework for clinically guided CCM image synthesis. WDLoRA is a parameter-efficient fine-tuning (PEFT) mechanism that decouples magnitude and directional weight updates, enabling foundation generative models to independently learn the orientation (nerve topology) and intensity (stromal contrast) required for medical realism. By jointly conditioning on nerve segmentation masks and disease-specific clinical prompts, the model synthesises anatomically coherent images across the DPN spectrum (Control, T1NoDPN, T1DPN). A comprehensive three-pillar evaluation demonstrates that the proposed framework achieves state-of-the-art visual fidelity (Fréchet Inception Distance (FID): 5.18) and structural integrity (Structural Similarity Index Measure (SSIM): 0.630), significantly outperforming GAN and standard diffusion baselines. Crucially, the synthetic images preserve gold-standard clinical biomarkers and are statistically equivalent to real patient data. When used to train automated diagnostic models, the synthetic dataset improves downstream diagnostic accuracy by 2.1% and segmentation performance by 2.2%, validating the framework's potential to alleviate data bottlenecks in medical AI.
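The magnitude/direction decoupling described in the abstract can be illustrated with a small sketch. This page does not reproduce the paper's equations, so the code below assumes the common weight-decomposed formulation: the frozen weight plus a low-rank update is normalised column-wise, then rescaled by a trainable magnitude vector. The function name `wdlora_weight`, the tensor shapes, and the initialisation are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def column_norms(w):
    """Euclidean norm of each column of w."""
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*w)]


def wdlora_weight(w0, magnitude, lora_b, lora_a):
    """Merge a frozen weight with a magnitude vector and a low-rank directional update.

    Assumed shapes (hypothetical, for illustration only):
      w0        frozen pre-trained weight, d_out x d_in
      magnitude trainable per-column magnitudes, length d_in
      lora_b    trainable factor, d_out x r
      lora_a    trainable factor, r x d_in
    """
    delta = matmul(lora_b, lora_a)  # low-rank update applied to the direction only
    directed = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(w0, delta)]
    norms = column_norms(directed)
    # Normalise each column to unit length, then rescale by its learned magnitude.
    return [[m * v / n for v, m, n in zip(row, magnitude, norms)] for row in directed]


random.seed(0)
d_out, d_in, rank = 4, 3, 2
w0 = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
magnitude = column_norms(w0)  # start from the pre-trained column norms
lora_a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
lora_b = [[0.0] * rank for _ in range(d_out)]  # zero init: the merged weight equals w0

merged = wdlora_weight(w0, magnitude, lora_b, lora_a)
```

With `lora_b` initialised to zero the merged weight reproduces `w0`. Once the low-rank factors change, only the column directions move, while the column norms stay pinned to `magnitude`; this is the sense in which the two components can be learned independently.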

Paper Structure (32 sections, 9 equations, 9 figures, 9 tables)

This paper contains 32 sections, 9 equations, 9 figures, 9 tables.

Introduction
Related Work
Conditional Generative Models for Medical Image Synthesis
Diabetic Peripheral Neuropathy Diagnosis using CCM
Methods
Overview of the Proposed Conditional Multimodal Generative Framework
Multimodal Conditional Image Generation Foundation Model
Multimodal Diffusion Transformer (MMDiT)
Multimodal Encoding and Latent Diffusion Space
Weight-Decomposed Low-Rank Adaptation (WDLoRA)
Experiments and Evaluation
Dataset and Preprocessing
Image Acquisition
Data Preparation for Generative Modeling
Three-pillar Evaluation Framework
...and 17 more sections

Figures (9)

Figure 1: Evolution of GenAI Models in Medical Imaging. (1) Variational Autoencoders (VAEs) learn a probabilistic latent space but often produce blurry outputs. (2) Generative Adversarial Networks (GANs) use adversarial training for high fidelity but suffer from mode collapse. (3) Denoising Diffusion Probabilistic Models (DDPMs) iteratively denoise data, offering superior stability and mode coverage, making them the state-of-the-art for medical synthesis.
Figure 2: Overview of the proposed generative framework. Building on Qwen-Image-Edit (Wu et al., 2025), the pipeline uses the Qwen2.5-VL encoder to extract semantic features from multimodal inputs (nerve segmentation masks and clinical text prompts). These features condition the Multimodal Diffusion Transformer (MMDiT) backbone. Weight-Decomposed Low-Rank Adaptation (WDLoRA) is used to efficiently fine-tune the MMDiT blocks on CCM data, enabling high-fidelity synthesis while preserving pre-trained knowledge.
Figure 3: Architecture of the Multimodal Diffusion Transformer (MMDiT). The model processes concatenated image and text tokens through a unified self-attention mechanism, so that each modality attends to the other bidirectionally, enabling precise semantic alignment.
Figure 4: Schematic of Weight-Decomposed Low-Rank Adaptation (WDLoRA). The mechanism decomposes weights into magnitude and direction, applying low-rank updates only to the directional component. This structure is applied to the Attention and MLP layers of the MMDiT backbone.
Figure 5: Visualisation of the dataset used in this study. The figure displays the diagnostic class labels and the corresponding expert annotations (nerve segmentation masks), which serve as the control images for the generative model.
...and 4 more figures
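
The unified attention over concatenated tokens described in the Figure 3 caption can be sketched in a few lines. This is a deliberately simplified, single-head illustration: it uses identity query/key/value projections and toy 2-dimensional tokens, whereas a real MMDiT block uses learned projections, multiple heads, and normalisation layers. The function name `joint_self_attention` and the example values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(image_tokens, text_tokens):
    """Single-head attention over the concatenated image and text sequence.

    Simplifying assumption: queries, keys, and values are the tokens themselves.
    """
    tokens = image_tokens + text_tokens  # one shared sequence for both modalities
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        attn = softmax(scores)
        weights.append(attn)
        # Weighted sum of all tokens, regardless of which modality they came from.
        outputs.append([sum(w * v[j] for w, v in zip(attn, tokens)) for j in range(dim)])
    return outputs, weights


image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy stand-ins for mask/image latents
text_tokens = [[0.5, 0.5], [1.0, -1.0]]              # toy stand-ins for prompt embeddings
outputs, weights = joint_self_attention(image_tokens, text_tokens)
```

Because every query scores against every key in the shared sequence, the rows of `weights` for image tokens carry non-zero mass on text positions and vice versa, which gives the bidirectional exchange without a separate cross-attention module.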
