DiA-gnostic VLVAE: Disentangled Alignment-Constrained Vision Language Variational AutoEncoder for Robust Radiology Reporting with Missing Modalities

Nagur Shareef Shaik, Teja Krishna Cherukuri, Adnan Masood, Dong Hye Ye

TL;DR

The DiA-gnostic VLVAE achieves robust radiology reporting through Disentangled Alignment: it separates shared and modality-specific features using a Mixture-of-Experts based Vision-Language Variational Autoencoder (VLVAE).

Abstract

The integration of medical images with clinical context is essential for generating accurate and clinically interpretable radiology reports. However, current automated methods often rely on resource-heavy Large Language Models (LLMs) or static knowledge graphs and struggle with two fundamental challenges in real-world clinical data: (1) missing modalities, such as incomplete clinical context, and (2) feature entanglement, where mixed modality-specific and shared information leads to suboptimal fusion and clinically unfaithful hallucinated findings. To address these challenges, we propose the DiA-gnostic VLVAE, which achieves robust radiology reporting through Disentangled Alignment. Our framework is designed to be resilient to missing modalities by disentangling shared and modality-specific features using a Mixture-of-Experts (MoE) based Vision-Language Variational Autoencoder (VLVAE). A constrained optimization objective enforces orthogonality and alignment between these latent representations to prevent suboptimal fusion. A compact LLaMA-X decoder then uses these disentangled representations to generate reports efficiently. On the IU X-Ray and MIMIC-CXR datasets, DiA achieves competitive BLEU@4 scores of 0.266 and 0.134, respectively. Experimental results show that the proposed method significantly outperforms state-of-the-art models.
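
To make the orthogonality and alignment constraints concrete, the following is a minimal sketch rather than the authors' formulation. It assumes the orthogonality term penalizes the squared cosine similarity between a modality-specific latent and the shared latent, and that the alignment term penalizes cosine distance between two views of the shared latent. The function names (`cosine`, `orthogonality_penalty`, `alignment_penalty`) and both definitions are hypothetical; the paper's exact losses may differ.

```python
import math

def cosine(a, b, eps=1e-8):
    """Cosine similarity between two equal-length vectors (eps avoids division by zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)

def orthogonality_penalty(specific, shared):
    """Assumed form: mean squared cosine similarity per sample.

    Approaches 0 when each modality-specific latent is orthogonal
    to the corresponding shared latent.
    """
    return sum(cosine(a, b) ** 2 for a, b in zip(specific, shared)) / len(specific)

def alignment_penalty(shared_v, shared_l):
    """Assumed form: mean cosine distance per sample.

    Approaches 0 when the two views of the shared latent point
    in the same direction.
    """
    return sum(1.0 - cosine(a, b) for a, b in zip(shared_v, shared_l)) / len(shared_v)

# Toy batch of two samples with 2-D latents (illustrative values only).
z_v = [[1.0, 0.0], [0.0, 2.0]]   # hypothetical vision-specific latents
z_s = [[0.0, 3.0], [1.0, 0.0]]   # hypothetical shared latents
print(orthogonality_penalty(z_v, z_s))  # small when the latents are disentangled
print(alignment_penalty(z_s, z_s))      # small when the shared views agree
```

In a training loop, terms like these would typically be added to the VAE objective with weighting coefficients, so that the specific latents $Z_v, Z_l$ are pushed away from the shared latent $Z_s$ while the shared views are pulled together.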

Paper Structure

This paper contains 42 sections, 24 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Architecture of DiA. The model extracts vision features using EfficientNet-B0 with Guided Context Attention and language features via a Transformer Encoder, fused by a Modality Abstractor; learns modality-specific latents ($Z_v, Z_l$) using VAEs (VGG16 and Transformer) and a shared latent ($Z_s$) through a Mixture-of-Experts Shared Encoder, disentangled via $\mathcal{L}_{\text{orth}}$ and aligned with $\mathcal{L}_{\text{align}}$; and generates reports using a LLaMA-X decoder.
  • Figure 2: Comparison of actual and generated reports with chest X-rays and attention maps. Purple highlights key findings in the actual report, green indicates matched findings in the generated report, and amber marks mismatched or additional generated findings.
  • Figure 3: t-SNE projections of latent variables for IU X-Ray. Each subfigure shows distributions of language-specific ($Z_l$, blue), vision-specific ($Z_v$, red), and shared ($Z_s$, green) representations under four settings: (a) Base VLVAE, (b) with $\mathcal{L}_{\text{orth}}$, (c) with $\mathcal{L}_{\text{align}}$, and (d) with both constraints.
  • Figure 4: t-SNE projections of latent variables for MIMIC-CXR under the same settings as in Fig. 3. The plots illustrate how the latent space evolves across training objectives and datasets.