Causal Disentanglement for Robust Long-tail Medical Image Generation
Weizhi Nie, Zichun Zhang, Weijie Wang, Bruno Lepri, Anan Liu, Nicu Sebe
TL;DR
The paper tackles medical image generation under data scarcity and long-tailed class imbalance by introducing a framework that separates pathological features from structural (identity) features through causal disentanglement, using group supervision to keep the two independent. A diffusion model guided by pathological findings models the pathological features and generates diverse counterfactual images, while a large language model extracts lesion severity and location from medical reports to improve accuracy. Initial noise optimization improves the latent diffusion model's performance on long-tailed categories. The approach aims to make generated data more clinically relevant and the model more interpretable.
Abstract
Counterfactual medical image generation addresses data scarcity and enhances the interpretability of medical images. However, because medical images have complex and diverse pathological features and medical data has an imbalanced class distribution, generating high-quality and diverse medical images from limited data remains challenging. In addition, the information available in limited data, such as anatomical structure, must be fully exploited to generate structurally stable medical images without distortion or inconsistency. In this paper, to enhance the clinical relevance of generated data and improve the interpretability of the model, we propose a novel medical image generation framework that generates independent pathological and structural features through causal disentanglement and uses text-guided modeling of pathological features to regulate the generation of counterfactual images. First, we separate features through causal disentanglement and analyze the interactions between them, introducing group supervision to ensure that pathological and identity features are independent. Second, we use a diffusion model guided by pathological findings to model pathological features, enabling the generation of diverse counterfactual images. We further improve accuracy by using a large language model to extract lesion severity and location from medical reports. Finally, we improve the performance of the latent diffusion model on long-tailed categories through initial noise optimization.
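The abstract does not specify how group supervision or the counterfactual recombination is implemented. As an illustration only, the sketch below assumes a latent vector whose first `d_path` entries hold pathological features and whose remaining entries hold identity (structural) features, and uses a squared-error penalty on the factor that a pair of samples is known to share. The function names, the fixed split, and the choice of penalty are all hypothetical and are not taken from the paper.

```python
import numpy as np

def split_latent(z, d_path):
    """Split a latent vector into (pathology, identity) parts.
    Assumption: the first d_path entries encode pathology."""
    return z[..., :d_path], z[..., d_path:]

def group_supervision_loss(z_a, z_b, d_path, same_identity, same_pathology):
    """Penalize disagreement only on the factor the pair is known to share.
    Assumption: a mean squared error penalty; the paper's loss may differ."""
    p_a, s_a = split_latent(z_a, d_path)
    p_b, s_b = split_latent(z_b, d_path)
    loss = 0.0
    if same_identity:
        loss += float(np.mean((s_a - s_b) ** 2))
    if same_pathology:
        loss += float(np.mean((p_a - p_b) ** 2))
    return loss

def swap_pathology(z_source, z_target, d_path):
    """Build a counterfactual latent: identity part of z_source,
    pathology part of z_target."""
    p_t, _ = split_latent(z_target, d_path)
    _, s_s = split_latent(z_source, d_path)
    return np.concatenate([p_t, s_s], axis=-1)

# Two latents that share the identity part but differ in pathology.
z_a = np.array([1.0, 2.0, 3.0, 4.0])
z_b = np.array([5.0, 6.0, 3.0, 4.0])
identity_loss = group_supervision_loss(z_a, z_b, 2, True, False)
pathology_loss = group_supervision_loss(z_a, z_b, 2, False, True)
z_cf = swap_pathology(z_a, z_b, 2)
```

In this toy setting the identity penalty vanishes because the two latents agree on the identity entries, while the pathology penalty does not. In a full pipeline the swapped latent would be passed to a decoder or used to condition a diffusion model, which is outside the scope of this sketch.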
