Causal Disentanglement for Robust Long-tail Medical Image Generation

Weizhi Nie, Zichun Zhang, Weijie Wang, Bruno Lepri, Anan Liu, Nicu Sebe

TL;DR

The paper tackles text-to-3D generation under data scarcity with a framework that integrates a 3D shape knowledge graph with a causal feature-selection mechanism (backdoor adjustment) to filter priors. A transformer-based Prior Fusion Module combines shape priors and attribute priors with textual features and feeds a generative network trained with an autoencoder loss and a text-3D alignment loss, while a prior-guided IMLE strategy increases output diversity. Empirical results on Text2Shape show improvements over state-of-the-art baselines on multiple metrics (e.g., IoU, FPD, PS, CLIP R-P), and ablations demonstrate the value of causal feature selection and structured priors. The approach advances robust cross-modal 3D generation by leveraging structured knowledge and causal reasoning, with potential to address long-tail and ambiguous text descriptions in 3D synthesis.

Abstract

Counterfactual medical image generation effectively addresses data scarcity and enhances the interpretability of medical images. However, due to the complex and diverse pathological features of medical images and the imbalanced class distribution in medical data, generating high-quality and diverse medical images from limited data is significantly challenging. Additionally, it is necessary to fully leverage the information in limited data, such as anatomical structure information, to generate more structurally stable medical images while avoiding distortion or inconsistency. In this paper, to enhance the clinical relevance of generated data and improve the interpretability of the model, we propose a novel medical image generation framework, which generates independent pathological and structural features based on causal disentanglement and utilizes text-guided modeling of pathological features to regulate the generation of counterfactual images. First, we achieve feature separation through causal disentanglement and analyze the interactions between features. Here, we introduce group supervision to ensure the independence of pathological and identity features. Second, we leverage a diffusion model guided by pathological findings to model pathological features, enabling the generation of diverse counterfactual images. Meanwhile, we enhance accuracy by leveraging a large language model to extract lesion severity and location from medical reports. Additionally, we improve the performance of the latent diffusion model on long-tailed categories through initial noise optimization.

Paper Structure

This paper contains 36 sections, 17 equations, 13 figures, 6 tables, 1 algorithm.

Figures (13)

  • Figure 1: a) A single caption can only describe part of the appearance of a 3D object, and ambiguous descriptions may cause difficulties for text-3D works. b) Inspired by the human thinking mode, we think the two types of prior knowledge (semantic attributes and related shapes) can be used to provide more detailed information and enhance the text-3D generation task.
  • Figure 2: The overall framework of T2TD mainly includes three parts: a) A pre-trained representation module, which learns the 3D geometric information through an autoencoder and learns text-3D joint representations through cross-modal contrastive learning. b) Constructing the text-3D knowledge graph to structurally associate the texts and 3D shapes, which is used to provide prior information for the generative network. c) A text-3D generation network to leverage text input and retrieve prior knowledge to generate 3D shapes.
  • Figure 3: The basic architecture of the encoder networks. (a) The transformer-based text encoder, which converts the input text description into a global sentence feature. (b) The CNN-based 3D shape encoder, which converts the colored 3D volume into a global 3D feature. (c) The implicit shape decoder, which takes a 3D shape feature and a point coordinate as input and predicts the occupancy probability or the RGB value of each sampled position.
  • Figure 4: Overview of the framework implementation, which mainly consists of four parts: a) Constructing the knowledge graph by defining the entities and relations in the graph. b) Retrieving two types of prior knowledge and extracting features with the proposed encoders. c) Training the text-3D generative network, which mainly aims to reduce the gap between the text and 3D modalities by introducing prior knowledge. d) To further diversify the generation results, we propose a prior-guided IMLE, adapted to our method, that fully utilizes the prior knowledge.
  • Figure 5: Backdoor Adjustment. The directed edges represent the causal relations between two variables.
  • ...and 8 more figures
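
For reference, the backdoor adjustment named in the Figure 5 caption can be written in its standard textbook form as below. This is a generic statement of the formula, not the paper's own notation: the symbols X (treatment), Y (outcome), and Z (a set of observed variables assumed to satisfy the backdoor criterion relative to X and Y) are illustrative assumptions, and how the paper maps them onto its own variables is not specified on this page.

```latex
% Backdoor adjustment (generic form; X, Y, Z are illustrative symbols,
% with Z assumed to satisfy the backdoor criterion for (X, Y)).
\begin{equation}
  P\bigl(Y \mid \mathrm{do}(X = x)\bigr)
  = \sum_{z} P\bigl(Y \mid X = x,\, Z = z\bigr)\, P(Z = z)
\end{equation}
```

The sum weights each stratum of Z by its marginal probability P(Z = z) rather than by P(Z = z | X = x), which is what removes the confounding path through Z.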