Causal-SAM-LLM: Large Language Models as Causal Reasoners for Robust Medical Segmentation

Tao Tang, Shijie Xu, Jionglong Su, Zhixiang Lu

TL;DR

The paper tackles the generalization gap in medical image segmentation caused by domain-style confounds. It introduces Causal-SAM-LLM, which freezes a Segment Anything Model (SAM) encoder and adds two causal mechanisms: Linguistic Adversarial Disentanglement (LAD), which purges style-related information from features, and Test-Time Causal Intervention (TCI), in which an LLM modulates the decoder via FiLM in response to natural-language prompts. LAD uses a Vision-Language Model to generate detailed style descriptions and optimizes a contrastive objective against a CLIP-based embedding of those descriptions to enforce semantic disentanglement. On a composite benchmark spanning cross-scanner, cross-modality, and cross-anatomy shifts (BTCV, CHAOS, AMOS, BraTS), Causal-SAM-LLM achieves state-of-the-art out-of-distribution (OOD) robustness, boosting average Dice by up to 6.2 points and reducing Hausdorff Distance (HD) by up to 15.8 mm while using under 9% of the full model's trainable parameters. It also enables practical, interactive error correction through language prompts.

Abstract

The clinical utility of deep learning models for medical image segmentation is severely constrained by their inability to generalize to unseen domains. This failure is often rooted in the models learning spurious correlations between anatomical content and domain-specific imaging styles. To overcome this fundamental challenge, we introduce Causal-SAM-LLM, a novel framework that elevates Large Language Models (LLMs) to the role of causal reasoners. Our framework, built upon a frozen Segment Anything Model (SAM) encoder, incorporates two synergistic innovations. First, Linguistic Adversarial Disentanglement (LAD) employs a Vision-Language Model to generate rich, textual descriptions of confounding image styles. By training the segmentation model's features to be contrastively dissimilar to these style descriptions, it learns a representation robustly purged of non-causal information. Second, Test-Time Causal Intervention (TCI) provides an interactive mechanism where an LLM interprets a clinician's natural language command to modulate the segmentation decoder's features in real-time, enabling targeted error correction. We conduct an extensive empirical evaluation on a composite benchmark from four public datasets (BTCV, CHAOS, AMOS, BraTS), assessing generalization under cross-scanner, cross-modality, and cross-anatomy settings. Causal-SAM-LLM establishes a new state of the art in out-of-distribution (OOD) robustness, improving the average Dice score by up to 6.2 points and reducing the Hausdorff Distance by 15.8 mm over the strongest baseline, all while using less than 9% of the full model's trainable parameters. Our work charts a new course for building robust, efficient, and interactively controllable medical AI systems.
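
To make the LAD idea concrete, the following is a minimal sketch of a loss that discourages image features from aligning with a style-description embedding. It is an illustration under stated assumptions, not the paper's objective: the abstract only says the features are trained to be "contrastively dissimilar" to the style descriptions, so the squared-cosine penalty, the weight `lam`, and the function names `disentanglement_loss` and `total_loss` are all hypothetical choices made here for clarity.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon avoids division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-8)


def disentanglement_loss(image_feature, style_embedding):
    """Hypothetical stand-in for L_dis (assumption, not the paper's formula).

    Penalizes the squared cosine similarity between the image feature and the
    text embedding of the style description. The penalty is zero when the two
    vectors are orthogonal and close to one when they are parallel.
    """
    return cosine_similarity(image_feature, style_embedding) ** 2


def total_loss(seg_loss, image_feature, style_embedding, lam=0.1):
    """Segmentation loss plus a weighted style penalty; `lam` is an assumed weight."""
    return seg_loss + lam * disentanglement_loss(image_feature, style_embedding)


if __name__ == "__main__":
    feature = [0.8, 0.1, 0.3]  # toy image feature
    style = [0.7, 0.2, 0.4]    # toy embedding of a style description
    print(total_loss(seg_loss=0.5, image_feature=feature, style_embedding=style))
```

In this sketch, a feature that carries no style information in the direction of the text embedding incurs no penalty, so minimizing the combined loss pushes the representation toward style invariance while the segmentation term preserves anatomical content.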

Paper Structure

This paper contains 22 sections, 15 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The Causal-SAM-LLM Framework. (Top) Training Phase: A frozen SAM encoder provides features $\mathbf{f}$. A content head $\mathcal{H}_c$ is trained for segmentation ($\mathcal{L}_{seg}$). Simultaneously, a frozen VLM acts as a linguistic adversary, generating a rich text description of the image's style, $t_{style}$. A contrastive disentanglement loss ($\mathcal{L}_{dis}$) pushes the image features $\mathbf{f}$ away from the text embedding of $t_{style}$, forcing the model to learn style-invariant representations. (Bottom) Inference Phase: At test time, a user provides a natural language prompt $p_{user}$ describing a segmentation error. A Causal Reasoner (LLM) interprets the prompt and predicts modulation parameters $(\boldsymbol{\gamma}, \boldsymbol{\beta})$ for FiLM layers within the Content Head, producing a corrected segmentation mask.
  • Figure 2: Qualitative results on challenging OOD samples. Our method is more robust to domain shifts and can be interactively corrected via language prompts to achieve superior accuracy.
  • Figure 3: Feature-space and performance analysis. (a) Our method forces source and target domain features into a single, intertwined cluster. (b) It consistently outperforms baselines across all 13 abdominal organs on an OOD dataset.
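
The FiLM modulation mentioned in the Figure 1 caption can be sketched as a per-channel scale and shift, $\gamma_c \cdot x + \beta_c$. The snippet below is a minimal illustration of that standard operation only. How the LLM maps a prompt to $(\boldsymbol{\gamma}, \boldsymbol{\beta})$ is not described in this section, so the parameter values used here are hand-picked placeholders, and the function name `film` and the list-of-channels layout are assumptions.

```python
def film(features, gamma, beta):
    """Apply feature-wise linear modulation (FiLM) to a feature map.

    features: list of channels, each channel a flat list of activations.
    gamma:    per-channel scale factors (one float per channel).
    beta:     per-channel shifts (one float per channel).
    Returns a new feature map where every activation x in channel c
    becomes gamma[c] * x + beta[c].
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma, and beta must have one entry per channel")
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


if __name__ == "__main__":
    decoder_features = [[1.0, 2.0], [3.0, 4.0]]  # toy map: 2 channels, 2 positions

    # gamma = 1 and beta = 0 leave the features unchanged (no intervention).
    print(film(decoder_features, [1.0, 1.0], [0.0, 0.0]))

    # Placeholder values standing in for what a causal reasoner might predict:
    # amplify channel 0, damp and shift channel 1.
    print(film(decoder_features, [2.0, 0.5], [0.0, 1.0]))
```

Because the identity setting ($\gamma = 1$, $\beta = 0$) reproduces the unmodulated features, a design of this kind lets the decoder behave normally when no prompt is given and deviate only when an intervention is requested.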