MedSteer: Counterfactual Endoscopic Synthesis via Training-Free Activation Steering

Trong-Thang Pham, Loc Nguyen, Anh Nguyen, Hien Nguyen, Ngan Le

TL;DR

MedSteer is a training-free activation-steering framework for endoscopic synthesis that outperforms the best inversion-based baseline in both concept flip rate and structural preservation.

Abstract

Generative diffusion models are increasingly used for medical imaging data augmentation, but text prompting cannot produce causal training data. Re-prompting rerolls the entire generation trajectory, altering anatomy, texture, and background. Inversion-based editing methods introduce reconstruction error that causes structural drift. We propose MedSteer, a training-free activation-steering framework for endoscopic synthesis. MedSteer identifies a pathology vector for each contrastive prompt pair in the cross-attention layers of a diffusion transformer. At inference time, it steers image activations along this vector, generating counterfactual pairs from scratch in which the only difference is the steered concept. All other structure is preserved by construction. We evaluate MedSteer across three experiments on Kvasir v3 and HyperKvasir. On counterfactual generation across three clinical concept pairs, MedSteer achieves flip rates of 0.800, 0.925, and 0.950, outperforming the best inversion-based baseline in both concept flip rate and structural preservation. On dye disentanglement, MedSteer achieves 75% dye removal, compared with 20% for PnP and 10% for h-Edit. On downstream polyp detection, augmenting with MedSteer counterfactual pairs achieves a ViT AUC of 0.9755 versus 0.9083 for quantity-matched re-prompting, confirming that counterfactual structure drives the gain. Code is available at https://github.com/phamtrongthang123/medsteer
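The pathology-vector estimation can be pictured as a mean-difference computation over contrastive prompts. The NumPy snippet below is a minimal sketch, not the released MedSteer code. It assumes the cross-attention features for one layer $l$ and step $t$ are stored as an array of shape (Z, N, D) (seeds, image tokens, channels), that the mean is taken over both seeds and tokens, and that the vector points from the negative toward the positive prompt. The function name `estimate_pathology_vector` is hypothetical.

```python
import numpy as np


def estimate_pathology_vector(h_pos: np.ndarray, h_neg: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Hypothetical mean-difference pathology vector for one (layer, step).

    h_pos, h_neg: cross-attention features of shape (Z, N, D), collected
    over Z random seeds and N image tokens (assumed layout).
    Returns a unit vector of shape (D,).
    """
    # Mean step: average over seeds and tokens (token averaging is an assumption)
    mean_pos = h_pos.mean(axis=(0, 1))
    mean_neg = h_neg.mean(axis=(0, 1))
    # Subtract & Normalize step: direction from the negative to the positive concept
    diff = mean_pos - mean_neg
    return diff / (np.linalg.norm(diff) + eps)


# Usage with synthetic features: the positive set is shifted along channel 2
rng = np.random.default_rng(0)
h_neg = rng.normal(size=(4, 16, 8))
offset = np.zeros(8)
offset[2] = 5.0
h_pos = h_neg + offset
v = estimate_pathology_vector(h_pos, h_neg)
```

Repeating this for every layer and step in the steered range would yield one unit vector per $(l, t)$, matching the $v_{l,t}$ notation used in Figure 1.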

Paper Structure (14 sections, 3 figures, 4 tables)

Figures (3)

  • Figure 1: MedSteer method pipeline. A&N: Add & Norm. FF: Feed Forward. CA: Cross-Attention. SA: Self-Attention. (a) Offline Pathology Vector Estimation: CA features $h_{l,t,z}$ are collected from the frozen DiT for positive and negative prompts across $Z$ random seeds. A Mean step yields $\bar{h}^{pos}_{l,t}$ and $\bar{h}^{neg}_{l,t}$. An S&N (Subtract & Normalize) step then produces the unit pathology vector $v_{l,t}$ (dyed lifted polyp $\to$ polyp). (b) Inference-Time Steering: Unsteered Inference runs the frozen DiT unmodified. Steered Inference applies Spatially Selective Pathology Steering (SSPS) at layers $l \in \{L_s,\dots,L_e\}$ across all $T$ denoising steps. Inside SSPS, a CSG (Cosine-Similarity Gate) produces the per-token score $\sigma_{l,t}$, which is scaled by $\alpha$ and fed to an Update step that subtracts the aligned component from $h_{l,t}$, yielding the counterfactual activation $h'_{l,t}$. Although both inference branches are shown together here, only the desired branch is executed in practice.
  • Figure 2: Qualitative comparison across concept pairs. Rows: Unsteered, PnP, h-Edit, and MedSteer (Ours). Columns show Polyp $\to$ Normal Cecum, Ulcerative Colitis $\to$ Normal Cecum, Esophagitis $\to$ Normal Z-line, and dye steering.
  • Figure 3: Left: Per-token cosine similarity maps $\sigma_{8,t}$ at layer 8 for selected diffusion steps $t\in\{8,12,14,19\}$ (left to right). Warmer colours indicate stronger alignment with the pathology vector. Right: corresponding unsteered (dyed lifted polyps) and steered (normal cecum) endoscopy images.
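The gated update inside SSPS can be illustrated token by token. The sketch below assumes that the CSG score is the plain cosine similarity $\sigma_{l,t}$ between each image token and $v_{l,t}$, that negative scores are clipped to zero so only concept-aligned tokens are steered, and that the Update step removes an $\alpha\sigma$-weighted share of each token's projection onto $v_{l,t}$. These choices and the name `ssps_update` are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np


def ssps_update(h: np.ndarray, v: np.ndarray, alpha: float, eps: float = 1e-8):
    """Hypothetical spatially selective steering for one layer and step.

    h: image-token activations of shape (N, D).
    v: unit pathology vector of shape (D,).
    Returns the steered activations (N, D) and per-token scores (N,).
    """
    # CSG: per-token cosine similarity with the pathology vector
    sigma = (h @ v) / (np.linalg.norm(h, axis=1) * np.linalg.norm(v) + eps)
    # Gate (assumed): tokens not positively aligned with v are left untouched
    gate = np.clip(sigma, 0.0, None)
    # Aligned component of each token along v
    proj = (h @ v)[:, None] * v[None, :]
    # Update: subtract an alpha * gate share of the aligned component
    return h - alpha * gate[:, None] * proj, sigma


# Usage: one aligned, one orthogonal, and one anti-aligned token
v = np.array([1.0, 0.0, 0.0])
h = np.array([[2.0, 0.0, 0.0],
              [0.0, 3.0, 0.0],
              [-1.0, 1.0, 0.0]])
h_new, sigma = ssps_update(h, v, alpha=1.0)
```

In this sketch, tokens orthogonal or opposed to the pathology direction pass through unchanged, which is one way to realize the spatial selectivity the captions describe; reshaping `sigma` to the latent token grid would give a map comparable in spirit to Figure 3.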