SALIENT: Frequency-Aware Paired Diffusion for Controllable Long-Tail CT Detection

Yifan Li; Mehrdad Salimitari; Taiyu Zhang; Guang Li; David Dreizin

TL;DR

SALIENT shows that frequency-aware diffusion enables controllable, computationally efficient precision rescue in long-tail CT detection while also improving generative realism.

Abstract

Detection of rare lesions in whole-body CT is fundamentally limited by extreme class imbalance and low target-to-volume ratios, producing precision collapse despite high AUROC. Synthetic augmentation with diffusion models offers promise, yet pixel-space diffusion is computationally expensive, and existing mask-conditioned approaches lack controllable attribute-level regulation and paired supervision for accountable training. We introduce SALIENT, a mask-conditioned wavelet-domain diffusion framework that synthesizes paired lesion-masking volumes for controllable CT augmentation under long-tail regimes. Instead of denoising in pixel space, SALIENT performs structured diffusion over discrete wavelet coefficients, explicitly separating low-frequency brightness from high-frequency structural detail. Learnable frequency-aware objectives disentangle target and background attributes (structure, contrast, edge fidelity), enabling interpretable and stable optimization. A 3D VAE generates diverse volumetric lesion masks, and a semi-supervised teacher produces paired slice-level pseudo-labels for downstream mask-guided detection. SALIENT improves generative realism, as reflected by higher MS-SSIM (0.63 to 0.83) and lower FID (118.4 to 46.5). In a separate downstream evaluation, SALIENT-augmented training improves long-tail detection performance, yielding disproportionate AUPRC gains across low prevalences and target-to-volume ratios. Optimal synthetic ratios shift from 2x to 4x as labeled seed size decreases, indicating a seed-dependent augmentation regime under low-label conditions. SALIENT demonstrates that frequency-aware diffusion enables controllable, computationally efficient precision rescue in long-tail CT detection.
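The precision collapse described at the start of the abstract follows directly from Bayes' rule: a detector with fixed sensitivity and specificity (and therefore a fixed ROC curve) loses precision as prevalence falls. The sketch below illustrates this with made-up operating points; the function name and the 0.9/0.9 values are assumptions chosen for illustration, not numbers from the paper.

```python
def precision_at_prevalence(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value (precision) implied by Bayes' rule.

    All three inputs are fractions in [0, 1]. The operating point is
    hypothetical; it is not taken from the SALIENT experiments.
    """
    true_pos = sensitivity * prevalence                    # P(test+, disease+)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # P(test+, disease-)
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Same detector, increasingly rare target class.
    for prev in (0.5, 0.1, 0.01, 0.001):
        ppv = precision_at_prevalence(0.9, 0.9, prev)
        print(f"prevalence={prev:<6} precision={ppv:.3f}")
```

Because sensitivity and specificity do not depend on prevalence, AUROC stays unchanged across these rows while precision drops, which is why the abstract reports AUPRC for the long-tail setting.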

Paper Structure (28 sections, 19 equations, 5 figures, 1 table)

Introduction
Related Work
Long-Tail Detection in Medical Imaging
Synthetic Augmentation with Diffusion Models
Augmentation Dose-Response
Proposed Method
Dataset
SALIENT
Overview.
Wavelet-Domain Formulation.
Mask-Guided Wavelet UNet.
Objective and Modeling Principle
Wavelet-Aware Training Objective
Structured Classifier-Free Guidance
3D VAE for Volumetric Lesion Mask Generation
...and 13 more sections

Figures (5)

Figure 1: Overview of the proposed SALIENT synthetic data and classification pipeline. Real CT volumes and lesion masks are first processed by a 3D VAE to generate diverse volumetric masks, which are projected into 2D slice space and used as conditioning signals for the wavelet-domain diffusion model. SALIENT operates on discrete wavelet coefficients to synthesize mask-guided CT slices, which are subsequently pseudo-labeled by a semi-supervised segmentation teacher (UCMT). The resulting synthetic CT-mask pairs augment training for a slice-level mask-guided ResNet-50 classifier. Slice-level predictions are further aggregated into subject-level decisions using an Embedded Vision Transformer (EViT) [islam2024seeking].
Figure 2: Architecture of SALIENT in the wavelet domain. A central CT slice and its axial neighbors are transformed into wavelet coefficients (LL, LH, HL, HH). A mask-gated frequency scaling (FSA) module modulates the noisy coefficients before concatenation with a 2.5D conditioning stack. A time-conditioned UNet predicts clean wavelet coefficients at each diffusion step, which are reconstructed into synthetic CT slices via inverse DWT.
Figure 3: Comparison of synthetic CT slices generated by pixel-space MedDDPM and wavelet-domain SALIENT. SALIENT produces sharper anatomical boundaries, reduced high-frequency noise, and improved contrast stability in mediastinal regions.
Figure 4: Quantitative frequency and intensity comparison between Real CT, pixel-space MedDDPM, and SALIENT. (Left) Per-slice LL standard deviation. (Middle) High-frequency variance per slice (LH/HL/HH). (Right) ROI intensity histograms using method-specific masks.
Figure 5: Saliency alignment improves with paired augmentation. Top: training without mask guidance; middle: without synthetic data; bottom: with SALIENT paired CT-mask augmentation. SALIENT encourages lesion-focused evidence and reduces spurious activations on irrelevant anatomy.
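
The LL/LH/HL/HH decomposition named in Figure 2 and the per-slice band statistics plotted in Figure 4 can be sketched with a single-level 2D Haar transform. This is a minimal NumPy illustration, assuming a Haar wavelet and even-sized slices; the source does not state which wavelet family SALIENT uses, and the helper names (`haar_dwt2`, `haar_idwt2`, `band_stats`) and the pooled-variance definition are hypothetical.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)


def haar_dwt2(x):
    """Single-level orthonormal 2D Haar DWT of an even-sized 2D array.

    Returns (LL, LH, HL, HH). Sub-band naming conventions vary between
    libraries; here the first letter refers to the filter applied along
    the width and the second to the filter applied along the height.
    """
    x = np.asarray(x, dtype=np.float64)
    lo = (x[:, 0::2] + x[:, 1::2]) / SQRT2   # low-pass along width
    hi = (x[:, 0::2] - x[:, 1::2]) / SQRT2   # high-pass along width
    ll = (lo[0::2] + lo[1::2]) / SQRT2
    lh = (lo[0::2] - lo[1::2]) / SQRT2
    hl = (hi[0::2] + hi[1::2]) / SQRT2
    hh = (hi[0::2] - hi[1::2]) / SQRT2
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the slice from its four sub-bands."""
    h, w = ll.shape
    lo = np.empty((2 * h, w))
    hi = np.empty((2 * h, w))
    lo[0::2], lo[1::2] = (ll + lh) / SQRT2, (ll - lh) / SQRT2
    hi[0::2], hi[1::2] = (hl + hh) / SQRT2, (hl - hh) / SQRT2
    x = np.empty((2 * h, 2 * w))
    x[:, 0::2], x[:, 1::2] = (lo + hi) / SQRT2, (lo - hi) / SQRT2
    return x


def band_stats(slice_2d):
    """Return (LL standard deviation, pooled LH/HL/HH variance) for one slice.

    Pooling the three detail bands is an assumption; the paper may define
    its high-frequency variance differently.
    """
    ll, lh, hl, hh = haar_dwt2(slice_2d)
    detail = np.concatenate([lh.ravel(), hl.ravel(), hh.ravel()])
    return float(ll.std()), float(detail.var())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_slice = rng.normal(size=(64, 64))   # stand-in for a CT slice
    bands = haar_dwt2(fake_slice)
    print("perfect reconstruction:", np.allclose(haar_idwt2(*bands), fake_slice))
    print("LL std, high-frequency variance:", band_stats(fake_slice))
```

Because the transform is invertible, a model that predicts clean coefficients can be mapped back to image space exactly, and each sub-band is a quarter of the slice area, which is one plausible source of the efficiency gain over pixel-space diffusion.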
