Dual-Domain CLIP-Assisted Residual Optimization Perception Model for Metal Artifact Reduction
Xinrui Zhang, Ailong Cai, Shaoyu Wang, Linyuan Wang, Zhizhong Zheng, Lei Li, Bin Yan
TL;DR
This work tackles metal artifact reduction (MAR) in CT by introducing DuDoCROP, a dual-domain perceptual framework in which a vision-language prior (DuDoCLIP) guides diffusion (IR-SDE) in both the image and sinogram domains. It couples a prompt-engineered DuDoCLIP with a two-stage pipeline: a deep-learning-based prior generation stage that yields dual-domain priors, and a downstream residual optimization stage that enforces raw data fidelity and fuses the priors for the final reconstruction. A new perceptual indicator (PI) is proposed to quantify generalization across diverse metal morphologies. Experiments on public and clinical data show better perceptual and numerical performance than state-of-the-art MAR methods, including generalization to head and clinical datasets. The results suggest that combining vision-language semantic information with diffusion priors yields robust, artifact-aware restorations, and they point to VLM-guided approaches for imaging tasks beyond MAR.
Abstract
Metal artifacts in computed tomography (CT) imaging pose significant challenges to accurate clinical diagnosis. High-density metallic implants produce artifacts that degrade image quality, manifesting as streaking, blurring, or beam-hardening effects. Various deep learning-based approaches, particularly generative models, have been proposed for metal artifact reduction (MAR). However, these methods have limited ability to perceive the diverse morphologies of metal implants and their artifacts, so they may generate spurious anatomical structures and generalize poorly. To address these issues, we leverage a visual-language model (VLM) to identify these morphological features and introduce them into a dual-domain CLIP-assisted residual optimization perception model (DuDoCROP) for MAR. Specifically, a dual-domain CLIP (DuDoCLIP) is fine-tuned on the image domain and the sinogram domain using contrastive learning to extract semantic descriptions of anatomical structures and metal artifacts. A diffusion model is then guided by the DuDoCLIP embeddings, enabling dual-domain prior generation. Additionally, we design prompt engineering that produces more precise image-text descriptions and thereby enhances the model's perception capability. A downstream task is then devised for one-step residual optimization and integration of the dual-domain priors while incorporating raw data fidelity. Finally, a new perceptual indicator is proposed to validate the model's perception and generation performance. With the assistance of DuDoCLIP, our DuDoCROP exhibits at least 63.7% higher generalization capability than the baseline model. Numerical experiments demonstrate that the proposed method generates more realistic image structures and outperforms other state-of-the-art (SOTA) approaches both qualitatively and quantitatively.
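
The contrastive fine-tuning mentioned above can be illustrated with a minimal sketch of the standard CLIP-style symmetric contrastive loss. This is not the authors' implementation: the abstract does not state the exact loss, encoders, or temperature used for DuDoCLIP, so the function name `clip_contrastive_loss`, the temperature of 0.07, and the use of plain NumPy arrays in place of encoder outputs are all assumptions made for illustration.

```python
import numpy as np


def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_emb, text_emb: arrays of shape (batch, dim); row i of each is
    assumed to be a matched pair (e.g. a CT image or sinogram and its
    text description). The temperature value is an assumed default.
    """
    # L2-normalise so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.T / temperature

    # Matched pairs sit on the diagonal; score them in both directions.
    idx = np.arange(logits.shape[0])
    loss_image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_image_to_text + loss_text_to_image)
```

In a dual-domain setting, one plausible arrangement (again an assumption) is to compute such a loss separately for image-text pairs and sinogram-text pairs, so that each domain's encoder learns to align with the descriptions of anatomy and metal artifacts.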
