Dual-Domain CLIP-Assisted Residual Optimization Perception Model for Metal Artifact Reduction

Xinrui Zhang, Ailong Cai, Shaoyu Wang, Linyuan Wang, Zhizhong Zheng, Lei Li, Bin Yan

TL;DR

This work tackles metal artifact reduction (MAR) in CT by introducing DuDoCROP, a dual-domain perceptual framework in which a fine-tuned visual-language model (DuDoCLIP) guides diffusion models (IR-SDE) in both the image and sinogram domains. The pipeline has two stages: a DuDoCLIP-assisted prior generation stage that yields dual-domain priors with the help of prompt-engineered text descriptions, and a downstream one-step residual optimization stage that enforces raw-data fidelity and fuses the priors into the final reconstruction. A new perceptual indicator (PI) is proposed to quantify generalization across diverse metal morphologies. Experiments on public and clinical data report better perceptual and numerical performance than state-of-the-art MAR methods, including generalization to head and clinical datasets. The results suggest that combining visual-language semantic information with diffusion priors yields robust, artifact-aware restorations and may extend to VLM-guided imaging tasks beyond MAR.

Abstract

Metal artifacts in computed tomography (CT) imaging pose significant challenges to accurate clinical diagnosis. High-density metallic implants produce artifacts that degrade image quality, appearing as streaking, blurring, or beam-hardening effects. Various deep learning-based approaches, particularly generative models, have been proposed for metal artifact reduction (MAR). However, these methods have limited ability to perceive the diverse morphologies of different metal implants and their artifacts, so they may generate spurious anatomical structures and generalize poorly. To address these issues, we leverage a visual-language model (VLM) to identify these morphological features and introduce them into a dual-domain CLIP-assisted residual optimization perception model (DuDoCROP) for MAR. Specifically, a dual-domain CLIP (DuDoCLIP) is fine-tuned on the image domain and the sinogram domain using contrastive learning to extract semantic descriptions of anatomical structures and metal artifacts. A diffusion model is then guided by the DuDoCLIP embeddings, enabling dual-domain prior generation. We also design prompt engineering to obtain more precise image-text descriptions, which enhances the model's perception capability. A downstream task then performs one-step residual optimization and integration of the dual-domain priors while incorporating raw data fidelity. Finally, a new perceptual indicator is proposed to validate the model's perception and generation performance. With the assistance of DuDoCLIP, DuDoCROP exhibits at least 63.7% higher generalization capability than the baseline model. Numerical experiments demonstrate that the proposed method generates more realistic image structures and outperforms other state-of-the-art (SOTA) approaches both qualitatively and quantitatively.
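The abstract states that DuDoCLIP is fine-tuned with contrastive learning but does not give the loss. As a rough illustration only, the sketch below implements the symmetric contrastive (InfoNCE) objective commonly associated with CLIP-style training, using plain Python lists. The function names, the temperature value of 0.07, and the toy embeddings are assumptions made for this example and are not taken from the paper.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Pair i (image_embs[i], text_embs[i]) is treated as the positive;
    every other combination in the batch is treated as a negative.
    The temperature is a hypothetical default, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: temperature-scaled cosine similarity of image i and text j
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text direction: the correct text sits on the diagonal.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation over the columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage: aligned pairs give a small loss, swapped pairs a large one.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In a dual-domain setting, one would presumably compute such a loss separately for image-text and sinogram-text pairs, but that arrangement is a guess rather than something the abstract specifies.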

Paper Structure (36 sections, 19 equations, 13 figures, 5 tables, 2 algorithms)

Figures (13)

  • Figure 1: The simplified framework of our DuDoCROP and the fundamental principles of the DuDoCLIP model. The metrics "PSNR/SSIM/LPIPS/PI" are used to evaluate the effect of the DuDoCLIP model, where the average PI $=(\text{PI}_s+\text{PI}_q)/2$ is proposed in Section III.D. Comparisons: (a) metal-affected images, (b) IR-SDE, (c) IR-SDE w/ DuDoCLIP.
  • Figure 2: The specific structure of the proposed DuDoCLIP. (a) The overall architecture of DuDoCLIP. (b) The interaction of the image encoder and the control net. The control net is a copy of the CLIP image encoder, which consists of cascaded vision transformer (ViT) blocks (Dosovitskiy et al., 2020). The zero convolution layer introduces conditional control while maintaining model stability, enabling fine adjustments to the generated embeddings.
  • Figure 3: The overall architecture of the proposed DuDoCROP. (a) The inputs of DuDoCROP are paired metal-affected images $X_\text{ma}$ and sinograms $Y_\text{ma}$. At the DuDoCLIP-assisted prior generation (DAPG) stage, the dual-domain IR-SDEs are guided by embeddings $e_{x}^{I_{1,2}}$ and $e_{y}^{I_{1,2}}$, respectively, and the prior images $X_p$ and $Y_p$ are generated with the assistance of DuDoCLIP. At the one-step residual optimization (OSRO) stage, the data fidelity from the raw sinogram $Y_\text{ma}$ is introduced using Eq. (\ref{deqn_ex13a}). Norm$^1$ normalizes $X_p$ using $X'_p$, and Norm$^2$ normalizes $\hat{X}_p$ using $X_p$; normalization reduces the distance between the two distributions. The residual $X_p-X'_p$ is refined into $\hat{X}_r$ through IR-SDE and used to update the image-domain prior to $\hat{X}_p$ using Eq. (\ref{deqn_ex16a}). Finally, the two priors are fused with weighting factors to produce the final output. (b) The image and control embeddings are introduced into the conditional network of IR-SDE through integration and cross-attention (explained in the second part of Section III.B).
  • Figure 4: The IR-SDE diffusion model comprises a forward SDE and a reverse SDE. Each step of the reverse process can be regarded as a prediction of the mean and variance.
  • Figure 5: The prompt engineering of DuDoCLIP. We adopt the prompt pattern "Qualifier + Instruction" to generate the expected text description. A1-A4 are the descriptions produced by prompts P1-P4, respectively. The accuracy of the prompt determines the proportion of positive and negative (P/N) words in the text description.
  • ...and 8 more figures
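
The Figure 2 caption credits a zero convolution layer with adding conditional control without destabilizing the model. The minimal sketch below illustrates why, under the assumption (borrowed from the general ControlNet design, not confirmed by the caption) that the control branch output passes through a zero-initialized projection and is added to the frozen encoder's embedding. A plain linear map stands in for a 1x1 convolution, and all function names and values are hypothetical.

```python
def make_zero_projection(dim):
    """A dim x dim linear map (a stand-in for a 1x1 convolution) initialised to zeros."""
    return [[0.0] * dim for _ in range(dim)]

def apply_projection(weights, vec):
    """Matrix-vector product written out for plain Python lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def controlled_embedding(base_emb, control_feat, zero_proj):
    """Add the projected control feature to the base encoder's embedding.

    Assumption: the control branch is merged additively. With all-zero
    weights the correction vanishes, so the output equals base_emb.
    """
    delta = apply_projection(zero_proj, control_feat)
    return [b + d for b, d in zip(base_emb, delta)]

# Toy usage with made-up numbers.
base = [1.0, 2.0]        # embedding from the frozen image encoder
control = [5.0, -3.0]    # feature from the trainable control branch
proj = make_zero_projection(2)

at_init = controlled_embedding(base, control, proj)        # identical to base
proj[0][0] = 0.1                                            # pretend one training step changed a weight
after_update = controlled_embedding(base, control, proj)    # control now shifts the embedding
```

Because the projection starts at zero, fine-tuning begins from exactly the pretrained encoder's behaviour, and the control signal only takes effect as the projection weights move away from zero during training.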