Teacher-Guided Causal Interventions for Image Denoising: Orthogonal Content-Noise Disentanglement in Vision Transformers

Kuai Jiang, Zhaoyan Ding, Guijuan Zhang, Dianjie Lu, Zhuoran Zheng

TL;DR

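This work proposes the Teacher-Guided Causal Disentanglement Network (TCD-Net), which explicitly decomposes the generative mechanism via structured interventions on feature spaces within a Vision Transformer framework. It integrates three components: an Environmental Bias Adjustment (EBA) module for de-confounding, a dual-branch head with an orthogonality constraint that separates content from noise representations, and teacher guidance from Nano Banana Pro that pulls content representations back onto the natural-image manifold.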

Abstract

Conventional image denoising models often inadvertently learn spurious correlations between environmental factors and noise patterns. Moreover, due to high-frequency ambiguity, they struggle to reliably distinguish subtle textures from stochastic noise, resulting in over-removed details or residual noise artifacts. We therefore revisit denoising via causal intervention, arguing that purely correlational fitting entangles intrinsic content with extrinsic noise, which directly degrades robustness under distribution shifts. Motivated by this, we propose the Teacher-Guided Causal Disentanglement Network (TCD-Net), which explicitly decomposes the generative mechanism via structured interventions on feature spaces within a Vision Transformer framework. Specifically, our method integrates three key components: (1) An Environmental Bias Adjustment (EBA) module projects features into a stable, de-centered subspace to suppress global environmental bias (de-confounding). (2) A dual-branch disentanglement head employs an orthogonality constraint to force a strict separation between content and noise representations, preventing information leakage. (3) To resolve structural ambiguity, we leverage Nano Banana Pro, Google's reasoning-guided AI image generation model, to guide a causal prior, effectively pulling content representations back onto the natural-image manifold. Extensive experiments demonstrate that TCD-Net outperforms mainstream methods across multiple benchmarks in both fidelity and efficiency, achieving a real-time speed of 104.2 FPS on a single RTX 5090 GPU.
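The orthogonality constraint in component (2) can be illustrated with a small sketch. The abstract does not give the exact loss, so the form below (mean squared cosine similarity between paired content and noise feature vectors) is an assumption, and the function name `orthogonality_penalty` is hypothetical. It is written in NumPy for readability rather than as a training-ready implementation.

```python
import numpy as np

def orthogonality_penalty(content, noise, eps=1e-8):
    """Mean squared cosine similarity between paired feature vectors.

    content, noise: arrays of shape (num_tokens, dim).
    The value is 0 when every content/noise pair is orthogonal and
    approaches 1 when every pair is collinear.
    NOTE: assumed form; the paper's actual loss may differ.
    """
    # Normalize each token vector to unit length (eps avoids division by zero).
    c = content / (np.linalg.norm(content, axis=-1, keepdims=True) + eps)
    n = noise / (np.linalg.norm(noise, axis=-1, keepdims=True) + eps)
    # Per-token cosine similarity between the two branches.
    cos = np.sum(c * n, axis=-1)
    return float(np.mean(cos ** 2))
```

Squaring the cosine penalizes both positive and negative alignment, so minimizing it pushes the two branches toward orthogonal directions rather than merely opposite ones.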

Paper Structure

This paper contains 26 sections, 13 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Efficiency-performance trade-off. PSNR vs. FPS on CBSD68 ($\sigma = 15$) and SIDD. TCD-Net achieves the best speed-quality trade-off among competing methods.
  • Figure 2: Conceptual comparison between conventional denoising and TCD-Net. (a) Conventional denoising is prone to spurious content-noise correlations induced by environmental factors $E$, leaving residual artifacts. (b) TCD-Net performs causal intervention via EBA, incorporates teacher semantic guidance, and enforces an orthogonality constraint to decouple content and noise for cleaner restoration.
  • Figure 3: Overview of TCD-Net. A ViT backbone with EBA feeds a dual-branch head to predict the restored image $\hat{X}$ and noise map $\hat{N}$, trained with orthogonality, noise anchoring, and teacher guidance.
  • Figure 4: EBA module. LayerNorm + bottleneck MLP with residual projection to suppress environment-induced bias and stabilize token representations.
  • Figure 5: Qualitative comparison on synthetic and real noise. (a) Synthetic Gaussian denoising (AWGN). (b) Real-world denoising on SIDD. Conventional methods may leave residual noise or oversmooth details due to spurious content-noise correlation. With EBA-based intervention, orthogonal disentanglement, and teacher guidance, TCD-Net restores cleaner results with sharper textures and edges.
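The EBA description in the Figure 4 caption (LayerNorm, a bottleneck MLP, and a residual projection) can be sketched as a forward pass. This is a minimal illustration under stated assumptions: the GELU activation, the absence of biases and learned LayerNorm parameters, the reading of "residual projection" as a plain skip connection, and the names `eba_forward`, `w_down`, and `w_up` are all hypothetical and not taken from the paper.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """De-center and rescale each token over its feature dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    """Tanh approximation of GELU (assumed activation)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def eba_forward(tokens, w_down, w_up):
    """Sketch of an EBA-style block.

    tokens: (num_tokens, dim) ViT token features.
    w_down: (dim, hidden) projection into the bottleneck, hidden < dim.
    w_up:   (hidden, dim) projection back to the token dimension.
    """
    normed = layer_norm(tokens)        # remove per-token mean and scale
    hidden = gelu(normed @ w_down)     # compress through the bottleneck
    return tokens + hidden @ w_up      # skip connection (assumed residual form)
```

With `w_up` initialized to zeros, the block starts as an identity mapping, which is a common way to add such a module to a backbone without disturbing its initial behavior.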