Joint Enhancement and Classification using Coupled Diffusion Models of Signals and Logits

Gilad Nurko, Roi Benita, Yehoshua Dissen, Tomohiro Nakatani, Marc Delcroix, Shoko Araki, Joseph Keshet

TL;DR

This work tackles robust classification under noise by coupling two diffusion processes: one operating on the input signal and another on the classifier's logits, with mutual guidance that refines both signal reconstruction and semantic predictions without retraining the classifier. It formalizes three interaction strategies—Parallel, Alternating, and Nested—and provides detailed derivations for joint reverse-time updates, conditioning, and training objectives. Across image classification and ASR tasks, the coupled framework consistently outperforms traditional enhancement pipelines and CARD baselines, yielding improvements in accuracy and WER under diverse degradations. The results highlight a broader paradigm in which generative diffusion models act as adaptive inference engines that reshape inputs to support downstream decision-making, without altering the downstream model weights.

Abstract

Robust classification in noisy environments remains a fundamental challenge in machine learning. Standard approaches typically treat signal enhancement and classification as separate, sequential stages: first enhancing the signal and then applying a classifier. This approach fails to leverage the semantic information in the classifier's output during denoising. In this work, we propose a general, domain-agnostic framework that integrates two interacting diffusion models: one operating on the input signal and the other on the classifier's output logits, without requiring any retraining or fine-tuning of the classifier. This coupled formulation enables mutual guidance, where the progressively enhanced signal refines the class estimate and, conversely, the evolving class logits guide the signal reconstruction towards discriminative regions of the manifold. We introduce three strategies to model the joint distribution of the input and the logits. We evaluate our joint enhancement method on image classification and automatic speech recognition. The proposed framework surpasses traditional sequential enhancement baselines, delivering robust and flexible improvements in classification accuracy under diverse noise conditions.

Paper Structure (85 sections, 2 theorems, 46 equations, 5 figures, 4 tables, 7 algorithms)


Key Result

Proposition 3.1

Given a corrupted observation $\bm{x}_{cor}$ and a logit estimate $\hat{\bm{y}}_0^{i-1}$ from the previous iteration, the Alternating strategy approximately maximizes the joint conditional probability at each iteration $i$, where $\hat{\bm{x}}_0^{i}$ is the clean-signal point estimate obtained at the terminal step ($t=0$) of the signal diffusion trajectory.

Figures (5)

  • Figure 1: Comparison of noisy signal classification paradigms. $\bm{x}_{cor}$, $\bm{y}_{cor}$ denote the corrupted signal and logits; $\hat{\bm{x}}_0$, $\hat{\bm{y}}_0$ their denoised versions. (a) Noisy: direct classification. (b) Enhanced: independent signal enhancement followed by classification. (c) CARD: denoising logits, $\bm{y}$, given fixed corrupted signal, $\bm{x}_{cor}$. (d) Proposed: coupled denoising of both signal and logits.
  • Figure 2: Overview of the proposed architectures. (a) Parallel: $\bm{x}$ and $\bm{y}$ are denoised simultaneously, exchanging estimates $\hat{\bm{x}}_0$ and $\hat{\bm{y}}_0$ at each timestep $t$ for step-wise cross-conditioning. (b) Alternating: an iterative refinement strategy where complete diffusion trajectories for $\bm{x}$ and $\bm{y}$ condition each other in alternating turns. (c) Nested: a guidance mechanism where every single step of the outer $\bm{y}$ process triggers a full inner diffusion loop over $\bm{x}$.
  • Figure 3: Qualitative comparison under 30% GN. The Enhanced baseline's output is misrecognized as "2," whereas our coupled strategies (Alternating, Parallel, and Nested) use logit-space guidance to reconstruct the missing upper-left stroke, so the digit is correctly recognized as "8."
  • Figure 4: Robustness vs. Sampling Steps. Accuracy as a function of the number of diffusion steps on CIFAR-100 and ImageNet32-100 under $30\%$ GN. The Parallel strategy converges rapidly, while Nested and Alternating require more steps to saturate.
  • Figure 5: Qualitative signal enhancement examples on MNIST. Each row corresponds to a different test sample. From left to right: corrupted input (ground truth shown for reference), Enhanced baseline, Alternating, Parallel, and Nested strategies. The predicted class under each reconstruction is indicated in the figure. Our coupled strategies consistently recover digit structures that are visually clearer and semantically aligned with the true class, leading to correct predictions.
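
The three strategies in Figure 2 differ mainly in how the two reverse processes are interleaved. The following is a minimal Python sketch of that control flow only, not the authors' implementation. The names `run_trajectory`, `parallel`, `alternating`, `nested`, and `make_counting_steps` are hypothetical, the step functions are identity placeholders standing in for learned denoisers, and restarting every full trajectory from the corrupted input is a simplifying assumption. The sketch counts how many denoiser calls each schedule makes.

```python
from typing import Callable, Dict, List, Tuple

Vec = List[float]
# Hypothetical signature of one reverse step: (state, timestep, conditioning) -> next state
Step = Callable[[Vec, int, Vec], Vec]


def run_trajectory(init: Vec, cond: Vec, step: Step, T: int) -> Vec:
    """Run one complete reverse trajectory (t = T-1 ... 0) under fixed conditioning."""
    state = init
    for t in reversed(range(T)):
        state = step(state, t, cond)
    return state


def parallel(x: Vec, y: Vec, step_x: Step, step_y: Step, T: int) -> Tuple[Vec, Vec]:
    """Both processes advance together and exchange estimates at every timestep."""
    for t in reversed(range(T)):
        # The right-hand side is evaluated first, so each update is conditioned
        # on the other process's estimate from before this step.
        x, y = step_x(x, t, y), step_y(y, t, x)
    return x, y


def alternating(x_cor: Vec, y_cor: Vec, step_x: Step, step_y: Step,
                T: int, rounds: int) -> Tuple[Vec, Vec]:
    """Complete x and y trajectories condition each other in alternating turns."""
    x_hat, y_hat = x_cor, y_cor
    for _ in range(rounds):
        x_hat = run_trajectory(x_cor, y_hat, step_x, T)  # full signal pass
        y_hat = run_trajectory(y_cor, x_hat, step_y, T)  # full logit pass
    return x_hat, y_hat


def nested(x_cor: Vec, y_cor: Vec, step_x: Step, step_y: Step,
           T_outer: int, T_inner: int) -> Tuple[Vec, Vec]:
    """Every outer y step triggers a full inner diffusion loop over x."""
    x_hat, y = x_cor, y_cor
    for t in reversed(range(T_outer)):
        x_hat = run_trajectory(x_cor, y, step_x, T_inner)
        y = step_y(y, t, x_hat)
    return x_hat, y


def make_counting_steps() -> Tuple[Step, Step, Dict[str, int]]:
    """Identity placeholder steps that only record how often they are called."""
    calls = {"x": 0, "y": 0}

    def step_x(state: Vec, t: int, cond: Vec) -> Vec:
        calls["x"] += 1
        return state

    def step_y(state: Vec, t: int, cond: Vec) -> Vec:
        calls["y"] += 1
        return state

    return step_x, step_y, calls


if __name__ == "__main__":
    sx, sy, calls = make_counting_steps()
    nested([0.0, 0.0], [0.0, 0.0, 0.0], sx, sy, T_outer=4, T_inner=3)
    print(calls)  # {'x': 12, 'y': 4}
```

In this toy accounting, Parallel makes one signal call and one logit call per timestep, Alternating multiplies both by the number of rounds, and Nested multiplies the signal calls by the number of outer logit steps. Actual costs depend on the real denoisers and schedules.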

Theorems & Definitions (2)

  • Proposition 3.1
  • Proposition 3.2