FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution

Seungho Choi, Jeahun Sung, Jihyong Oh

TL;DR

FRAMER addresses the low-frequency (LF) bias of diffusion-based real-world image super-resolution (Real-ISR) and the depth-wise 'low-first, high-later' frequency progression through frequency-aligned self-distillation with adaptive modulation. A final-layer teacher supervises the intermediate layers: features are decomposed into LF and high-frequency (HF) bands, an Intra Contrastive Loss (IntraCL) is applied to LF and an Inter Contrastive Loss (InterCL) to HF, and both are modulated by Frequency-based Adaptive Weight (FAW) and Frequency-based Alignment Modulation (FAM). The framework is training-time only and plug-and-play: it leverages diffusion priors from backbones such as SD2 and SD3 without changing inference. Across multiple Real-ISR benchmarks and backbones, FRAMER delivers consistent gains in both distortion and perceptual quality, and ablations support the necessity of the final-layer teacher and of random-layer negatives for HF refinement.

Abstract

Real-image super-resolution (Real-ISR) seeks to recover HR images from LR inputs with mixed, unknown degradations. While diffusion models surpass GANs in perceptual quality, they under-reconstruct high-frequency (HF) details due to a low-frequency (LF) bias and a depth-wise "low-first, high-later" hierarchy. We introduce FRAMER, a plug-and-play training scheme that exploits diffusion priors without changing the backbone or inference. At each denoising step, the final-layer feature map teaches all intermediate layers. Teacher and student feature maps are decomposed into LF/HF bands via FFT masks to align supervision with the model's internal frequency hierarchy. For LF, an Intra Contrastive Loss (IntraCL) stabilizes globally shared structure. For HF, an Inter Contrastive Loss (InterCL) sharpens instance-specific details using random-layer and in-batch negatives. Two adaptive modulators, Frequency-based Adaptive Weight (FAW) and Frequency-based Alignment Modulation (FAM), reweight per-layer LF/HF signals and gate distillation by current similarity. Across U-Net and DiT backbones (e.g., Stable Diffusion 2, 3), FRAMER consistently improves PSNR/SSIM and perceptual metrics (LPIPS, NIQE, MANIQA, MUSIQ). Ablations validate the final-layer teacher and random-layer negatives.
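The LF/HF decomposition via FFT masks described in the abstract can be illustrated with a minimal NumPy sketch. The radial mask shape, the `cutoff` value, and the function name `split_lf_hf` are assumptions made for illustration; the abstract does not specify the mask design, so this should not be read as the authors' implementation.

```python
import numpy as np

def split_lf_hf(feat, cutoff=0.25):
    """Split a (C, H, W) feature map into LF and HF parts with a radial FFT mask.

    `cutoff` is a hypothetical parameter: the LF radius as a fraction of the
    Nyquist frequency (0.5 cycles per sample).
    """
    _, h, w = feat.shape
    # Per-axis frequencies in cycles per sample, laid out as np.fft.fft2 expects.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    # 1 inside the low-frequency disc, 0 elsewhere; HF uses the complement.
    lf_mask = (radius <= cutoff * 0.5).astype(float)

    spec = np.fft.fft2(feat, axes=(-2, -1))
    # The mask is symmetric in frequency, so the inverse transforms are real
    # up to rounding error; .real drops the residual imaginary part.
    lf = np.fft.ifft2(spec * lf_mask, axes=(-2, -1)).real
    hf = np.fft.ifft2(spec * (1.0 - lf_mask), axes=(-2, -1)).real
    return lf, hf
```

Because the two masks are complementary, `lf + hf` reconstructs the input feature map, so the split only routes existing content into two bands; it discards nothing. Under the scheme in the abstract, the same split would be applied to both the final-layer (teacher) and intermediate-layer (student) feature maps.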

Paper Structure

This paper contains 29 sections, 10 equations, 11 figures, 7 tables, 2 algorithms.

Figures (11)

  • Figure 1: Qualitative comparison with recent Real-ISR methods on real-world images. Our FRAMER models produce sharper edges and richer details, leading to more visually natural and faithful restoration results. More qualitative results are provided in Supplementary Sec. \ref{sec:supp_qualitative_results}.
  • Figure 2: Band-wise magnitude densities with shared bins. For each feature map, we compute the 2D FFT and collect magnitudes $\lvert F\rvert$ within LF and HF rings. We plot mean $\pm\sigma$ densities over samples for $\log(1+\lvert F\rvert)$ using common bin edges (HF: red or yellow, LF: blues). LF magnitudes span a broader and heavier range, whereas HF magnitudes concentrate narrowly near small values, indicating LF dominance that biases unified training toward LF and undertrains HF details. All statistics are computed on the 100-image DIV2K \cite{agustsson2017ntire} test set. Densities integrate to 1; any right-edge spike is due to percentile clipping used only for visualization.
  • Figure 3: Layer-wise cosine similarity of LF and HF feature maps in U-Net \cite{ronneberger2015u} (dotted line) and DiT \cite{peebles2023scalable} (solid line). (a) Low-noise timestep ($t{=}300$), (b) high-noise timestep ($t{=}700$). Using the final-layer feature map as reference, LF similarity converges faster in earlier layers, whereas HF similarity rises abruptly in later layers. This reveals a “low-first, high-later” depth-wise hierarchy (i.e., an LF bias), motivating our layer-adaptive, frequency-aware training strategy. (For comparability, layer depth is normalized to [0, 1].)
  • Figure 4: FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors (inspired by Sec. \ref{sec:observations}). (a) Framework Overview. During training, from a high-resolution image $R$, we create $I_{LR}$ by random degradation \cite{wang2021real}, downsampling, and resizing back to the size of $R$. We use LLaVA \cite{liu2023llava} to generate a caption. The diffusion backbone (U-Net \cite{ronneberger2015u} or DiT \cite{peebles2023scalable}) takes $I_{LR}$, noise $Z_T$, and the caption as inputs. FRAMER adds auxiliary loss components only during training, jointly using IntraCL (Sec. \ref{sec:intracl}) and InterCL (Sec. \ref{sec:intercl}), modulated by FAW (Sec. \ref{sec:faw}) and FAM (Sec. \ref{sec:fam}). At inference, it uses the original backbone without any modification, making it fully plug-and-play. (b) Intra Contrastive Loss. Within a single image, pull $F^{(i)}_{\mathrm{LF}}$ toward $F^{(n)}_{\mathrm{LF}}$ and push it away from a randomly sampled same-image layer $F^{(j)}_{\mathrm{LF}}$ (no in-batch negatives). (c) Inter Contrastive Loss. Attract $F^{(i)}_{\mathrm{HF}}$ to $F^{(n)}_{\mathrm{HF}}$ and repel a random-layer negative $F^{(j)}_{\mathrm{HF}}$ (same image) and in-batch negatives $x^{-}$ (other images).
  • Figure 5: Visualization of feature-map similarity matrices across training samples in different frequency bands (brighter/redder indicates higher similarity). (a) LF exhibits strong cross-sample similarity, reflecting shared structural information and motivating the use of IntraCL (Sec. \ref{sec:intracl}) for stabilizing global structure learning. (b) HF shows weak cross-sample similarity and strong sample-specific variation, justifying the use of InterCL (Sec. \ref{sec:intercl}) to promote fine-grained, instance-level discrimination without over-sharpening. Detailed descriptions of the training samples are in Sec. \ref{sec:settings}.
  • ...and 6 more figures
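
The pull/push structure of the two contrastive losses in the Figure 4 caption can be sketched as follows. This is a minimal sketch, not the paper's formulation: the InfoNCE-style form, the temperature `tau`, the function names, and the assumption that all layer features share one shape are illustrative choices. Using the other images' HF features as in-batch negatives is likewise an assumption about what $x^{-}$ contains. The FAW and FAM modulators are omitted because their exact form is not given here.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature maps, flattened to vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def contrastive_loss(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style loss (assumed form): -log softmax of the positive logit."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = np.array(sims) / tau
    logits = logits - logits.max()  # shift for numerical stability
    return float(np.log(np.exp(logits).sum()) - logits[0])

def intra_cl(lf_layers, i, j, tau=0.1):
    """LF: pull layer i toward the final layer, push from same-image layer j."""
    return contrastive_loss(lf_layers[i], lf_layers[-1], [lf_layers[j]], tau)

def inter_cl(hf_layers, i, j, other_images_hf, tau=0.1):
    """HF: same pull, with a random-layer negative plus in-batch negatives."""
    negatives = [hf_layers[j]] + list(other_images_hf)
    return contrastive_loss(hf_layers[i], hf_layers[-1], negatives, tau)
```

In this sketch the only difference between the two losses is the negative set: `intra_cl` uses a single same-image layer, while `inter_cl` also adds features from other images, which matches the caption's "no in-batch negatives" note for the LF branch.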