FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution

Seungho Choi; Jeahun Sung; Jihyong Oh

TL;DR

FRAMER targets two weaknesses of diffusion-based real-world image super-resolution (Real-ISR): a low-frequency (LF) bias and a depth-wise 'low-first, high-later' frequency progression, which together leave high-frequency (HF) detail under-reconstructed. It introduces frequency-aligned self-distillation with adaptive modulation: the final-layer feature map acts as teacher for the intermediate layers, features are decomposed into LF and HF bands, and an Intra Contrastive Loss (IntraCL) is applied to LF while an Inter Contrastive Loss (InterCL) is applied to HF, with both modulated by Frequency-based Adaptive Weight (FAW) and Frequency-based Alignment Modulation (FAM). The framework is training-time only and plug-and-play: it leverages diffusion priors from backbones such as SD2 and SD3 without changing inference. Across multiple Real-ISR benchmarks and backbones, FRAMER delivers consistent gains in both distortion and perceptual quality, and ablations support the final-layer teacher and the random-layer negatives used for HF refinement.

Abstract

Real-world image super-resolution (Real-ISR) seeks to recover high-resolution (HR) images from low-resolution (LR) inputs with mixed, unknown degradations. While diffusion models surpass GANs in perceptual quality, they under-reconstruct high-frequency (HF) details due to a low-frequency (LF) bias and a depth-wise "low-first, high-later" hierarchy. We introduce FRAMER, a plug-and-play training scheme that exploits diffusion priors without changing the backbone or inference. At each denoising step, the final-layer feature map teaches all intermediate layers. Teacher and student feature maps are decomposed into LF/HF bands via FFT masks to align supervision with the model's internal frequency hierarchy. For LF, an Intra Contrastive Loss (IntraCL) stabilizes globally shared structure. For HF, an Inter Contrastive Loss (InterCL) sharpens instance-specific details using random-layer and in-batch negatives. Two adaptive modulators, Frequency-based Adaptive Weight (FAW) and Frequency-based Alignment Modulation (FAM), reweight per-layer LF/HF signals and gate distillation by current similarity. Across U-Net and DiT backbones (e.g., Stable Diffusion 2 and 3), FRAMER consistently improves PSNR/SSIM and perceptual metrics (LPIPS, NIQE, MANIQA, MUSIQ). Ablations validate the final-layer teacher and random-layer negatives.
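
The LF/HF decomposition via FFT masks mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `split_frequency_bands`, the circular mask shape, and the `cutoff` radius are all assumptions made for illustration, since the abstract does not specify how the masks are built.

```python
import numpy as np


def split_frequency_bands(feat, cutoff=0.25):
    """Split a (C, H, W) feature map into low- and high-frequency parts.

    `cutoff` is a hypothetical radius in normalized frequency
    (cycles per sample; each axis spans [-0.5, 0.5)). The abstract
    does not state the mask shape or threshold.
    """
    _, h, w = feat.shape

    # Radial distance of every FFT bin from the zero-frequency (DC) bin.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)

    # Binary low-pass mask; its complement is the high-pass mask.
    lf_mask = radius <= cutoff

    # Filter each channel in the Fourier domain, then transform back.
    spectrum = np.fft.fft2(feat, axes=(-2, -1))
    lf = np.fft.ifft2(spectrum * lf_mask, axes=(-2, -1)).real
    hf = np.fft.ifft2(spectrum * ~lf_mask, axes=(-2, -1)).real
    return lf, hf


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher_feat = rng.standard_normal((4, 16, 16))  # stand-in feature map
    lf, hf = split_frequency_bands(teacher_feat)
    # The two masks are complementary, so the bands sum back to the input.
    print(np.allclose(lf + hf, teacher_feat))
```

Because the two masks are complementary, the bands partition the spectrum and add back to the original feature map, so teacher and student features can be compared band by band without discarding information. In a training setup the same split would be applied to both the final-layer (teacher) and intermediate-layer (student) features before the LF and HF losses are computed.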
