Disentangled Textual Priors for Diffusion-based Image Super-Resolution

Lei Jiang; Xin Liu; Xinze Tong; Zhiliang Li; Jie Liu; Jie Tang; Gangshan Wu

TL;DR

DTPSR is a diffusion-based SR framework that introduces disentangled textual priors along two complementary dimensions: spatial hierarchy (global vs. local) and frequency semantics (low- vs. high-frequency).

Abstract

Image Super-Resolution (SR) aims to reconstruct high-resolution images from degraded low-resolution inputs. While diffusion-based SR methods offer powerful generative capabilities, their performance heavily depends on how semantic priors are structured and integrated into the generation process. Existing approaches often rely on entangled or coarse-grained priors that mix global layout with local details, or conflate structural and textural cues, thereby limiting semantic controllability and interpretability. In this work, we propose DTPSR, a novel diffusion-based SR framework that introduces disentangled textual priors along two complementary dimensions: spatial hierarchy (global vs. local) and frequency semantics (low- vs. high-frequency). By explicitly separating these priors, DTPSR enables the model to simultaneously capture scene-level structure and object-specific details with frequency-aware semantic guidance. The corresponding embeddings are injected via specialized cross-attention modules, forming a progressive generation pipeline that reflects the semantic granularity of visual content, from global layout to fine-grained textures. To support this paradigm, we construct DisText-SR, a large-scale dataset containing approximately 95,000 image-text pairs with carefully disentangled global, low-frequency, and high-frequency descriptions. To further enhance controllability and consistency, we adopt a multi-branch classifier-free guidance strategy with frequency-aware negative prompts to suppress hallucinations and semantic drift. Extensive experiments on synthetic and real-world benchmarks show that DTPSR achieves high perceptual quality, competitive fidelity, and strong generalization across diverse degradation scenarios.
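
The abstract names a multi-branch classifier-free guidance strategy with frequency-aware negative prompts, but this page does not state the formula. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes one noise prediction conditioned on the negative prompts (`eps_neg`), one prediction per positive branch (global, low-frequency, high-frequency), and an additive combination with per-branch weights. The function name `multi_branch_cfg`, the branch keys, and the weight values are all hypothetical.

```python
from typing import Dict, List, Sequence


def multi_branch_cfg(
    eps_neg: Sequence[float],
    eps_cond: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Combine per-branch noise predictions around a negative-prompt baseline.

    Assumed form (not taken from the paper):
        eps = eps_neg + sum_k w_k * (eps_k - eps_neg)
    where k ranges over the guidance branches.
    """
    guided = list(eps_neg)
    for name, eps_k in eps_cond.items():
        w = weights[name]
        for i, (cond, neg) in enumerate(zip(eps_k, eps_neg)):
            # Push the prediction away from the negative-prompt output
            # and toward this branch's conditional output.
            guided[i] += w * (cond - neg)
    return guided


# Toy two-element "noise predictions" for illustration only.
example = multi_branch_cfg(
    eps_neg=[0.0, 1.0],
    eps_cond={"global": [1.0, 1.0], "lf": [0.0, 2.0]},
    weights={"global": 2.0, "lf": 0.5},
)
```

Setting every weight to zero returns the negative-prompt prediction unchanged, which makes the role of each branch weight easy to inspect in isolation.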

Paper Structure (16 sections, 13 equations, 4 figures, 7 tables)

Introduction
Related Works
Denoising Diffusion Probabilistic Models
Diffusion-Based Super-Resolution
Text-Guided Diffusion for Image SR
Method
Motivation and Overview
Design of the DTPSR Framework
Construction of the DisText-SR Dataset
Multi-branch Classifier-Free Guidance
Experiments
Experimental Setting
Comparison with State-of-the-Arts
Computational Efficiency
Ablation Study
...and 1 more section

Figures (4)

Figure 1: Comparison under severe degradation. Without textual priors, the diffusion model suffers from hallucinations, generating human-like artifacts or misinterpreting walls as ocean textures. Incorporating our disentangled textual priors suppresses such errors and enhances semantic consistency.
Figure 2: The overall architecture of DTPSR. Given an LR image, a global prior and local object priors are extracted, where local priors are disentangled into Low-Frequency (LF) descriptions (shape, layout, color) and High-Frequency (HF) descriptions (texture, edges, details). They are encoded into $\mathbf{e}_g$, $\mathbf{E}_{lf}$, and $\mathbf{E}_{hf}$ via CLIP, while the LR image is encoded into $\mathbf{f}_{lr}$ by an image encoder. The diffusion process sequentially updates the latent $\mathbf{z}_t$ through GTCA, LFCA, and HFCA—yielding intermediate representations $\mathbf{z}_t^g$, $\mathbf{z}_t^{lf}$, and $\mathbf{z}_t^{hf}$—and then fuses $\mathbf{f}_{lr}$ via LRCA to produce $\mathbf{z}_{t-1}$, progressively restoring structures and details.
Figure 3: DisText-SR dataset construction. Given an LR image, we use a segmentation model to extract regions $S_1, S_2, \dots, S_n$, and query a frozen MLLM to generate a global description $c_g$ and per-segment low-frequency ($c_{lf}^{(i)}$) and high-frequency ($c_{hf}^{(i)}$) texts. They are encoded via CLIP into Global, Low-Frequency (LF) and High-Frequency (HF) embeddings for semantic-guided SR.
Figure 4: Qualitative comparison with representative SR methods. Our DTPSR reconstructs sharper textures and more semantically aligned details, especially under complex degradations, compared to both GAN-based (e.g., BSRGAN, Real-ESRGAN) and diffusion-based (e.g., FaithDiff, SUPIR) approaches. Zoom in for better visual comparison. More qualitative comparisons can be found in Sec. 4 of the supplementary material.
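
The ordering described in the Figure 2 caption (GTCA, then LFCA, then HFCA, then LRCA) can be sketched as a chain of cross-attention updates. This is a toy illustration under strong assumptions: attention is single-head with no learned projections, each stage adds a simple residual, and the denoising network around these modules is omitted. The names `cross_attend` and `dtpsr_step_sketch` are hypothetical and do not come from the paper.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows are tokens, columns are feature channels


def softmax(row: List[float]) -> List[float]:
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(z: Matrix, ctx: Matrix) -> Matrix:
    """Residual cross-attention: queries from z, keys and values from ctx."""
    d = len(z[0])
    out = []
    for q in z:
        # Scaled dot-product scores of this latent token against each context token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in ctx]
        w = softmax(scores)
        attended = [sum(wj * k[i] for wj, k in zip(w, ctx)) for i in range(d)]
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out


def dtpsr_step_sketch(
    z_t: Matrix, e_g: List[float], E_lf: Matrix, E_hf: Matrix, f_lr: Matrix
) -> Matrix:
    z_g = cross_attend(z_t, [e_g])    # GTCA: global text prior
    z_lf = cross_attend(z_g, E_lf)    # LFCA: low-frequency local priors
    z_hf = cross_attend(z_lf, E_hf)   # HFCA: high-frequency local priors
    return cross_attend(z_hf, f_lr)   # LRCA: fuse LR image features; stands in for z_{t-1}


# With a single context token per stage, each stage simply adds that token.
z_next = dtpsr_step_sketch(
    z_t=[[0.0, 0.0]],
    e_g=[1.0, 0.0],
    E_lf=[[0.0, 1.0]],
    E_hf=[[1.0, 1.0]],
    f_lr=[[2.0, 2.0]],
)
```

The point of the sketch is only the ordering: the latent is conditioned on the global prior first, then on local low-frequency and high-frequency priors, and the LR features are fused last.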
