DACESR: Degradation-Aware Conditional Embedding for Real-World Image Super-Resolution

Xiaoyan Lei, Wenlong Zhang, Biao Luo, Hui Liang, Weifeng Cao, Qiuting Lin

TL;DR

This paper revisits the capability of the Recognize Anything Model on degraded images by calculating text similarity, and proposes a Real Embedding Extractor (REE) that achieves a significant gain in recognition performance on degraded image content through contrastive learning.

Abstract

Multimodal large models have shown excellent ability in real-world image super-resolution by using language classes as conditioning information, yet their ability on degraded images remains limited. In this paper, we first revisit the capability of the Recognize Anything Model (RAM) on degraded images by calculating text similarity. We find that directly fine-tuning RAM in the degraded space with contrastive learning struggles to achieve acceptable results. To address this issue, we use a degradation selection strategy to build a Real Embedding Extractor (REE), which achieves a significant gain in recognition performance on degraded image content through contrastive learning. Furthermore, we use a Conditional Feature Modulator (CFM) to inject the high-level information from REE into a powerful Mamba-based network, which can leverage effective pixel information to restore image textures and produce visually pleasing results. Extensive experiments demonstrate that the REE effectively helps image super-resolution networks balance fidelity and perceptual quality, highlighting the great potential of Mamba in real-world applications. The source code of this work will be made publicly available at: https://github.com/nathan66666/DACESR.git
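
To make the contrastive-learning idea in the abstract concrete, the following is a minimal sketch of a generic InfoNCE-style loss that pulls the embedding of a degraded image toward the embedding of its clean counterpart. The abstract states only that REE is trained with contrastive learning; the specific loss, the temperature value, the pairing scheme, and the names `l2_normalize` and `info_nce_loss` are assumptions made for illustration and are not taken from the paper or its code.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce_loss(degraded_embs, clean_embs, temperature=0.07):
    """Mean InfoNCE loss (hypothetical helper, not the paper's loss).

    Row i of degraded_embs is treated as the positive match for row i of
    clean_embs; every other clean row in the batch acts as a negative.
    """
    d = [l2_normalize(v) for v in degraded_embs]
    c = [l2_normalize(v) for v in clean_embs]
    total = 0.0
    for i, di in enumerate(d):
        # Cosine similarity to every clean embedding, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(di, cj)) / temperature for cj in c]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching index i as the target class.
        total += log_sum_exp - logits[i]
    return total / len(d)

# Toy batch: two clean embeddings and two slightly perturbed "degraded" ones.
clean = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each degraded row is close to its clean row
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each degraded row is close to the wrong row

loss_aligned = info_nce_loss(aligned, clean)
loss_swapped = info_nce_loss(swapped, clean)
print(loss_aligned < loss_swapped)
```

The loss is small when each degraded embedding is closest to its own clean embedding and grows when it is closer to another sample's embedding, which is the behavior a degradation-robust extractor would be trained toward under this assumption.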

Paper Structure (28 sections, 8 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: The tag representations of RAM on clean images and images with varying levels of degradation. "Similarity" refers to the Jaccard similarity (Eq. (\ref{eq:sim})).
  • Figure 2: Comparison of text output accuracy across RAM under different types and intensities of degradation. (a) Blur: The x-axis represents isotropic Gaussian blur, where larger values indicate stronger blurring. (b) JPEG: The x-axis denotes JPEG compression levels, with lower values indicating higher compression. (c) Noise: The x-axis represents the intensity of additive Gaussian noise, where higher values correspond to increased noise levels. In (d), the classification is based on the text similarity values of RAM for different degraded outputs, which are evenly divided into four categories in descending order. Each category contains multiple types/levels of degradation.
  • Figure 3: The overview of DACESR.
  • Figure 4: The training pipeline of the Real Embedding Extractor (REE).
  • Figure 5: The LAM results of different model architectures across various types of degradation. LAM attribution indicates the significance of each pixel in the input LR image during the reconstruction process of the patch highlighted by a box. The Diffusion Index (DI) denotes the extent of pixel involvement. A higher DI indicates a broader range of utilized pixels.
  • ...and 3 more figures
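
The Jaccard similarity named in the Figure 1 caption can be sketched as follows for two sets of predicted tags. This is the standard definition, intersection size divided by union size; the paper's own equation may apply different preprocessing, and the function name `jaccard_similarity`, the lowercasing step, the empty-set convention, and the example tags are assumptions made for illustration.

```python
def jaccard_similarity(tags_a, tags_b):
    """Jaccard similarity |A & B| / |A | B| between two collections of tags."""
    # Normalize case and whitespace so "Dog " and "dog" count as the same tag
    # (an assumption; the paper may compare tags differently).
    set_a = {t.strip().lower() for t in tags_a}
    set_b = {t.strip().lower() for t in tags_b}
    union = set_a | set_b
    if not union:
        return 1.0  # convention chosen here: two empty tag sets are identical
    return len(set_a & set_b) / len(union)

# Hypothetical tag outputs for a clean image and a degraded copy of it.
clean_tags = ["dog", "grass", "ball", "park"]
degraded_tags = ["dog", "grass", "blur"]

score = jaccard_similarity(clean_tags, degraded_tags)
print(score)  # 2 shared tags out of 5 distinct tags
```

A score of 1 means the degraded image produced exactly the same tags as the clean image, and the score falls toward 0 as degradation causes tags to be dropped or replaced.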