InstanceRSR: Real-World Super-Resolution via Instance-Aware Representation Alignment

Zixin Guo, Kai Zhao, Luyan Zhang

Abstract

Existing real-world super-resolution (RSR) methods based on generative priors have achieved remarkable progress in producing high-quality and globally consistent reconstructions. However, they often struggle to recover fine-grained details of diverse object instances in complex real-world scenes. This limitation primarily arises because commonly adopted denoising losses (e.g., MSE) inherently favor global consistency while neglecting instance-level perception and restoration. To address this issue, we propose InstanceRSR, a novel RSR framework that jointly models semantic information and introduces instance-level feature alignment. Specifically, we employ low-resolution (LR) images as global consistency guidance while jointly modeling image data and semantic segmentation maps to enforce semantic relevance during sampling. Moreover, we design an instance representation learning module to align the diffusion latent space with the instance latent space, enabling instance-aware feature alignment, and further incorporate a scale alignment mechanism to enhance fine-grained perception and detail recovery. Benefiting from these designs, our approach not only generates photorealistic details but also preserves semantic consistency at the instance level. Extensive experiments on multiple real-world benchmarks demonstrate that InstanceRSR significantly outperforms existing methods in both quantitative metrics and visual quality, achieving new state-of-the-art (SOTA) performance.

Paper Structure (10 sections, 6 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: t-SNE visualization of intermediate feature representations comparing StableSR and our method. Each point corresponds to a sample, color-coded by its semantic category.
  • Figure 2: Overview of the proposed InstanceRSR framework. The model integrates instance masks and representation alignment into a DiT-based pipeline, where frozen visual encoders, backbone and semantic guidance jointly enhance instance awareness.
  • Figure 3: Visual comparison on the RealSR dataset. Competing methods tend to produce geometry distortions, over-smoothing, or noisy artifacts. In contrast, our InstanceRSR restores sharp structures and fine textures with clear boundaries and artifact-free details.
  • Figure 4: Ablation study analysis. (a) Representation learning trained with intermediate features $\mathbf{f}$ extracted from different layers. (b) Effect of varying sampling steps under the default setting. (c) Impact of representation learning on pre-training efficiency and reconstruction quality.
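
The abstract and Figure 2 describe aligning intermediate diffusion features with features from frozen visual encoders at the instance level, but this page does not give the loss itself. The sketch below is one plausible reading, not the authors' formulation: it mean-pools token features inside each instance mask and penalizes low cosine similarity between the two pooled representations. The function names, the mean-pooling step, the cosine objective, and the assumption that both feature sets already share one dimension (for example, after a learned projection) are all illustrative choices.

```python
import math
from collections import defaultdict


def mean_pool_by_instance(features, instance_ids):
    """Average the token feature vectors that share an instance id."""
    sums = {}
    counts = defaultdict(int)
    for vec, inst in zip(features, instance_ids):
        if inst not in sums:
            sums[inst] = [0.0] * len(vec)
        sums[inst] = [s + v for s, v in zip(sums[inst], vec)]
        counts[inst] += 1
    return {inst: [s / counts[inst] for s in total] for inst, total in sums.items()}


def cosine(a, b, eps=1e-8):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def instance_alignment_loss(diffusion_feats, encoder_feats, instance_ids):
    """Hypothetical loss: 1 minus the mean per-instance cosine similarity.

    diffusion_feats: per-token features from an intermediate diffusion layer
    encoder_feats:   per-token features from a frozen visual encoder
    instance_ids:    one integer instance label per token (from instance masks)
    Both feature lists are assumed to have the same length and dimension.
    """
    pooled_d = mean_pool_by_instance(diffusion_feats, instance_ids)
    pooled_e = mean_pool_by_instance(encoder_feats, instance_ids)
    sims = [cosine(pooled_d[inst], pooled_e[inst]) for inst in pooled_d]
    return 1.0 - sum(sims) / len(sims)


# Toy usage: three tokens, two instances, 2-D features.
ids = [0, 0, 1]
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
loss_same = instance_alignment_loss(feats, feats, ids)
loss_flipped = instance_alignment_loss(feats, [[-x for x in v] for v in feats], ids)
```

With identical features the loss is close to 0, and with sign-flipped features it is close to 2, so minimizing it pulls each instance's pooled diffusion feature toward the direction of its encoder counterpart. In a training setup this term would presumably be added to the denoising loss with a weighting factor, which is also an assumption here.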