The Power of Context: How Multimodality Improves Image Super-Resolution

Kangfu Mei, Hossein Talebi, Mojtaba Ardakani, Vishal M. Patel, Peyman Milanfar, Mauricio Delbracio

TL;DR

This work addresses the limitations of traditional SISR by exploiting rich multimodal context (depth, segmentation, edges, and captions) within a diffusion-based framework. It introduces MMSR, a unified, token-based multimodal conditioning approach that combines a Multimodal Latent Connector with multimodal classifier-free guidance, enabling per-modality control and reducing hallucinations. Empirical results on DIV2K-Val, RealSR, and related benchmarks show better perceptual fidelity and broader robustness than state-of-the-art text-guided SR methods, along with interpretable, fine-grained controllability. By using multimodal conditioning to reduce uncertainty in the high-resolution reconstruction, MMSR offers a practical path to more realistic and customizable image super-resolution, even when some modalities are missing or imperfect.

Abstract

Single-image super-resolution (SISR) remains challenging due to the inherent difficulty of recovering fine-grained details and preserving perceptual quality from low-resolution inputs. Existing methods often rely on limited image priors, leading to suboptimal results. We propose a novel approach that leverages the rich contextual information available in multiple modalities (including depth, segmentation, edges, and text prompts) to learn a powerful generative prior for SISR within a diffusion model framework. We introduce a flexible network architecture that effectively fuses multimodal information, accommodating an arbitrary number of input modalities without requiring significant modifications to the diffusion process. Crucially, we mitigate the hallucinations often introduced by text prompts by using spatial information from other modalities to guide regional text-based conditioning. Each modality's guidance strength can also be controlled independently, so outputs can be steered in different directions, such as increasing bokeh through depth or adjusting object prominence via segmentation. Extensive experiments demonstrate that our model surpasses state-of-the-art generative SISR methods, achieving superior visual quality and fidelity. See the project page at https://mmsr.kfmei.com/.
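
The idea of controlling each modality's guidance strength independently can be illustrated with a small sketch. It is not the paper's m-cfg formula, which this summary does not give. It assumes a common compositional form of classifier-free guidance, in which each modality contributes its own weighted offset from the unconditional noise prediction. The function name `multimodal_guidance`, the modality keys, and the toy numbers are all hypothetical.

```python
from typing import Dict, List


def multimodal_guidance(
    eps_uncond: List[float],
    eps_cond: Dict[str, List[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Combine per-modality noise predictions with independent weights.

    Assumed form (not taken from the paper):
        eps = eps_uncond + sum_m w_m * (eps_cond[m] - eps_uncond)
    """
    guided = list(eps_uncond)
    for name, eps_m in eps_cond.items():
        # A modality without a weight contributes nothing (weight 0).
        w = weights.get(name, 0.0)
        for i, (c, u) in enumerate(zip(eps_m, eps_uncond)):
            guided[i] += w * (c - u)
    return guided


# Toy two-element "noise predictions" standing in for full tensors.
eps_uncond = [0.0, 1.0]
eps_cond = {"depth": [1.0, 1.0], "text": [0.0, 3.0]}

# Strong depth guidance, weak text guidance.
out = multimodal_guidance(eps_uncond, eps_cond, {"depth": 2.0, "text": 0.5})
print(out)  # [2.0, 2.0]
```

Under this assumed form, raising one weight pushes the output further toward that modality's prediction without touching the others, and leaving a weight out falls back to the unconditional prediction for that modality.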

Paper Structure

This paper contains 12 sections, 5 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Our Multimodal Super-Resolution (MMSR) method leverages the rich context of multimodal guidance, including image captions, depth maps, semantic segmentation maps, and edges inferred from LR. MMSR surpasses state-of-the-art methods by producing more realistic results and suppressing artifacts that, while plausible, are inconsistent with the information present in the LR input.
  • Figure 2: Language models struggle to accurately represent spatial information, leading to coarse and imprecise image super-resolution. To overcome this limitation, we incorporate additional spatial modalities like depth maps and semantic segmentation maps. These modalities provide detailed spatial context, allowing our model to implicitly align language descriptions with individual pixels through a transformer network. This enriched understanding of the image significantly enhances the realism of our super-resolution results and minimizes distortion.
  • Figure 3: This diagram illustrates our multimodal super-resolution pipeline. Starting with a low-resolution (LR) image, we extract modalities like depth and semantic segmentation maps. These modalities are encoded into tokens and transformed into multimodal latent tokens ($m$). Our diffusion model uses these tokens and the LR input to generate a super-resolved (SR) output. Multimodal classifier-free guidance (m-cfg) then refines the SR image for enhanced quality.
  • Figure 4: Using discrete multimodal tokens leads to superior reconstruction of modalities compared to continuous tokens.
  • Figure 5: MMSR super-resolution results on real-world images compared with state-of-the-art methods. Zoom in to appreciate the details.
  • ...and 4 more figures
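
The captions above refer to discrete modality tokens and to a pipeline that accepts any number of modalities. A minimal sketch of what that could look like follows, assuming a nearest-neighbour codebook lookup as the tokenizer and a plain concatenation of tagged tokens as the conditioning sequence. The paper's actual tokenizer and Multimodal Latent Connector are not described in this summary, so `quantize`, `dequantize`, `build_sequence`, the codebook, and the sample values are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def quantize(values: Sequence[float], codebook: Sequence[float]) -> List[int]:
    """Map each value to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: abs(v - codebook[k]))
        for v in values
    ]


def dequantize(tokens: Sequence[int], codebook: Sequence[float]) -> List[float]:
    """Look up the codebook value for each token index."""
    return [codebook[t] for t in tokens]


def build_sequence(tokens_by_modality: Dict[str, List[int]]) -> List[Tuple[str, int]]:
    """Concatenate tokens from whichever modalities are present.

    Each token is tagged with its modality name, so a missing modality
    simply shortens the sequence instead of changing its layout.
    """
    sequence: List[Tuple[str, int]] = []
    for name in sorted(tokens_by_modality):
        sequence.extend((name, t) for t in tokens_by_modality[name])
    return sequence


# Toy 1-D "maps" with values in [0, 1] and a 3-entry scalar codebook.
codebook = [0.0, 0.5, 1.0]
depth_tokens = quantize([0.05, 0.4, 0.6, 0.95], codebook)
edge_tokens = quantize([0.9, 0.1], codebook)

print(depth_tokens)                        # [0, 1, 1, 2]
print(dequantize(depth_tokens, codebook))  # [0.0, 0.5, 0.5, 1.0]
print(build_sequence({"depth": depth_tokens, "edges": edge_tokens}))
```

In this sketch, dropping the `"edges"` entry yields a shorter but still valid sequence, which is one simple way a token-based design could tolerate missing modalities.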