Align & Invert: Solving Inverse Problems with Diffusion and Flow-based Models via Representational Alignment

Loukas Sfountouris, Giannis Daras, Paris Giampouras

TL;DR

The paper tackles inverse problems by applying representation alignment (REPA) as an inference-time regularizer that aligns the internal representations of diffusion and flow-based models with the DINOv2 features of a proxy reconstruction. The REPA term is built from patch-wise cosine similarities, and its gradient steers sampling. Theoretical results show that REPA acts as a surrogate for an MMD divergence in DINOv2 space and yields a contraction bound that pulls the model's internal representations toward those of the clean image, which helps explain the perceptual benefits. Empirically, REPA improves perceptual fidelity on super-resolution, inpainting, and deblurring, integrates with multiple inverse-problem solvers, and reduces the number of discretization steps they need.
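As a rough illustration of what a patch-wise cosine-similarity alignment score looks like, here is a minimal NumPy sketch. The function name `patchwise_cosine_alignment`, the array shapes, and the random stand-in features are assumptions made for this example; it also assumes the model features have already been mapped to the encoder's feature dimension, and it is not the paper's implementation.

```python
import numpy as np

def patchwise_cosine_alignment(model_feats, target_feats, eps=1e-8):
    """Mean cosine similarity between matching patch features.

    Both inputs have shape (num_patches, dim). Row i of each array is
    assumed to describe the same image patch (hypothetical layout).
    """
    a = model_feats / (np.linalg.norm(model_feats, axis=1, keepdims=True) + eps)
    b = target_feats / (np.linalg.norm(target_feats, axis=1, keepdims=True) + eps)
    # Per-patch cosine similarity, averaged over all patches.
    return float(np.mean(np.sum(a * b, axis=1)))

# Random stand-ins for encoder features of a proxy reconstruction.
rng = np.random.default_rng(0)
target = rng.normal(size=(16, 8))

aligned = patchwise_cosine_alignment(target, target)    # identical features
opposed = patchwise_cosine_alignment(-target, target)   # sign-flipped features
```

The score is close to 1 when every patch feature points in the same direction as its target and close to -1 when the directions are reversed, so maximizing it (or minimizing its negative) rewards alignment regardless of feature magnitude.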

Abstract

Enforcing alignment between the internal representations of diffusion or flow-based generative models and those of pretrained self-supervised encoders has recently been shown to provide a powerful inductive bias, improving both convergence and sample quality. In this work, we extend this idea to inverse problems, where pretrained generative models are employed as priors. We propose applying representation alignment (REPA) between diffusion or flow-based models and a pretrained self-supervised visual encoder, such as DINOv2, to guide the reconstruction process at inference time. Although ground-truth signals are unavailable in inverse problems, we show that aligning model representations with approximate target features can substantially enhance reconstruction fidelity and perceptual realism. We provide theoretical results showing (a) the relation between the REPA regularization and a divergence measure in the DINOv2 embedding space, and (b) how REPA updates steer the model's internal representations toward those of the clean image. These results offer insights into the role of REPA in improving perceptual fidelity. Finally, we demonstrate the generality of our approach by integrating it into multiple state-of-the-art inverse problem solvers. Extensive experiments on super-resolution, box inpainting, Gaussian deblurring, and motion deblurring confirm that our method consistently improves reconstruction quality across tasks, while also providing substantial efficiency gains by reducing the number of required discretization steps without compromising the performance of the underlying solver.
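To make the idea of combining measurement matching with an alignment signal concrete, the following toy sketch takes one gradient step on a data-misfit term minus a weighted cosine-alignment reward. Everything here is an assumption chosen for illustration: the linear forward operator `A`, the linear map `W` standing in for an encoder, the noisy `x_proxy` standing in for an approximate reconstruction, and the values of `lam` and `step`. It is not the paper's algorithm and involves no diffusion or flow model.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_grad(u, v):
    """Gradient of cosine(u, v) with respect to u."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return v / (nu * nv) - (u @ v) * u / (nu**3 * nv)

def objective(x, A, y, W, target_feat, lam):
    """Data misfit minus a weighted feature-alignment reward."""
    residual = A @ x - y
    return 0.5 * float(residual @ residual) - lam * cosine(W @ x, target_feat)

def guided_step(x, A, y, W, target_feat, lam, step):
    """One gradient-descent step on the combined objective."""
    grad_data = A.T @ (A @ x - y)                        # measurement matching
    grad_align = W.T @ cosine_grad(W @ x, target_feat)   # alignment signal
    return x - step * (grad_data - lam * grad_align)

rng = np.random.default_rng(0)
n, m, d = 8, 4, 6
A = rng.normal(size=(m, n)) / np.sqrt(n)      # toy forward operator (assumed)
W = rng.normal(size=(d, n))                   # toy linear "encoder" (assumed)
x_true = rng.normal(size=n)
y = A @ x_true                                # noiseless toy measurement
x_proxy = x_true + 0.1 * rng.normal(size=n)   # stand-in approximate reconstruction
target_feat = W @ x_proxy                     # features of the proxy, not of x_true

x0 = rng.normal(size=n)
x1 = guided_step(x0, A, y, W, target_feat, lam=0.5, step=0.01)
```

Note that the alignment target is computed from the proxy rather than from the ground truth, which mirrors the setting described in the abstract where clean signals are unavailable at inference time.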

Paper Structure

This paper contains 18 sections, 4 theorems, 71 equations, 11 figures, 5 tables, 1 algorithm.

Key Result

Proposition 1

Assume $x \sim p_X$ is a ground-truth image and $\hat{x} \sim p_{\hat{X}\mid Y}$ is its corresponding reconstruction, where $p_{\hat{X}\mid Y}$ is the distribution induced by any reconstruction method. Assume further that all DINOv2 feature embeddings are $\ell_2$-normalized. Let $\bar{x}$ be a proxy … and the empirical mean DINOv2 feature embedding … Then the expected $\textsc{Repa}$ alignment satisfies …
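The proposition relies on $\ell_2$-normalized embeddings and mean feature embeddings. The sketch below does not reproduce its bound; it only checks two standard identities that make such a connection plausible: for unit vectors, cosine similarity equals $1 - \tfrac{1}{2}\lVert a - b\rVert^2$, and the squared MMD under a linear kernel equals the squared distance between mean embeddings. The random features and the choice of a linear kernel are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def l2_normalize(feats):
    """Scale each row to unit Euclidean norm."""
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

def linear_mmd_squared(feats_p, feats_q):
    """Biased squared MMD with the linear kernel k(a, b) = a . b."""
    k_pp = feats_p @ feats_p.T
    k_qq = feats_q @ feats_q.T
    k_pq = feats_p @ feats_q.T
    return float(k_pp.mean() + k_qq.mean() - 2.0 * k_pq.mean())

# Random stand-ins for embeddings of clean images and their reconstructions.
rng = np.random.default_rng(0)
feats_clean = l2_normalize(rng.normal(size=(32, 16)))
feats_recon = l2_normalize(feats_clean + 0.3 * rng.normal(size=(32, 16)))

# Identity 1: linear-kernel MMD^2 is the squared gap between mean embeddings.
mean_gap = feats_clean.mean(axis=0) - feats_recon.mean(axis=0)
mmd2 = linear_mmd_squared(feats_clean, feats_recon)

# Identity 2: for unit vectors, cosine similarity = 1 - 0.5 * squared distance.
cos = np.sum(feats_clean * feats_recon, axis=1)
sq_dist = np.sum((feats_clean - feats_recon) ** 2, axis=1)
```

Under these assumptions, raising the average cosine similarity between paired embeddings lowers their average squared distance, which in turn limits how far apart the two mean embeddings can be.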

Figures (11)

  • Figure 1: Overview of our proposed framework. Left: Box inpainting (top row) and Gaussian deblurring (bottom row) results, where adding $\textsc{Repa}$ (third column) improves the perceptual quality of reconstructed images compared to the baseline without $\textsc{Repa}$ (second column). Right: Alignment mechanism between diffusion-model features and pretrained DINOv2 embeddings, combined with measurement matching. An approximate reconstructed image is provided as input to the DINOv2 encoder for computing the alignment signal.
  • Figure 2: Qualitative comparison of Latent DPS with and without the proposed REPA regularizer on two inverse problems: (a) box inpainting and (b) Gaussian deblurring.
  • Figure 3: Comparison of LPIPS as a function of discretization steps for Latent DPS and Latent DPS + REPA. With our regularizer, comparable performance is achieved with substantially fewer steps.
  • Figure 4: Qualitative comparison for motion deblurring. Each column shows the measurement, DPS, Latent DAPS, ReSample + REPA, and the ground-truth image.
  • Figure 5: Similarity of representations under increasing levels of corruption.
  • ...and 6 more figures

Theorems & Definitions (7)

  • Definition 1: Misalignment between DINOv2 and diffusion representations
  • Remark
  • Definition 2: Proxy approximation error
  • Proposition 1
  • Proposition 2
  • Proposition 2
  • Proposition 3