LATINO-PRO: LAtent consisTency INverse sOlver with PRompt Optimization

Alessio Spagnoletti, Jean Prost, Andrés Almansa, Nicolas Papadakis, Marcelo Pereyra

TL;DR

The paper addresses ill-posed inverse imaging problems by using latent diffusion priors, encoded as latent consistency models (latent CMs), within a Plug & Play Langevin framework. It introduces LATINO, a gradient-free, memory-efficient sampler that uses a stochastic auto-encoder built from a latent CM to sample from p(x|y,c) with few neural function evaluations (NFEs), enabling high-resolution reconstructions. Building on this, LATINO-PRO adds prompt self-calibration via marginal maximum likelihood estimation, using a stochastic approximation proximal gradient scheme to refine the text conditioning and improve posterior inference. Empirical results on FFHQ and AFHQ demonstrate state-of-the-art reconstruction quality with substantially fewer NFEs and lower memory usage than existing latent DIS methods, supporting the practical value of zero-shot, prompt-guided inverse solving with CM priors.

Abstract

Text-to-image latent diffusion models (LDMs) have recently emerged as powerful generative models with great potential for solving inverse problems in imaging. However, leveraging such models in a Plug & Play (PnP), zero-shot manner remains challenging because it requires identifying a suitable text prompt for the unknown image of interest. Also, existing text-to-image PnP approaches are highly computationally expensive. We herein address these challenges by proposing a novel PnP inference paradigm specifically designed for embedding generative models within stochastic inverse solvers, with special attention to Latent Consistency Models (LCMs), which distill LDMs into fast generators. We leverage our framework to propose LAtent consisTency INverse sOlver (LATINO), the first zero-shot PnP framework to solve inverse problems with priors encoded by LCMs. Our conditioning mechanism avoids automatic differentiation and reaches SOTA quality in as few as 8 neural function evaluations. As a result, LATINO delivers remarkably accurate solutions and is significantly more memory-efficient and computationally efficient than previous approaches. We then embed LATINO within an empirical Bayesian framework that automatically calibrates the text prompt from the observed measurements by marginal maximum likelihood estimation. Extensive experiments show that prompt self-calibration greatly improves estimation, allowing LATINO with PRompt Optimization to define new SOTAs in image reconstruction quality and computational efficiency. The code is available at https://latino-pro.github.io

Paper Structure

This paper contains 21 sections, 17 equations, 5 figures, 2 tables, 2 algorithms.

Figures (5)

  • Figure 1: Qualitative comparison of LATINO-PRO on the FFHQ-1024 validation set. Tasks: $\times 32$ super-resolution, Gaussian deblurring ($\sigma=20.0$ pixels), and motion deblurring.
  • Figure 2: One step of the LATINO solver, a discretization of the Langevin SDE (\ref{eq:Langevin}) that targets the posterior $p({\bm{x}}|{\bm{y}}, c)$. The current iterate ${\bm{x}}_k$ is encoded by the VAE encoder and propagated forward via a noising diffusion kernel $p({\bm{z}}_t|{\bm{z}}_0)$. This process is then reversed via the latent consistency model and the VAE decoder, followed by the proximal operator, which incorporates the likelihood $p({\bm{y}}|{\bm{x}})$.
  • Figure 3: SAE applied to images in and out of distribution for different values of $t$, illustrating contraction towards $p({\bm{x}}|c)$.
  • Figure 4: Comparison of image restorations. Samples taken from AFHQ-512. Prompts: a sharp photo of a dog (resp. a cat).
  • Figure 5: Qualitative comparison of image restoration results. Samples taken from FFHQ-512. Prompt: a sharp photo of a face.
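
The step described in the Figure 2 caption (encode, noise, reverse with the consistency model, decode, then apply a proximal operator) can be illustrated with a toy NumPy sketch. This is not the authors' implementation. It assumes a linear Gaussian likelihood ${\bm{y}} = A{\bm{x}} + \text{noise}$ with standard deviation `sigma`, a variance-preserving noising kernel with a fixed `alpha_bar_t`, and a step size `delta`. The names `encode`, `decode`, and `toy_consistency_fn` are hypothetical stand-ins (identity maps and a simple shrinkage rule) used in place of the VAE and the latent consistency model.

```python
import numpy as np


def prox_gaussian_likelihood(u, y, A, sigma, delta):
    """Return argmin_x ||y - A x||^2 / (2 sigma^2) + ||x - u||^2 / (2 delta).

    Closed form for a linear Gaussian likelihood (an assumption of this sketch).
    """
    n = A.shape[1]
    lhs = A.T @ A / sigma**2 + np.eye(n) / delta
    rhs = A.T @ y / sigma**2 + u / delta
    return np.linalg.solve(lhs, rhs)


def latino_step(x, y, A, sigma, delta, alpha_bar_t,
                encode, decode, consistency_fn, rng):
    """One encode -> noise -> denoise -> decode -> prox step (toy version)."""
    z0 = encode(x)                                   # image -> latent
    eps = rng.standard_normal(z0.shape)
    # Forward noising kernel p(z_t | z_0), variance-preserving form (assumed).
    z_t = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps
    z0_hat = consistency_fn(z_t, alpha_bar_t)        # jump back to a clean latent
    u = decode(z0_hat)                               # latent -> image
    return prox_gaussian_likelihood(u, y, A, sigma, delta)


# Hypothetical stand-ins for the VAE and the latent consistency model.
def encode(x):
    return x


def decode(z):
    return z


def toy_consistency_fn(z_t, alpha_bar_t):
    # Posterior mean of z_0 given z_t when z_0 ~ N(0, I); a toy denoiser only.
    return np.sqrt(alpha_bar_t) * z_t


rng = np.random.default_rng(0)
n = 8
A = np.eye(n)[::2]                  # toy operator: observe every other entry
x_true = rng.standard_normal(n)
sigma = 0.1
y = A @ x_true + sigma * rng.standard_normal(A.shape[0])

x = np.zeros(n)
for _ in range(8):                  # a handful of steps, chosen arbitrarily
    x = latino_step(x, y, A, sigma, delta=1.0, alpha_bar_t=0.5,
                    encode=encode, decode=decode,
                    consistency_fn=toy_consistency_fn, rng=rng)
```

In this sketch the likelihood enters only through the closed-form proximal step, so no gradient is propagated through the encoder, decoder, or denoiser. This mirrors the abstract's statement that the conditioning mechanism avoids automatic differentiation.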

Theorems & Definitions (1)

  • Definition 1: Consistency function
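
The body of Definition 1 is not reproduced in this listing. For orientation, the standard definition from the consistency-model literature is given here, written in the latent notation of the Figure 2 caption. The symbols $f$, $\epsilon$, and $T$ are assumptions of this note and may differ from the paper's own statement. Given a probability-flow ODE trajectory $\{{\bm{z}}_t\}_{t \in [\epsilon, T]}$, the consistency function maps every point on the trajectory to its starting point:

$$
f({\bm{z}}_t, t) = {\bm{z}}_\epsilon \quad \text{for all } t \in [\epsilon, T].
$$

Two properties follow directly: the boundary condition $f({\bm{z}}_\epsilon, \epsilon) = {\bm{z}}_\epsilon$, and self-consistency, $f({\bm{z}}_t, t) = f({\bm{z}}_{t'}, t')$ for any $t, t' \in [\epsilon, T]$ on the same trajectory.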