Diffusing Differentiable Representations

Yash Savani, Marc Finzi, J. Zico Kolter

TL;DR

The paper tackles training-free sampling of differentiable representations (diffreps) with pretrained diffusion models. It pulls the reverse-time dynamics back from image space to the diffrep parameter space, deriving a PF-ODE that includes a (JᵀJ)⁻¹ term, and uses RePaint to enforce a consistency constraint that keeps samples renderable across views. The result is true sampling rather than mode-seeking, with higher detail and diversity for images, panoramas, and NeRFs at competitive runtimes. The work extends diffusion models to multi-view and 3D-conditional generation.

Abstract

We introduce a novel, training-free method for sampling differentiable representations (diffreps) using pretrained diffusion models. Rather than merely mode-seeking, our method achieves sampling by "pulling back" the dynamics of the reverse-time process--from the image space to the diffrep parameter space--and updating the parameters according to this pulled-back process. We identify an implicit constraint on the samples induced by the diffrep and demonstrate that addressing this constraint significantly improves the consistency and detail of the generated objects. Our method yields diffreps with substantially improved quality and diversity for images, panoramas, and 3D NeRFs compared to existing techniques. Our approach is a general-purpose method for sampling diffreps, expanding the scope of problems that diffusion models can tackle.
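
As a rough illustration of what "pulling back" one step could look like numerically, the sketch below maps an image-space step `dx` to a parameter-space step `dtheta = (JᵀJ)⁻¹ Jᵀ dx` by solving a least-squares problem. The linear renderer `render`, the matrix `A`, and the random `dx` are assumptions made for this toy example; they stand in for a real diffrep and a real reverse-process step and are not the authors' implementation.

```python
import numpy as np


def render(theta, A):
    """Toy linear diffrep f(theta) = A @ theta (hypothetical, for illustration only)."""
    return A @ theta


def pulled_back_step(J, dx):
    """Map an image-space step dx to parameter space.

    For a full-column-rank Jacobian J, the least-squares solution of
    J @ dtheta = dx equals (J^T J)^{-1} J^T dx.
    """
    dtheta, *_ = np.linalg.lstsq(J, dx, rcond=None)
    return dtheta


rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))   # Jacobian of the toy diffrep: 6 "pixels", 3 parameters
theta = rng.normal(size=3)    # current diffrep parameters
dx = rng.normal(size=6)       # stand-in for one image-space reverse-process step

dtheta = pulled_back_step(A, dx)

# Same quantity written with the explicit (J^T J)^{-1} J^T formula.
explicit = np.linalg.solve(A.T @ A, A.T @ dx)

# The change in the render is the part of dx that the diffrep can express;
# the leftover residual is orthogonal to the columns of J.
rendered_change = render(theta + dtheta, A) - render(theta, A)
residual = dx - rendered_change
```

In this toy setting the residual is generally nonzero: a requested image-space step need not be exactly expressible by the representation, which gives some intuition for why a diffrep restricts the samples it can produce.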

Paper Structure

This paper contains 31 sections, 1 theorem, 21 equations, 7 figures, 1 table, 1 algorithm.

Key Result

Theorem B.1

Eq. thm1 is the correct forward update for the non-Markovian process.

Figures (7)

  • Figure 1: (Left) Commutative diagram showing how the PF-ODE vector field gets pulled back through $f$, respecting the differential geometry. The process involves: ① converting $\frac{dx}{dt}$ to the cotangent vector field $\nabla \log p(x)$ (up to scaling terms) with the Euclidean metric $I$, ② pulling back $\nabla \log p(x)$ via the chain rule using the Jacobian $J$, and then ③ transforming the pulled-back differential-form score function into the corresponding vector field using the inverse of the pulled-back metric $(J^\top J)^{-1}$. When used in a PF-ODE, SJC and SDS take the bottom path with the chain rule, but they do not complete the path because they neglect the $T^*\Theta \rightarrow T\Theta$ transformation. (Right) SIREN image renders generated using the PF-ODE schedule with the prompt "An astronaut riding a horse", using: the complete pulled-back $\frac{dx}{dt}$ vector field; the pulled-back covector field from SJC (omitting step ③), $J^\top \nabla \log p(x)$; and the scaled pulled-back covector field from SJC with $\lambda=0.0001$.
  • Figure 2: The parameters of the diffrep $\theta_t \in \Theta$ (torus) are used to render the noiseless signal $f(\theta_t) = \widehat{x}_0(t)$, which is then combined with the noise $\sigma(t)\epsilon$ to generate the noisy sample $x(t)$. We can pull back each step of the reverse diffusion process to update the parameters $\theta_t + \Delta\theta_t$.
  • Figure 3: The left figure contains sample renders using the prompt "A woman is standing at a crosswalk at a traffic intersection." from the reference SD (top), our method (middle), and SJC (bottom) over the CFG scales [0,3,10,30,100] from left to right. The right plot is the KID metric (closer to 0 is better) measured on the SIRENs sampled from our method, the SD reference samples, and the SIRENs sampled using SJC.
  • Figure 4: Samples generated using SD ref (top), Our method (middle), and SJC (bottom) using the same prompt "An office cubicle with four different types of computers" with eight different seeds.
  • Figure 5: Comparison of landscape panoramas sampled using our method. In each pair, the top panorama is sampled using the RePaint method, while the bottom is sampled without RePaint. Both approaches use 460 function evaluations (NFEs) to ensure fairness. The top pair uses CFG scale 3.0, and the bottom pair uses CFG scale 10.0. The prompt for these panoramas was "Landscape picture of a mountain range in the background with an empty plain in the foreground 50mm f/1.8".
  • ...and 2 more figures
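
The distinction drawn in the Figure 1 caption between the complete pulled-back vector field and the covector-only update can be checked on a toy problem. The sketch below assumes a linear diffrep with Jacobian `A` and a hypothetical rescaling of the parameters `theta = S @ phi`; both are illustrative choices, not taken from the paper. It compares the update computed directly in `theta` coordinates with the update computed in `phi` coordinates and mapped back.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 2))   # Jacobian of a toy diffrep in theta coordinates
S = np.diag([1.0, 100.0])     # hypothetical reparameterization: theta = S @ phi
v = rng.normal(size=5)        # stand-in image-space vector (e.g. dx/dt)


def full_pullback(J, v):
    """Steps 2 and 3 of the diagram: (J^T J)^{-1} J^T v."""
    return np.linalg.solve(J.T @ J, J.T @ v)


def covector_only(J, v):
    """Step 2 only: the chain-rule term J^T v, without the inverse metric."""
    return J.T @ v


J_phi = A @ S  # Jacobian with respect to phi, by the chain rule

# Updates computed in phi coordinates are mapped back with dtheta = S @ dphi.
full_theta = full_pullback(A, v)
full_via_phi = S @ full_pullback(J_phi, v)

cov_theta = covector_only(A, v)
cov_via_phi = S @ covector_only(J_phi, v)

print("full pull-back agrees across parameterizations:",
      np.allclose(full_theta, full_via_phi))
print("covector-only agrees across parameterizations: ",
      np.allclose(cov_theta, cov_via_phi))
```

In this example the full pull-back yields the same parameter update regardless of how the parameters are scaled, while the covector-only update changes with the scaling, which is one way to read the caption's phrase "respecting the differential geometry".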

Theorems & Definitions (2)

  • Theorem B.1
  • proof