Linear combinations of latents in generative models: subspaces and beyond

Erik Bodin, Alexandru Stere, Dragos D. Margineantu, Carl Henrik Ek, Henry Moss

TL;DR

The paper tackles the challenge of controllably manipulating latent variables in generative models by proposing Latent Optimal Linear combinations (LOL), a closed-form transform that maps any linear combination of seed latents to the target latent distribution via a Monge optimal transport map. LOL enables robust interpolation, centroid computation, and the construction of expressive low-dimensional latent subspaces in a model-agnostic fashion, contingent on seed latents passing distribution tests that assess their alignment with the latent prior. Empirically, LOL outperforms or matches baseline interpolation methods in semantic preservation with significantly faster runtimes and demonstrates model-agnostic subspace construction across diffusion and flow-matching models. The work highlights the importance of distributional compatibility of seed latents and opens avenues for tailoring distribution tests and extending LOL to broader latent distributions.

Abstract

Sampling from generative models has become a crucial tool for applications like data synthesis and augmentation. Diffusion, Flow Matching and Continuous Normalising Flows have shown effectiveness across various modalities, and rely on latent variables for generation. For experimental design or creative applications that require more control over the generation process, it has become common to manipulate the latent variable directly. However, existing approaches for performing such manipulations (e.g. interpolation or forming low-dimensional representations) only work well in special cases or are network or data-modality specific. We propose Latent Optimal Linear combinations (LOL) as a general-purpose method to form linear combinations of latent variables that adhere to the assumptions of the generative model. As LOL is easy to implement and naturally addresses the broader task of forming any linear combinations, e.g. the construction of subspaces of the latent space, LOL dramatically simplifies the creation of expressive low-dimensional representations of high-dimensional objects.

Paper Structure (27 sections, 8 theorems, 66 equations, 18 figures, 4 tables)

Key Result

Lemma 1

Let $\bm{z}$ be a transformed variable defined through the transport map $\mathcal{T}_{\{w\}}$ applied to a linear combination of independent latent variables $\bm{x}_k \sim p(\bm{x})$. Then $\bm{z} \sim p(\bm{x})$; that is, the transformed variable follows the target latent distribution.
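
The lemma's defining equations do not appear above, so the following is only a minimal sketch under a stated assumption: that the latent prior is Gaussian, $p(\bm{x}) = \mathcal{N}(\bm{\mu}, \bm{\Sigma})$, as for the diffusion and flow matching models in the figures. Under that assumption a weighted sum $\bm{y} = \sum_k w_k \bm{x}_k$ of independent draws follows $\mathcal{N}(\alpha\bm{\mu}, \beta^2\bm{\Sigma})$ with $\alpha = \sum_k w_k$ and $\beta = \sqrt{\sum_k w_k^2}$, and the affine map $\bm{z} = \bm{\mu} + (\bm{y} - \alpha\bm{\mu})/\beta$, which is the optimal transport map between these two Gaussians under quadratic cost, returns it to $\mathcal{N}(\bm{\mu}, \bm{\Sigma})$. The names `lol_combine` and `sample_variance` are illustrative and not taken from the paper.

```python
import math
import random


def lol_combine(latents, weights, mu=None):
    """Map the linear combination sum_k w_k * x_k back to N(mu, Sigma).

    Assumption: the x_k are independent draws from one Gaussian N(mu, Sigma).
    Then y = sum_k w_k x_k ~ N(alpha * mu, beta^2 * Sigma), and
    z = mu + (y - alpha * mu) / beta is again distributed as N(mu, Sigma).
    """
    dim = len(latents[0])
    if mu is None:
        mu = [0.0] * dim  # assumption: zero-mean prior unless given
    alpha = sum(weights)
    beta = math.sqrt(sum(w * w for w in weights))
    if beta == 0.0:
        raise ValueError("weights must not all be zero")
    z = []
    for i in range(dim):
        y_i = sum(w * x[i] for w, x in zip(weights, latents))
        z.append(mu[i] + (y_i - alpha * mu[i]) / beta)
    return z


def sample_variance(v):
    """Unbiased sample variance over the entries of one latent vector."""
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)


random.seed(0)
dim = 10_000
seeds = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3)]
w = [1 / 3, 1 / 3, 1 / 3]  # centroid weights

# Plain averaging shrinks the per-entry variance towards 1/3 ...
naive = [sum(wk * x[i] for wk, x in zip(w, seeds)) for i in range(dim)]
# ... while the transformed centroid keeps it close to 1.
centroid = lol_combine(seeds, w)

print(sample_variance(naive), sample_variance(centroid))
```

The same function covers interpolation (weights `[1 - t, t]`) and subspace coordinates (arbitrary weights); with one-hot weights it returns the corresponding seed latent unchanged.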

Figures (18)

  • Figure 1: Low-dimensional latent subspaces. A 5-dimensional subspace from the flow matching model Stable Diffusion 3 esser2024scaling extracted using LOL (left) from the latents corresponding to images $\bm{x}_1, \dots, \bm{x}_5$. The left plot shows generations from uniform grid points across an axis-aligned slice of the subspace coordinate system, centered around the coordinate for $\bm{x}_1$. Each coordinate in the subspace corresponds to a linear combination of the latents, which define the basis vectors. The right plot shows the corresponding subspace without the proposed LOL transformation. See Figure \ref{fig:rocking_chair}, Figure \ref{fig:shapenet_subspace} and Section \ref{sec:additional_qual} in the appendix for additional examples.
  • Figure 2: Centroid determination. Generation using Stable Diffusion 2.1 rombach2022high from the centroid of the latents corresponding to images $\bm{x}_1$, $\bm{x}_2$, $\bm{x}_3$ using different methods. Note that our proposed method removes several artifacts, such as unrealistic headlights and chassis texture.
  • Figure 3: Likelihood and norm insufficient. Column (1) of each panel shows an image generated using a random sample from the associated Gaussian latent distribution for the diffusion model of rombach2022high (left side) and the flow matching model of esser2024scaling (right side). Columns (2) and (3) both show images generated from latents with the most likely norm according to their respective latent distribution. Columns (2) use the same Gaussian samples as in columns (1) but rescaled to have this norm, also yielding realistic images. Meanwhile, columns (3) show the failed generation from constant vectors $s\bm{I}$ scaled to have the most likely norm according to the latent distribution but lacking other characteristics (e.g. not having all-equal values) that the network was trained to expect, even though their likelihood $\mathcal{N}(s\bm{I}; \bm{\mu}, \bm{\Sigma})$ is typical of real samples. Moreover, the distribution mode $\bm{\mu}$, which also lacks the needed characteristics, has vastly higher log likelihood than any realistic sample; -33875 and -135503 for the two models, respectively. See Table \ref{table:test_cases} for an example of failed generation using the mode $\bm{\mu}$.
  • Figure 4: Normality testing of latent vectors obtained from inversion. We show the LPIPS zhang2018unreasonable reconstruction errors (second row, in red) of 200 randomly selected, inverted images across 50 random classes from ImageNet1k deng2009imagenet, the p-values of their inversions (third row), and rejection rates (bottom row) of the Kolmogorov-Smirnov normality test applied to the corresponding latent obtained from inversion under various step budgets. We use the diffusion model of rombach2022high, always using its maximum number of steps (999) for generation, and denote the 10th, 50th and 90th percentiles with black lines. The first row shows image reconstructions from the inversion at each budget, highlighted in red when the latent was rejected ($p < 10^{-3}$), with the interpretation that the characteristics of the latent were unlikely for a real sample according to the KS test. We note the strong correlation between inversion budgets providing low reconstruction errors and those for which the p-values of the latents are realistic, that is, taking values likely to occur by chance for real samples. However, as we will see in Figure \ref{fig:inversion_reconstructions_not_enough}, there are still many latents with low reconstruction error yet extremely low p-value, and this often severely affects the quality of their interpolants.
  • Figure 5: Lack of normality is linked to failure of interpolants. The left and middle panels show the LPIPS zhang2018unreasonable reconstruction error (after 999 generation steps) and the Kolmogorov-Smirnov p-value for all inversions presented in Figure \ref{fig:inversion_reconstructions}, split into two plots due to the vast dynamic range of p-values. Although latents with high (realistic) p-values tend to have low reconstruction errors, there are many latents with low reconstruction errors that also have low p-values. The right panel shows Q-Align visual quality scores wu2023q for spherical interpolants between pairs of inversions selected from matching ImageNet1k deng2009imagenet image classes, demonstrating that choosing seed latents with both low reconstruction error ($< 0.05$) and high p-values ($> 10^{-3}$) allows us to avoid low-quality interpolants that would arise when choosing seeds by reconstruction error alone. For reference, we include examples of interpolants at each visual quality level.
  • ...and 13 more figures
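
The distribution test described in Figures 3 to 5 can be illustrated with a one-sample Kolmogorov-Smirnov check of a latent's entries against a standard normal. This is a sketch under assumptions, not the paper's implementation: it assumes a prior of $\mathcal{N}(\bm{0}, \bm{I})$, treats the entries of one latent as independent samples, and approximates the p-value with the asymptotic Kolmogorov series plus a common finite-sample correction. The function names are illustrative; the rejection threshold of $10^{-3}$ follows the figure captions.

```python
import math
import random


def std_normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ks_normality_pvalue(latent):
    """Approximate p-value of a one-sample KS test against N(0, 1).

    Assumption: the entries of `latent` are treated as i.i.d. samples.
    The p-value uses the asymptotic Kolmogorov series, so it is approximate.
    """
    xs = sorted(latent)
    n = len(xs)
    d = 0.0  # largest gap between the empirical and the normal CDF
    for i, x in enumerate(xs):
        cdf = std_normal_cdf(x)
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.3:
        return 1.0  # the series converges slowly here; the p-value is ~1
    total = 0.0
    for j in range(1, 101):
        total += (-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
    return min(1.0, max(0.0, 2.0 * total))


random.seed(1)
n = 4096
gaussian = [random.gauss(0.0, 1.0) for _ in range(n)]  # a sample from the prior
constant = [1.0] * n  # norm sqrt(n), close to the typical norm, all-equal entries
mode = [0.0] * n      # the distribution mode

threshold = 1e-3
for name, latent in [("gaussian", gaussian), ("constant", constant), ("mode", mode)]:
    p = ks_normality_pvalue(latent)
    print(name, "rejected" if p < threshold else "accepted")
```

The constant vector and the mode have a plausible norm or a high likelihood respectively, yet their empirical distribution is far from normal, which is the kind of mismatch a norm or likelihood check alone would miss.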

Theorems & Definitions (16)

  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • Lemma 5
  • proof
  • ...and 6 more