Optimal Control Meets Flow Matching: A Principled Route to Multi-Subject Fidelity

Eric Tillmann Bill, Enis Simsar, Thomas Hofmann

TL;DR

Multi-subject prompts in text-to-image (T2I) generation exhibit attribute leakage, identity entanglement, and subject omissions. The authors cast subject disentanglement as a stochastic optimal control (SOC) problem over a trained flow matching (FM) sampler, yielding a principled objective and two practical algorithms: a test-time controller that perturbs the base velocity, and Adjoint Matching (AM) fine-tuning, which regresses a control network to a backward adjoint signal. Both are complemented by FOCUS, a probabilistic attention loss that enforces localized, non-overlapping subject attention. The formulation unifies prior attention heuristics, extends to diffusion backbones through a flow-diffusion correspondence, and consistently improves multi-subject fidelity on Stable Diffusion 3.5, FLUX, and SDXL while preserving base style. Test-time control provides fast gains with modest overhead, while fine-tuning delivers stronger improvements that generalize to unseen prompts from a small set of training prompts. Together, these contributions establish a principled, architecture-agnostic route to reliable multi-subject composition in T2I models.

Abstract

Text-to-image (T2I) models excel on single-entity prompts but struggle with multi-subject descriptions, often showing attribute leakage, identity entanglement, and subject omissions. We introduce the first theoretical framework with a principled, optimizable objective for steering sampling dynamics toward multi-subject fidelity. Viewing flow matching (FM) through stochastic optimal control (SOC), we formulate subject disentanglement as control over a trained FM sampler. This yields two architecture-agnostic algorithms: (i) a training-free test-time controller that perturbs the base velocity with a single-pass update, and (ii) Adjoint Matching, a lightweight fine-tuning rule that regresses a control network to a backward adjoint signal while preserving base-model capabilities. The same formulation unifies prior attention heuristics, extends to diffusion models via a flow-diffusion correspondence, and provides the first fine-tuning route explicitly designed for multi-subject fidelity. Empirically, on Stable Diffusion 3.5, FLUX, and Stable Diffusion XL, both algorithms consistently improve multi-subject alignment while maintaining base-model style. Test-time control runs efficiently on commodity GPUs, and fine-tuned controllers trained on limited prompts generalize to unseen ones. We further highlight FOCUS (Flow Optimal Control for Unentangled Subjects), which achieves state-of-the-art multi-subject fidelity across models.
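
To make the idea of test-time control concrete, the following is a toy sketch of adding a control term to a frozen base velocity during Euler integration. It is not the paper's algorithm: `base_velocity`, `cost_grad`, the quadratic cost, the gain `lam`, and the deterministic Euler scheme are all illustrative assumptions. In the paper the objective is built from attention maps rather than a distance to a target point.

```python
import numpy as np

def base_velocity(x, t):
    # Hypothetical stand-in for a trained FM velocity field (assumption):
    # a simple field that pulls every sample toward the origin.
    return -x

def cost_grad(x, target):
    # Gradient of the placeholder cost 0.5 * ||x - target||^2 (assumption).
    return x - target

def sample(x0, lam, target, steps=50):
    """Euler-integrate dx/dt = v(x, t) + u(x), with u = -lam * grad cost."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        u = -lam * cost_grad(x, target)          # control perturbation
        x = x + dt * (base_velocity(x, t) + u)   # base velocity stays frozen
    return x

target = np.array([1.0, 1.0])
x0 = np.array([2.0, -1.0])
uncontrolled = sample(x0, lam=0.0, target=target)
controlled = sample(x0, lam=4.0, target=target)
print("distance without control:", np.linalg.norm(uncontrolled - target))
print("distance with control:   ", np.linalg.norm(controlled - target))
```

Setting `lam=0` recovers the base sampler exactly, which is the sense in which the control is a perturbation of the base dynamics rather than a replacement for them.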

Paper Structure

This paper contains 37 sections, 1 theorem, 38 equations, 12 figures, 7 tables.

Key Result

Lemma B.1

Let $P = \{{\bm{p}}^{(1)}, \dots, {\bm{p}}^{(n)}\} \subset\Delta^{d-1}$ be a set of probability distributions. Then, $D_\mathrm{JS}(P)$ is upper bounded by $\log n$.
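
The bound can be checked numerically with a short sketch. It assumes that $D_\mathrm{JS}(P)$ denotes the uniform-weight generalized Jensen-Shannon divergence, that is, the entropy of the average distribution minus the average of the individual entropies, with natural logarithms. This excerpt does not restate the definition, so treat that reading as an assumption. The helper names `entropy` and `js_divergence` are illustrative.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, using the convention 0 * log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def js_divergence(dists):
    """Uniform-weight generalized JS divergence of the rows of `dists`."""
    P = np.asarray(dists, dtype=float)
    mixture = P.mean(axis=0)
    return entropy(mixture) - float(np.mean([entropy(p) for p in P]))

rng = np.random.default_rng(0)
n, d = 4, 6
random_P = rng.dirichlet(np.ones(d), size=n)   # n random points on the simplex
disjoint_P = np.eye(d)[:n]                     # n one-hot rows, disjoint supports
identical_P = np.tile(random_P[0], (n, 1))     # n copies of one distribution

print("log n      :", np.log(n))
print("random set :", js_divergence(random_P))
print("disjoint   :", js_divergence(disjoint_P))
print("identical  :", js_divergence(identical_P))
```

Under the assumed definition, the divergence equals the mutual information between a uniformly drawn index and a sample from the indexed distribution, which cannot exceed the entropy of the index, $\log n$. Distributions with pairwise disjoint supports attain the bound, and identical distributions give zero.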

Figures (12)

  • Figure 1: Optimal control makes flow matching models reliable on multi-subject prompts. Using FOCUS at test time or via fine-tuning yields faithful multi-subject compositions with correct attributes, minimal leakage, and no omissions, while preserving base style.
  • Figure 2: Extracted cross-attention maps for both subjects in FLUX.1 [dev].
  • Figure 3: Qualitative results with test-time control on Stable Diffusion 3.5 and FLUX.1. Each heuristic is shown at its optimal $\lambda$. Additional examples appear in \ref{fig:otf_sd3} and \ref{fig:otf_flux} of the Appendix.
  • Figure 4: Qualitative results after fine-tuning Stable Diffusion 3.5 and FLUX.1. Each heuristic uses its optimal hyperparameters. Additional examples appear in \ref{fig:fine_sd3} and \ref{fig:fine_flux} of the Appendix.
  • Figure 5: Transfer to SDXL.
  • ...and 7 more figures

Theorems & Definitions (2)

  • Lemma B.1: Upper Bound of Jensen-Shannon Divergence
  • proof