Fine-Tuning Flow Matching via Maximum Likelihood Estimation of Reconstructions

Zhaoyi Li, Jingtao Ding, Yong Li, Shihua Li

TL;DR

The paper tackles the training–inference gap in Flow Matching (FM) by introducing Maximum Likelihood Estimation (MLE) of reconstructions to fine-tune FM directly on reconstruction errors, leveraging FM's smooth ODE formulation. It derives a theoretical link between training loss and inference error under Lipschitz conditions and proposes both straightforward MLE fine-tuning and a residual variant that enforces contraction and input-to-state stability (ISS) through dedicated network architectures. The proposed methods improve inference performance in both image generation (FID improvements on CIFAR-10) and robotic manipulation tasks (higher success rates on Push-T, Franka Kitchen, and Robomimic), with evidence that contraction-aware designs yield better robustness. Overall, the work aligns training objectives with inference outcomes and adds stability guarantees via contraction analysis, with potential benefits for single-step inference and interpretable latent dynamics.

Abstract

The Flow Matching (FM) algorithm achieves remarkable results in generative tasks, especially in robotic manipulation. Building upon the foundations of diffusion models, the simulation-free paradigm of FM enables simple and efficient training, but inherently introduces a train-inference gap: the model's output cannot be assessed during the training phase. In contrast, other generative models, including Variational Autoencoders (VAEs), Normalizing Flows, and Generative Adversarial Networks (GANs), directly optimize the reconstruction loss. Such a gap is particularly evident in scenarios that demand high precision, such as robotic manipulation. Moreover, we show that FM's over-pursuit of straight predefined paths may introduce serious problems, such as stiffness, into the system. These observations motivate us to fine-tune FM via Maximum Likelihood Estimation of reconstructions, an approach made feasible by FM's underlying smooth ODE formulation, in contrast to the stochastic differential equations (SDEs) used in diffusion models. This paper first theoretically analyzes the relation between training loss and inference error in FM. We then propose a method of fine-tuning FM via Maximum Likelihood Estimation of reconstructions, which includes both straightforward fine-tuning and residual-based fine-tuning approaches. Furthermore, through specifically designed architectures, the residual-based fine-tuning can incorporate the contraction property into the model, which is crucial for the model's robustness and interpretability. Experimental results in image generation and robotic manipulation verify that our method reliably improves the inference performance of FM.
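The contrast between the simulation-free FM objective and a reconstruction objective can be sketched on a toy problem. The following is a minimal illustration, not the authors' implementation. It assumes a hypothetical one-dimensional affine vector field `v`, paired samples `(x0, x1)`, a Gaussian likelihood (so that maximum likelihood reduces to mean squared error), an explicit Euler solver, and finite-difference gradients standing in for backpropagation through the solver. All function names and the toy data are made up for the example.

```python
import random

def v(theta, t, x):
    """Hypothetical toy vector field, affine in x and constant in t."""
    a, b = theta
    return a * x + b

def fm_loss(theta, pairs, rng):
    """Simulation-free FM loss on straight paths x_t = (1 - t) x0 + t x1.

    The regression target is the path velocity x1 - x0; no ODE is solved,
    so the model's actual output at t = 1 is never inspected.
    """
    total = 0.0
    for x0, x1 in pairs:
        t = rng.random()
        xt = (1.0 - t) * x0 + t * x1
        total += (v(theta, t, xt) - (x1 - x0)) ** 2
    return total / len(pairs)

def integrate(theta, x0, steps=20):
    """Explicit Euler solve of dx/dt = v(t, x) from t = 0 to t = 1."""
    x, h = x0, 1.0 / steps
    for k in range(steps):
        x += h * v(theta, k * h, x)
    return x

def reconstruction_loss(theta, pairs, steps=20):
    """Mean squared error between the solver output and the paired sample
    (the negative Gaussian log-likelihood up to constants, by assumption)."""
    errs = [(integrate(theta, x0, steps) - x1) ** 2 for x0, x1 in pairs]
    return sum(errs) / len(errs)

def fine_tune(theta, pairs, lr=0.05, iters=200, eps=1e-5):
    """Gradient descent on the reconstruction loss.

    Central finite differences stand in for differentiating through the solver.
    """
    theta = list(theta)
    for _ in range(iters):
        grad = []
        for i in range(len(theta)):
            up = theta.copy()
            up[i] += eps
            down = theta.copy()
            down[i] -= eps
            diff = reconstruction_loss(up, pairs) - reconstruction_loss(down, pairs)
            grad.append(diff / (2 * eps))
        theta = [p - lr * g for p, g in zip(theta, grad)]
    return tuple(theta)

# Toy paired data (an assumption for illustration): x1 = 2 * x0 + 1.
pairs = [(x0, 2.0 * x0 + 1.0) for x0 in (-1.0, -0.5, 0.0, 0.5, 1.0)]
theta_pre = (0.5, 0.5)  # stands in for a pre-trained checkpoint
theta_ft = fine_tune(theta_pre, pairs)
before = reconstruction_loss(theta_pre, pairs)
after = reconstruction_loss(theta_ft, pairs)
```

Here `fm_loss` never calls `integrate`, which is the train-inference gap in miniature, while `fine_tune` optimizes exactly the quantity that is measured at inference time. Starting from `theta_pre` rather than from scratch mirrors the idea of staying near the pre-trained solution.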

Paper Structure

This paper contains 27 sections, 6 theorems, 33 equations, 4 figures, 4 tables.

Key Result

Theorem 1

Assume that the ground-truth vector field $u_t(x)$ is Lipschitz continuous with Lipschitz constant $L_u > 0$, and that the discrepancy between the learned vector field and the truth satisfies $\|v_{\theta}(t,x)-u_t(x) \|_{\infty} \le \delta$. Then we can derive the following error estimate betwee where $M = \max\limits_{0 \leq t \leq 1} |\ddot{\psi}_t|$ is an upper bound for the second time der

Figures (4)

  • Figure 1: Fine-Tuning Your Flow: A Visual Explanation. These figures plot trajectories under different vector field models. (a) illustrates that over-pursuing straightness (blue lines) leads to discontinuities in the vector field, i.e., $f(0^+,0)\neq f(0^-,0)$. This causes the system to exhibit stiffness, which makes numerical solution significantly harder and thereby compromises the model's reliability. By comparison, the green line depicts a more stable flow. (b) plots the pre-trained FM path (blue line), the fine-tuned flow (green line), the flow fine-tuned with residuals (red line), and the flow trained entirely with MLE (purple line). An oversimplified assumption like "the straight path" can lead to underfitting (blue line). A fine-tuned model converges to a local optimum near the pre-trained model, thereby improving its fit to the samples while maintaining path simplicity (green line). By contrast, the CNF (purple line) solely fits the samples without any prior guidance on the vector field's shape, which can easily produce overly complex trajectories. The red line uses a residual fine-tuning approach that keeps the pre-trained model unchanged and employs only a residual network to learn the remaining residual components. (c) plots the variation of component $x_1$ over time. Here, compared with the blue line, the green one represents a "contracting" trajectory (Lohmiller and Slotine, 1998). When subjected to a minor disturbance $d_0$ (which may arise from stochastic noise or slight differences in external inputs), a contracting trajectory still tends to stabilize around a similar solution. Such a model demonstrates superior stability and robustness. (d) illustrates contraction in 2-dimensional space. (Blue) points in different contraction regions converge to different (red) destinations. Points in the same contraction region behave stably and robustly.
  • Figure 2: Visualization of Experimental Results. (a) plots the FID of FM (blue curve) under different training steps. The red line is the fine-tuned FID score from the checkpoint at $3.5\times10^5$ steps under the same time consumption. (b) plots the success rate (SR) of the fine-tuned policy from the checkpoint at epoch $4000$. The red vertical bars represent the variance of SR, which gradually decreases during the training process. (c) shows the task of turning on the left burner in the Franka Kitchen environment, executed by our fine-tuned FM policy within the MuJoCo simulator.
  • Figure E.1: Different Simulation Environments in MuJoCo or Gym.
  • Figure E.2: Visualization Results of generated images under the same initial noise points.
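
The contraction behavior described in Figure 1(c) can be illustrated numerically. The sketch below is a toy example, not taken from the paper: it assumes two hypothetical one-dimensional vector fields, one with derivative $\partial f/\partial x = -3 < 0$ everywhere (contracting) and one with $\partial f/\partial x = +3$ (expanding), and compares how an initial disturbance `d0` evolves under explicit Euler integration.

```python
import math

def simulate(f, x0, steps=1000, t_end=1.0):
    """Explicit Euler solve of dx/dt = f(x) on [0, t_end]; returns x(t_end)."""
    h = t_end / steps
    x = x0
    for _ in range(steps):
        x += h * f(x)
    return x

def contracting(x):
    # df/dx = -3 everywhere, so nearby trajectories approach each other.
    return -3.0 * (x - 1.0)

def expanding(x):
    # df/dx = +3 everywhere, so nearby trajectories separate.
    return 3.0 * (x - 1.0)

d0 = 0.1  # small disturbance of the initial condition

# Distance at t = 1 between the nominal and the disturbed trajectory.
gap_contracting = abs(simulate(contracting, d0) - simulate(contracting, 0.0))
gap_expanding = abs(simulate(expanding, d0) - simulate(expanding, 0.0))

# For this linear contracting field the exact gap is d0 * exp(-3 * t).
exact_contracting_gap = d0 * math.exp(-3.0)
```

In the contracting case the disturbance shrinks roughly like $d_0 e^{-3t}$, so both trajectories settle near the same solution, while in the expanding case the same disturbance grows over time.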

Theorems & Definitions (15)

  • Theorem 1
  • Remark 1
  • Remark 2
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Definition A.1
  • Proof 1
  • Lemma 1
  • Proof 1: Proof.
  • ...and 5 more