Fine-Tuning Flow Matching via Maximum Likelihood Estimation of Reconstructions
Zhaoyi Li, Jingtao Ding, Yong Li, Shihua Li
TL;DR
The paper tackles the training–inference gap in Flow Matching (FM) by introducing Maximum Likelihood Estimation (MLE) of reconstructions, which fine-tunes FM directly on reconstruction errors and is made possible by FM's smooth ODE formulation. It derives a theoretical link between training loss and inference error under Lipschitz conditions and proposes two variants: straightforward MLE fine-tuning, and a residual variant that enforces contraction and input-to-state stability (ISS) through dedicated network architectures. The proposed methods improve inference performance in both image generation (FID improvements on CIFAR-10) and robotic manipulation (higher success rates on Push-T, Franka Kitchen, and Robomimic), with evidence that contraction-aware designs yield better robustness. Collectively, the work advances precise, robust flow-based generation by aligning training objectives with inference outcomes and incorporating stability guarantees via contraction analysis, with potential benefits for single-step inference and interpretable latent dynamics.
Abstract
The Flow Matching (FM) algorithm achieves remarkable results in generative tasks, especially in robotic manipulation. Building upon the foundations of diffusion models, the simulation-free paradigm of FM enables simple and efficient training, but it inherently introduces a train-inference gap: the model's output cannot be assessed during the training phase. In contrast, other generative models, including Variational Autoencoders (VAEs), Normalizing Flows, and Generative Adversarial Networks (GANs), directly optimize a reconstruction loss. This gap is particularly evident in scenarios that demand high precision, such as robotic manipulation. Moreover, we show that FM's over-pursuit of straight predefined paths may introduce serious problems, such as stiffness, into the system. These observations motivate us to fine-tune FM via Maximum Likelihood Estimation (MLE) of reconstructions, an approach made feasible by FM's underlying smooth ODE formulation, in contrast to the stochastic differential equations (SDEs) used in diffusion models. This paper first theoretically analyzes the relation between training loss and inference error in FM. We then propose a method for fine-tuning FM via MLE of reconstructions, which includes both straightforward and residual-based fine-tuning approaches. Furthermore, through specifically designed architectures, residual-based fine-tuning can incorporate the contraction property into the model, which is crucial for the model's robustness and interpretability. Experimental results in image generation and robotic manipulation verify that our method reliably improves the inference performance of FM.
