
BiFM: Bidirectional Flow Matching for Few-Step Image Editing and Generation

Yasong Dai, Zeeshan Hayder, David Ahmedt-Aristizabal, Hongdong Li

Abstract

Recent diffusion and flow matching models have demonstrated strong capabilities in image generation and editing by progressively removing noise through iterative sampling. While this enables flexible inversion for semantic-preserving edits, few-step sampling regimes suffer from poor forward process approximation, leading to degraded editing quality. Existing few-step inversion methods often rely on pretrained generators and auxiliary modules, limiting scalability and generalization across different architectures. To address these limitations, we propose BiFM (Bidirectional Flow Matching), a unified framework that jointly learns generation and inversion within a single model. BiFM directly estimates average velocity fields in both "image $\to$ noise" and "noise $\to$ image" directions, constrained by a shared instantaneous velocity field derived from either predefined schedules or pretrained multi-step diffusion models. Additionally, BiFM introduces a novel training strategy using continuous time-interval supervision, stabilized by a bidirectional consistency objective and a lightweight time-interval embedding. This bidirectional formulation also enables one-step inversion and can integrate seamlessly into popular diffusion and flow matching backbones. Across diverse image editing and generation tasks, BiFM consistently outperforms existing few-step approaches, achieving superior performance and editability.
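To make the "average velocity in both directions" intuition concrete, the toy sketch below (our illustration, not the authors' implementation) uses the linear rectified-flow schedule $x_t = (1-t)\,x_0 + t\,x_1$. Along this path the instantaneous velocity is $v = x_1 - x_0$, and the average velocity over any interval coincides with $v$, so a single step with the average velocity jumps exactly in either the "noise $\to$ image" or "image $\to$ noise" direction; the names `interp`, `x0`, and `x1` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)   # "image" endpoint (toy vector)
x1 = rng.standard_normal(4)   # "noise" endpoint (toy vector)

def interp(x0, x1, t):
    """Linear flow-matching interpolant x_t = (1 - t) x0 + t x1."""
    return (1 - t) * x0 + t * x1

# Instantaneous velocity of the linear path; for this schedule it also
# equals the average velocity over any time interval [s, t].
v = x1 - x0

# noise -> image: one step from t = 1 down to t = 0 using the average velocity
x_gen = interp(x0, x1, 1.0) - 1.0 * v

# image -> noise: one step from t = 0 up to t = 1 (one-step inversion)
x_inv = interp(x0, x1, 0.0) + 1.0 * v

print(np.allclose(x_gen, x0), np.allclose(x_inv, x1))
```

In BiFM the average velocity is not known in closed form but is learned by a network in both directions, constrained to agree with a shared instantaneous velocity field; the linear case above is the degenerate setting where the two coincide.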


Paper Structure

This paper contains 32 sections, 17 equations, 9 figures, 8 tables, 2 algorithms.

Figures (9)

  • Figure 1: Inversion-Based Image Editing. (a) In training-free inversion, the process is approximated by numerically reversing the generation steps, leading to accumulated approximation errors; (b) An auxiliary inversion network is introduced on top of a pretrained generator, improving fidelity but increasing complexity and reducing generalization across architectures. (c) Our method, BiFM, jointly learns generation and inversion within a single flow matching model, enabling consistent few-step inversion and editing.
  • Figure 2: Overview of BiFM. (a) Our one-step generation architecture built upon an MMDiT-based flow matching model. (b) A single MMDiT block showing how time-embedding modulation impacts the model output. (c) Naive DDIM inversion reuses the DDIM update in reverse time, causing departures from the original ODE trajectory in the few-step regime. (d) Tuning-based inversion introduces an auxiliary network $\Phi({\mathbf{x}}_{t'},t',t)$. (e) BiFM inversion (ours) learns a physically constrained bidirectional average velocity field.
  • Figure 3: Inversion and Reconstruction Quality. From left to right: original input image, PnP Inversion ju2024pnp, RF-Edit wang2025rfedit, and BiFM (ours). BiFM faithfully reconstructs image details, while RF-Edit exhibits semantic shift and PnP Inversion fails to recover fine details in the source image.
  • Figure 4: Image Editing Visualization. Given a source image, a source prompt, and a target prompt (left: the difference between source and target prompts), BiFM generates edits that follow the intended concept more faithfully while better preserving the original layout and fine details than other baselines. For example, BiFM engraves a clear lion pattern on the latte art without distorting the background, swaps the Statue of Liberty's torch for a flower without geometric distortion, and maintains the lighthouse structure.
  • Figure A: CIFAR-10 Training Epochs vs. FID
  • ...and 4 more figures