FlowDet: Unifying Object Detection and Generative Transport Flows

Enis Baty, C. P. Bridges, Simon Hadfield

TL;DR

FlowDet reframes object detection as a generative transport problem using Conditional Flow Matching, replacing diffusion-based denoising with near-linear, straight transport paths that enable flexible, runtime-adjustable inference without retraining. By combining a Sparse R-CNN–style architecture with data-derived and data-dependent priors and a time-conditioned detection decoder, FlowDet achieves competitive or superior performance to DiffusionDet across COCO and LVIS, particularly in recall-constrained settings with fewer proposals. The work provides extensive ablations on priors, matching strategies, and ODE solvers, showing that simpler flow trajectories and Euler integration yield strong efficiency gains while maintaining high accuracy. Overall, FlowDet offers a practical, scalable path for diffusion-era detectors to leverage flow-based inference with controllable compute-accuracy trade-offs and broad compatibility with standard detection backbones.

Abstract

We present FlowDet, the first formulation of object detection using modern Conditional Flow Matching techniques. This work follows from DiffusionDet, which originally framed detection as a generative denoising problem in the bounding box space via diffusion. We revisit and generalise this formulation to a broader class of generative transport problems, while maintaining the ability to vary the number of boxes and inference steps without retraining. In contrast to the curved stochastic transport paths induced by diffusion, FlowDet learns simpler and straighter paths, resulting in faster scaling of detection performance as the number of inference steps grows. We find that this reformulation enables us to outperform diffusion-based detection systems (as well as non-generative baselines) across a wide range of experiments, including various precision/recall operating points using multiple feature backbones and datasets. In particular, when evaluating under recall-constrained settings, we can highlight the effects of the generative transport without over-compensating with large numbers of proposals. This provides gains of up to +3.6% AP and +4.2% AP$_{rare}$ over DiffusionDet on the COCO and LVIS datasets, respectively.
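To make the idea of straight transport paths with a runtime-adjustable number of boxes and steps concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation. It assumes the generic linear Conditional Flow Matching path $x_t = (1 - t)x_0 + t x_1$ with target velocity $x_1 - x_0$, a standard Gaussian prior, and an oracle velocity field that already knows the target box. In FlowDet the velocity would instead come from the time-conditioned detection decoder operating on image features. All function names here (`sample_prior`, `interpolate`, `target_velocity`, `make_oracle_velocity`, `euler_integrate`) are hypothetical.

```python
import random


def sample_prior(num_boxes, rng):
    """Draw initial boxes (cx, cy, w, h) from a standard Gaussian.

    Assumption: a stand-in prior, not one of the priors studied in the paper.
    """
    return [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(num_boxes)]


def interpolate(x0, x1, t):
    """Point on the linear conditional path: x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the linear path; constant in t: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def make_oracle_velocity(target_box):
    """Hypothetical stand-in for a trained network.

    On a straight path ending at target_box, the velocity at (x_t, t)
    equals (target_box - x_t) / (1 - t).
    """
    def velocity(box, t):
        return [(g - x) / (1.0 - t) for g, x in zip(target_box, box)]
    return velocity


def euler_integrate(boxes, velocity_fn, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # always < 1, so the oracle never divides by zero
        boxes = [
            [x + dt * v for x, v in zip(box, velocity_fn(box, t))]
            for box in boxes
        ]
    return boxes


if __name__ == "__main__":
    rng = random.Random(0)
    target = [0.5, 0.5, 0.2, 0.3]  # illustrative ground-truth box
    prior = sample_prior(3, rng)   # number of boxes chosen at run time
    for steps in (1, 2, 4):        # number of steps chosen at run time
        out = euler_integrate(prior, make_oracle_velocity(target), steps)
        print(steps, [round(v, 6) for v in out[0]])
```

Because the oracle path is exactly straight, Euler integration reaches the target (up to floating-point error) for any step count, including a single step. A learned velocity field is only approximately straight, which is why additional steps can still improve AP in practice.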

Paper Structure

This paper contains 30 sections, 17 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Object detection paradigms: (a) Traditional detectors are limited at inference time, predicting a single set of boxes without the ability to dynamically iterate or transport distant boxes to better predictions. (b) DiffusionDet reframed detection as a denoising process, enabling generative transport at inference time, at the cost of complex stochastic paths. (c) FlowDet generalises this generative formulation to Conditional Flow Matching, providing shorter, straighter transport paths with greater performance using fewer steps.
  • Figure 2: Architecture block diagram of FlowDet.
  • Figure 3: AP against the number of integration steps for FlowDet and DiffusionDet, plotted for $N_{eval} \in \{50, 75, 100\}$.
  • Figure 4: Qualitative examples of FlowDet on COCO. For each image, we show (from left to right): the input with ground-truth boxes (solid) and initial prior samples (faint), followed by the model’s predictions after 1, 2, and 3 integration steps. All examples use the same backbone (ResNet50) and inference settings as our main experiments, with $N_{eval} = 100$ per image.