Generative 6D Pose Estimation via Conditional Flow Matching

Amir Hamza; Davide Boscaini; Weihang Li; Benjamin Busam; Fabio Poiesi

TL;DR

This work tackles instance-level 6D pose estimation under challenging conditions like object symmetries and occlusions. It reframes the problem as conditional flow matching in $\mathbb{R}^3$ and introduces Flose, a three-stage pipeline that fuses overlap-aware geometry with semantic features from a Vision Foundation Model (DINOv2) to condition a denoising flow, followed by RANSAC-based registration and ICP refinement. The approach achieves state-of-the-art AR gains on five BOP datasets, including a notable +4.5 AR improvement over strong per-dataset baselines, while reducing training and inference costs compared to per-object models. By coupling appearance cues with robust outlier filtering, Flose demonstrates improved robustness to symmetries and occlusions and offers a controllable accuracy–efficiency trade-off via the number of denoising steps.

Abstract

Existing methods for instance-level 6D pose estimation typically rely on neural networks that either directly regress the pose in $\mathrm{SE}(3)$ or estimate it indirectly via local feature matching. The former struggle with object symmetries, while the latter fail in the absence of distinctive local features. To overcome these limitations, we propose a novel formulation of 6D pose estimation as a conditional flow matching problem in $\mathbb{R}^3$. We introduce Flose, a generative method that infers object poses via a denoising process conditioned on local features. While prior approaches based on conditional flow matching perform denoising solely based on geometric guidance, Flose integrates appearance-based semantic features to mitigate ambiguities caused by object symmetries. We further incorporate RANSAC-based registration to handle outliers. We validate Flose on five datasets from the established BOP benchmark. Flose outperforms prior methods with an average improvement of +4.5 Average Recall. Project website: https://tev-fbk.github.io/Flose/
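To make the denoising formulation concrete, the following is a minimal sketch of Euler integration of a flow in $\mathbb{R}^3$, moving Gaussian-noised points at t=1 to aligned positions at t=0. It is an illustration under stated assumptions, not the authors' implementation: the straight-line path between noise and target, the function names `euler_denoise` and `make_oracle_velocity`, and the toy target points are all hypothetical. In Flose the velocity would come from a learned network conditioned on local features; here an oracle that already knows the target stands in for it.

```python
import random


def euler_denoise(x1, velocity, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=1 down to t=0 with Euler steps."""
    x = [list(p) for p in x1]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = 1.0 - k * dt  # current time, always > 0
        v = velocity(x, t)
        # Step backwards in time: x(t - dt) = x(t) - dt * v(x(t), t)
        x = [[xi - dt * vi for xi, vi in zip(p, vp)] for p, vp in zip(x, v)]
    return x


def make_oracle_velocity(x0):
    """Hypothetical stand-in for a learned network.

    Assumes the straight path x(t) = (1 - t) * x0 + t * x1, whose velocity
    x1 - x0 can be written as (x(t) - x0) / t.
    """
    def velocity(x, t):
        return [[(xi - ti) / t for xi, ti in zip(p, q)] for p, q in zip(x, x0)]
    return velocity


rng = random.Random(0)
# Toy target positions (assumed for illustration only).
target = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
# Gaussian-noised starting points at t = 1.
noise = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in target]

aligned = euler_denoise(noise, make_oracle_velocity(target), num_steps=10)
max_err = max(abs(a - b) for p, q in zip(aligned, target) for a, b in zip(p, q))
```

With the oracle velocity the path is exactly straight, so any number of steps recovers the target. With a learned, imperfect velocity field the step count would instead trade accuracy against inference time.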

Paper Structure (5 sections, 4 figures, 1 table)

This paper contains 5 sections, 4 figures, 1 table.

Introduction
Related work
Method
Experiments
Conclusions

Figures (4)

Figure 1: Overview of Flose. Given the query object point cloud $\mathcal{Q}$ and an RGBD image $\textbf{I}$ as input (left), Flose estimates the 6D pose $(\hat{\textbf{R}}, \hat{\textbf{t}})$ (bottom right) through three stages: feature encoding (red), generative denoising (blue) and pose estimation (green). Feature encoding: an overlap-aware encoder $\Phi_\Theta$ and an appearance-aware encoder $\Gamma$ produce per-point descriptors, which are fused to produce $\textbf{F}^\mathcal{Q}, \textbf{F}^\mathcal{T}$. Colors encode feature similarity: corresponding regions share similar colors (overlap-awareness), while semantically distinct parts differ (appearance-awareness). Generative denoising: a generative network $\Psi_\Omega$, conditioned on $\textbf{F}^\mathcal{Q}, \textbf{F}^\mathcal{T}$, learns a displacement field that iteratively denoises the Gaussian-noised X(1) into an aligned position X(0). Pose estimation: the 6D pose $(\hat{\textbf{R}}, \hat{\textbf{t}})$ is recovered via a RANSAC-based Kabsch solver followed by ICP refinement.
Figure 2: Sensitivity of the Inlier Ratio (IR) to the spatial threshold $\tau$. IR is the percentage of points in $\hat{\mathcal{T}}$ whose distance to the corresponding points in $\mathcal{T}^r$ is smaller than $\tau$. At strict thresholds, the majority ($>80\%$) of correspondences are outliers.
Figure 3: Qualitative comparison of Flose (center) vs. an RPF-based baseline [sun2025rpf] adapted for pose estimation (right). By integrating semantic features and outlier-robust registration, Flose predicts more accurate poses under severe occlusions (rows 1-2) and resolves symmetry ambiguities where pure geometric methods fail (rows 3-4).
Figure 4: Ablation study on LM-O: (a) Impact of the conditioning features on the flow matching process, measured in terms of AR (top) and IR gain relative to a baseline using only overlap-aware features (bottom); (b) Comparison of pose solvers (SVD vs. RANSAC) and the effect of ICP-based refinement. Hyperparameter study on LM-O: (c) Pose accuracy and inference time as a function of Euler integration steps.
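
The Inlier Ratio (IR) defined in the Figure 2 caption can be computed as in the minimal sketch below: the percentage of predicted points whose distance to the corresponding reference point is smaller than the threshold $\tau$. The function name `inlier_ratio`, the toy coordinates, and the threshold values are assumptions chosen for illustration; they are not taken from the paper.

```python
import math


def inlier_ratio(predicted, reference, tau):
    """Percentage of predicted points closer than tau to their reference point.

    predicted[i] and reference[i] are assumed to be corresponding 3D points.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        return 0.0
    inliers = sum(
        1 for p, r in zip(predicted, reference) if math.dist(p, r) < tau
    )
    return 100.0 * inliers / len(predicted)


# Toy correspondences (hypothetical values, arbitrary units).
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
predicted = [(0.001, 0.0, 0.0), (1.0, 0.02, 0.0), (0.0, 1.0, 0.2), (0.5, 0.0, 1.0)]

ir_strict = inlier_ratio(predicted, reference, tau=0.005)
ir_loose = inlier_ratio(predicted, reference, tau=0.05)
```

In this toy example the point-wise errors are 0.001, 0.02, 0.2 and 0.5, so tightening `tau` lowers the ratio. This is the same sensitivity to the threshold that motivates an outlier-robust solver such as RANSAC before refinement.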
