SE(3)-PoseFlow: Estimating 6D Pose Distributions for Uncertainty-Aware Robotic Manipulation

Yufeng Jin, Niklas Funk, Vignesh Prasad, Zechu Li, Mathias Franzius, Jan Peters, Georgia Chalvatzaki

TL;DR

This work tackles the challenge of uncertain, multi-modal 6D object pose estimation under occlusions and symmetry by modeling full pose distributions with SE(3) flow matching. The authors introduce a probabilistic pipeline that combines dual-stream RGB-D encoders, DiT* masked cross-attention, and SE(3) velocity-field regression to sample multiple pose hypotheses $p(R,p \mid O,I)$. They propose two pose-selection schemes (model-free clustering and geometry-based re-ranking) and demonstrate how the learned distributions enable active perception and uncertainty-aware grasping in real robotic setups. Empirical results on REAL275, YCB-V, and LM-O show state-of-the-art or competitive performance for probabilistic pose estimation, with ablations highlighting the benefits of masking, RGB cues, and SDF-based scoring. The approach offers practical advantages for safe manipulation in cluttered and ambiguous environments, though future work is needed to scale to multi-object scenes and to integrate Bayesian inference over pose samples.

Abstract

Object pose estimation is a fundamental problem in robotics and computer vision, yet it remains challenging due to partial observability, occlusions, and object symmetries, which inevitably lead to pose ambiguity and multiple hypotheses consistent with the same observation. While deterministic deep networks achieve impressive performance under well-constrained conditions, they are often overconfident and fail to capture the multi-modality of the underlying pose distribution. To address these challenges, we propose a novel probabilistic framework that leverages flow matching on the SE(3) manifold for estimating 6D object pose distributions. Unlike existing methods that regress a single deterministic output, our approach models the full pose distribution with a sample-based estimate and enables reasoning about uncertainty in ambiguous cases such as symmetric objects or severe occlusions. We achieve state-of-the-art results on REAL275, YCB-V, and LM-O, and demonstrate how our sample-based pose estimates can be leveraged in downstream robotic manipulation tasks such as active perception for disambiguating uncertain viewpoints or guiding grasp synthesis in an uncertainty-aware manner.

Paper Structure

This paper contains 14 sections, 10 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: We propose an uncertainty-aware 6D object pose estimation approach based on $\mathrm{SE}(3)$ flow matching. Our probabilistic framework predicts full 6D pose distributions to handle ambiguities, enabling reliable robotic manipulation under challenging real-world conditions (partial observability, occlusions, and symmetries). SO(3) distributions are visualized on a Mollweide projection, where latitude (pitch) and longitude (roll) map the orientation, and color encodes yaw.
  • Figure 2: Overview of SE(3)-PoseFlow. Given an RGB-D input, we extract object-centric RGB crops and partial point clouds using off-the-shelf detectors. The visual and geometric features, together with timestep and sampled poses, are encoded and fused via $\text{DiT}^{\star}$ blocks with masked cross-attention to predict conditional velocity fields for SE(3) Flow Matching. The framework enables probabilistic sampling of multi-modal pose hypotheses and supports two complementary pose selection strategies: a model-free clustering approach and a model-based geometric scoring.
  • Figure 3: Illustration of the mean grasp pose velocity under pose uncertainty. EquiGraspFlow velocities, one per pose hypothesis, are averaged over the hypotheses to form a mean field, which is integrated to sample grasps that are robust to pose ambiguity (e.g., favouring top grasps for a mug with an occluded handle).
  • Figure 4: Qualitative comparison of pose estimation on the YCB-V, LM-O, and REAL275 datasets.
  • Figure 5: Uncertainty-aware grasping on a mug. Left: Occluded case with a multi-modal sample-based distribution of pose hypotheses; sampling grasps using EquiGraspFlow while marginalizing over the multiple pose hypotheses generates top-down grasps that remain valid across all pose hypotheses. Right: Non-occluded case with a unimodal distribution, i.e., all the samples agree on a single pose hypothesis; sampling grasps using EquiGraspFlow while marginalizing over the multiple pose hypotheses (which now collapse to a single pose) also produces side grasps targeting the handle.
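
The mean-field idea described in the Figure 3 caption can be sketched in a heavily simplified form. The snippet below is a toy under stated assumptions, not the authors' implementation and not EquiGraspFlow's API: grasps are reduced to 3D positions (rotation is omitted), and `toy_grasp_velocity` is a hypothetical stand-in for a learned, pose-conditioned velocity field, using the straight-line conditional flow `(target - x) / (1 - t)`. The target coordinates are invented for illustration.

```python
def toy_grasp_velocity(x, t, target):
    """Hypothetical stand-in for a learned grasp velocity field conditioned
    on one object pose hypothesis: straight-line flow toward `target`."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def mean_velocity(x, t, targets):
    """Average the per-hypothesis velocities into a single mean field."""
    vels = [toy_grasp_velocity(x, t, g) for g in targets]
    n = len(vels)
    return [sum(v[d] for v in vels) / n for d in range(len(x))]

def integrate(x0, targets, steps=20):
    """Euler-integrate the mean field over t = 0, dt, ..., 1 - dt."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t never reaches 1, so (1 - t) stays positive
        v = mean_velocity(x, t, targets)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Assumed toy data: two pose hypotheses that disagree on the handle side
# (x-axis) but agree on the grasp height (z-axis).
targets = [[0.05, 0.0, 0.12], [-0.05, 0.0, 0.12]]
grasp = integrate([0.3, 0.2, 0.4], targets)
print(grasp)
```

In this toy, the components on which the hypotheses disagree cancel in the mean field, while the shared component is preserved, so the integrated sample ends near the average of the per-hypothesis targets. A real system would integrate on SE(3) and use the learned grasp velocities instead of this linear stand-in.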