SCFlow2: Plug-and-Play Object Pose Refiner with Shape-Constraint Scene Flow

Qingyuan Wang, Rui Song, Jiaojiao Li, Kerui Cheng, David Ferstl, Yinlin Hu

TL;DR

SCFlow2 refines 6D object poses for novel objects without retraining. It combines a 3D scene flow representation, RGBD depth regularization, and a 3D shape prior in an end-to-end trainable, plug-and-play framework. It builds a 4D correlation volume from RGB and depth features, uses a GRU-based predictor to estimate a dense SE(3) transformation field, and derives a global pose residual from that field. The updated pose then yields a pose-induced flow that guides the next refinement iteration, with the target's shape prior constraining the search. Trained on ShapeNet, Google-Scanned-Objects, and Objaverse, SCFlow2 achieves state-of-the-art accuracy on seven BOP datasets with novel objects, and its inference (~0.18 s per pose) is fast compared with multi-hypothesis refinement methods. Ablations show that both the shape prior and the 3D scene flow representation are necessary, and the method generalizes well to novel objects, which makes it practical for real-world pose estimation systems.
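As a rough illustration of the 4D correlation volume mentioned in the summary, the sketch below computes an all-pairs dot-product correlation between two small feature maps. It is a minimal pure-Python sketch, not the authors' implementation: the function names `correlation_volume` and `best_match`, the scaling by the square root of the channel count, and the idea that RGB and depth features have already been merged into one per-pixel feature vector are all assumptions made for the example.

```python
import math

def correlation_volume(feat_ref, feat_real):
    """All-pairs correlation between two H x W x C feature maps (nested lists).

    Returns a 4D nested list with
    corr[i][j][k][l] = <feat_ref[i][j], feat_real[k][l]> / sqrt(C).
    The 1/sqrt(C) scaling is an assumption, not taken from the source.
    """
    channels = len(feat_ref[0][0])
    scale = 1.0 / math.sqrt(channels)
    return [
        [
            [
                [scale * sum(a * b for a, b in zip(f_ref, f_real))
                 for f_real in row_real]
                for row_real in feat_real
            ]
            for f_ref in row_ref
        ]
        for row_ref in feat_ref
    ]

def best_match(corr, i, j):
    """Return the (row, col) of the real-image pixel that correlates most
    strongly with reference pixel (i, j)."""
    scores = corr[i][j]
    candidates = [(k, l) for k in range(len(scores)) for l in range(len(scores[0]))]
    return max(candidates, key=lambda kl: scores[kl[0]][kl[1]])

# Toy 2x2 feature maps with 2 channels (hypothetical values).
# The "real" map is the reference map with its two columns swapped.
feat_ref = [[[1.0, 0.0], [0.0, 1.0]],
            [[-1.0, 0.0], [0.0, -1.0]]]
feat_real = [[[0.0, 1.0], [1.0, 0.0]],
             [[0.0, -1.0], [-1.0, 0.0]]]

corr = correlation_volume(feat_ref, feat_real)
match_00 = best_match(corr, 0, 0)  # where reference pixel (0, 0) moved to
match_11 = best_match(corr, 1, 1)  # where reference pixel (1, 1) moved to
```

In a real network this would be a single batched matrix multiply over learned features; the nested loops here only make the indexing explicit. A recurrent refiner does not take the argmax as above but looks up a local window of the volume around the current flow estimate at each iteration.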

Abstract

We introduce SCFlow2, a plug-and-play refinement framework for 6D object pose estimation. Most recent 6D object pose methods rely on refinement to get accurate results. However, most existing refinement methods either suffer from noise in establishing correspondences or rely on retraining for novel objects. SCFlow2 is based on the SCFlow model, which was designed for refinement with a shape constraint, but formulates the additional depth as a regularization in the iteration via 3D scene flow for RGBD frames. The key design of SCFlow2 is the introduction of geometry constraints into the training of the recurrent matching network, by combining the rigid-motion embeddings in 3D scene flow with the 3D shape prior of the target. We train SCFlow2 on a combination of the Objaverse, GSO, and ShapeNet datasets, and evaluate on BOP datasets with novel objects. When our method is used as a post-processing step, most state-of-the-art methods produce significantly better results, without any retraining or fine-tuning. The source code is available at https://scflow2.github.io.

Paper Structure

This paper contains 13 sections, 3 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Object pose refinement is critical for accurate object pose estimation. (a) Most existing object pose refinement methods, including SCFlow [hai2023scflow], rely on retraining for novel objects to achieve high accuracy. (b) The proposed SCFlow2 achieves even higher accuracy, and more importantly, generalizes well to novel objects without any retraining or fine-tuning.
  • Figure 2: Design overview of SCFlow and SCFlow2. Given the object 3D mesh, we render an image $I_1$ and depth map $D_1$ based on an initial pose, and then use networks to compare these rendered outputs with the real input $I_2$ and $D_2$ to refine the pose. (a) Although SCFlow [hai2023scflow] adds a 3D shape constraint into the optimization loop, it formulates the matching process as a pure 2D problem, which is less effective in capturing 3D motions. In addition, it cannot work with RGBD images. A common practice is to use RANSAC-Kabsch [kabsch1976kabsch] to consume the additional depth as a second stage, which, however, is only locally optimal within each stage. (b) SCFlow2 tackles these problems. We introduce an intermediate representation based on 3D scene flow to capture 3D motions in network optimization. Furthermore, we embed depth into the loop by formulating depth as an additional regularization to guide the correlation look-up iteratively, producing an end-to-end trainable system with RGBD images.
  • Figure 3: Overview of SCFlow2. Given an RGBD image and an initial pose of the target, we first render the target to obtain a synthetic RGBD image as the reference, and use an RGB encoder and a depth encoder to extract features from the image pair, which are used to create a 4D correlation volume. Based on the correlation volume and a GRU, we use an intermediate flow regressor to predict 3D scene flow that is represented as a dense 3D transformation field $\mathbf{T}_{k}^{\prime}$. We then use a pose regressor to predict a global pose update $\Delta\mathbf{P}_{k}$ based on implicit voting from the pixel-wise 3D transformation field. Finally, the updated pose $\mathbf{P}_{k}$ is used to compute the pose-induced flow $F_{k}$ based on the target mesh to index the correlation volume for the next iteration. Note how the depth and 3D shape of the target are embedded into the framework to guide the optimization iteratively.
  • Figure 4: Visualization of pose-induced flow with 3D scene flow representation. Given a real input image and the rendered image based on an initial pose, we predict 3D scene flow represented as a dense SE(3) motion field. We represent the motion field as a twist field ($\tau$, $\theta$). In theory, the motion field should be constant across pixels for rigid objects. We use a pose regressor to predict a global object-level pose based on the noisy motion fields. We then use the updated global pose to generate a pose-induced flow based on the target 3D mesh. The pose-induced flow embeds the shape prior of the target and reduces the search space for matching.
  • Figure 5: Qualitative results. Given the same pose initialization as that in GenFlow ("GFlow") and FoundPose ("FPose"), denoted as "GFlow (init)" and "FPose (init)" respectively, our refinement method ("+ Ours") produces considerably more accurate results compared to the refinement approaches in their original methods (note how our reprojected 3D mesh aligns better with the object contours).
  • ...and 3 more figures
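
The pose-induced flow described in the captions of Figures 3 and 4 can be sketched as follows: project the mesh points under the initial pose and under the updated pose, and take the per-point 2D displacement. This is a minimal sketch under stated assumptions, not the authors' code. The helper names (`rodrigues`, `compose`, `pose_induced_flow`), the pinhole intrinsics, the toy mesh points, and the convention that the update ($\tau$, $\theta$) is a camera-frame translation plus an axis-angle rotation applied on the left are all hypothetical. Adding $\tau$ directly as a translation is a simplification of the full SE(3) exponential map.

```python
import math

def rodrigues(theta):
    """Rotation matrix (3x3 nested list) from an axis-angle vector theta."""
    angle = math.sqrt(sum(t * t for t in theta))
    if angle < 1e-12:
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    x, y, z = (t / angle for t in theta)
    c, s = math.cos(angle), math.sin(angle)
    v = 1.0 - c
    return [
        [c + x * x * v, x * y * v - z * s, x * z * v + y * s],
        [y * x * v + z * s, c + y * y * v, y * z * v - x * s],
        [z * x * v - y * s, z * y * v + x * s, c + z * z * v],
    ]

def transform(rotation, translation, point):
    """Apply a rigid transform: rotation * point + translation."""
    return [sum(rotation[r][k] * point[k] for k in range(3)) + translation[r]
            for r in range(3)]

def compose(update, pose):
    """Left-compose a camera-frame update (dR, dt) with a pose (R, t)."""
    d_rot, d_trans = update
    rot, trans = pose
    new_rot = [[sum(d_rot[r][k] * rot[k][c] for k in range(3)) for c in range(3)]
               for r in range(3)]
    return new_rot, transform(d_rot, d_trans, trans)

def project(fx, fy, cx, cy, point):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = point
    return fx * x / z + cx, fy * y / z + cy

def pose_induced_flow(points, pose_init, pose_new, intrinsics):
    """2D displacement of each object-frame point between the two poses."""
    flow = []
    for p in points:
        u0, v0 = project(*intrinsics, transform(*pose_init, p))
        u1, v1 = project(*intrinsics, transform(*pose_new, p))
        flow.append((u1 - u0, v1 - v0))
    return flow

# Hypothetical setup: two mesh points, object 1 m in front of the camera.
intrinsics = (500.0, 500.0, 320.0, 240.0)  # fx, fy, cx, cy (assumed values)
points = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]]
pose_init = (rodrigues([0.0, 0.0, 0.0]), [0.0, 0.0, 1.0])

# Assumed update: no rotation (theta = 0), 2 cm shift along camera x (tau).
update = (rodrigues([0.0, 0.0, 0.0]), [0.02, 0.0, 0.0])
pose_new = compose(update, pose_init)
flow = pose_induced_flow(points, pose_init, pose_new, intrinsics)
```

Because every flow vector is derived from a single rigid pose and the mesh geometry, the resulting field is consistent with the object's shape by construction, which is the sense in which the captions say the flow embeds the shape prior. In this toy case the shift is parallel to the image plane and both points sit at the same depth, so both points move by the same amount in the image.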