CRISP: Object Pose and Shape Estimation with Test-Time Adaptation

Jingnan Shi, Rajat Talak, Harry Zhang, David Jin, Luca Carlone

TL;DR

CRISP tackles category-agnostic object pose and shape estimation from RGB-D by fusing a FiLM-conditioned implicit shape decoder with a DPT-based pose estimator, yielding robust pose (R,t) and shape SDF estimates. It introduces a bi-level pose-and-shape corrector and an active shape decoder to enable efficient refinement, plus CRISP-ST, a test-time self-training framework using observable-certificates to generate pseudo-labels without synthetic data. Across YCBV, SPE3R, and NOCS, CRISP achieves strong shape reconstructions and competitive or superior pose accuracy, while self-training bridges large domain gaps and enables generalization to unseen objects. The combination of fast inference, multi-view refinement, and test-time adaptation makes CRISP a practical foundation for real-time robotic perception and manipulation tasks.

Abstract

We consider the problem of estimating object pose and shape from an RGB-D image. Our first contribution is to introduce CRISP, a category-agnostic object pose and shape estimation pipeline. The pipeline implements an encoder-decoder model for shape estimation. It uses FiLM-conditioning for implicit shape reconstruction and a DPT-based network for estimating pose-normalized points for pose estimation. As a second contribution, we propose an optimization-based pose and shape corrector that can correct estimation errors caused by a domain gap. Observing that the shape decoder is well behaved in the convex hull of known shapes, we approximate the shape decoder with an active shape model, and show that this reduces the shape correction problem to a constrained linear least squares problem, which can be solved efficiently by an interior point algorithm. Third, we introduce a self-training pipeline to perform self-supervised domain adaptation of CRISP. The self-training is based on a correct-and-certify approach, which leverages the corrector to generate pseudo-labels at test time, and uses them to self-train CRISP. We demonstrate CRISP (and the self-training) on YCBV, SPE3R, and NOCS datasets. CRISP shows high performance on all the datasets. Moreover, our self-training is capable of bridging a large domain gap. Finally, CRISP also shows an ability to generalize to unseen objects. Code and pre-trained models will be available on https://web.mit.edu/sparklab/research/crisp_object_pose_shape/.
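The constrained linear least squares step mentioned in the abstract can be illustrated with a toy sketch. It assumes the active shape model writes the SDF at each observed point as a convex combination of the SDF values of K known shapes, so shape correction becomes: minimize ‖F c‖² over weights c with c ≥ 0 and Σc = 1. The matrix `F`, the function names, and the toy data are hypothetical. To stay dependency-free, the sketch uses projected gradient descent rather than the interior point algorithm named in the abstract.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {c : c >= 0, sum(c) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0:
            theta = t  # the last j passing this check sets the threshold
    return [max(x - theta, 0.0) for x in v]


def solve_shape_weights(F, steps=5000):
    """Minimize ||F c||^2 over the simplex by projected gradient descent.

    F[i][k] is assumed to hold the SDF value of known shape k at point i.
    """
    n, k = len(F), len(F[0])
    # Gram matrix G = F^T F, so the objective is c^T G c.
    G = [[sum(F[i][a] * F[i][b] for i in range(n)) for b in range(k)]
         for a in range(k)]
    # trace(G) bounds the largest eigenvalue of G, giving a safe step size.
    step = 1.0 / (2.0 * sum(G[a][a] for a in range(k)) + 1e-12)
    c = [1.0 / k] * k  # start at the centroid of the simplex
    for _ in range(steps):
        grad = [2.0 * sum(G[a][b] * c[b] for b in range(k)) for a in range(k)]
        c = project_to_simplex([c[a] - step * grad[a] for a in range(k)])
    return c


# Toy data: column 0 equals -3 * column 1, so the weights (0.25, 0.75, 0)
# place every point exactly on the zero level set of the blended SDF.
F = [[3.0, -1.0, 1.0],
     [-6.0, 2.0, 1.0],
     [9.0, -3.0, 1.0],
     [1.5, -0.5, 1.0]]
weights = solve_shape_weights(F)
print(weights)
```

Restricting the weights to the simplex keeps the estimated shape inside the convex hull of known shapes, which is where the abstract says the decoder is well behaved.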

Paper Structure

This paper contains 39 sections, 1 theorem, 24 equations, 13 figures, 13 tables, 2 algorithms.

Key Result

Lemma 4

If $p^\ast$ and $f^{\ast}$ are the optimal values of the problems eq:corrector-01 and eq:corrector-02, respectively, then Let $({\bm Z}^\ast, {\bm h}^\ast, {\bm R}^\ast, {\bm t}^\ast)$ be the optimal solution to eq:corrector-02, and $\Delta = {\bm R}^\ast {\bm X} + {\bm t}^{\ast}{\mathbf 1}_n^{\mathsf{T}} - {\bm Z}^{\ast}$, then $({\bm Z}^\ast + \Delta, {\bm h}^\ast, {\bm R}^\ast, {\bm t}^\ast)$

Figures (13)

  • Figure 1: We introduce CRISP, a category-agnostic object pose and shape estimation pipeline, and a test-time adaptive self-training method CRISP-ST to bridge domain gaps. Top: Qualitative examples of CRISP on the YCBV dataset Xiang17rss-posecnn. Bottom: Qualitative examples of CRISP on the SPE3R dataset Park24aiaa-spe3r.
  • Figure 2: Overview of our contributions. Given a segmented RGB image ${\cal I}$ and depth points ${\bm X}$ of the object, CRISP extracts features from the cropped image. It estimates the object pose by estimating pose-normalized coordinates (PNC) ${\bm Z}$, and the shape by reconstructing the signed distance field (SDF) of the object. The pose and shape estimates are refined by the corrector, which solves a bi-level optimization problem using two solvers: BCD (Alg. \ref{alg:corrector}) and LSQ (Alg. \ref{alg:corrector-lsq}). The self-training uses corrected estimates that pass the observable certification check (\ref{eq:oc-cert}) as pseudo-labels. The SDF decoder is fixed during self-training.
  • Figure 3: Visualization of the mesh extracted from the SDF produced by the shape decoder $f_d(\cdot~|~{\bm h})$ as the latent shape code takes values ${\bm h} = \alpha {\bm h}_1 + (1 - \alpha) {\bm h}_2$, given two shape codes ${\bm h}_1$ and ${\bm h}_2$. The trained decoder does not produce plausible shapes under extrapolation.
  • Figure 4: The minimum eigenvalue of the matrix ${\bm F}({\bm Z})^{\mathsf{T}} {\bm F}({\bm Z})$ as a function of the number of keyframes $N$. Each keyframe captures the mug from a different viewing angle. ${\bm F}({\bm Z})$ is computed using the estimated PNC ${\bm Z}$, aggregated over all keyframes up to $N$.
  • Figure 5: Qualitative examples of CRISP on the YCBV dataset. Top: projection of the reconstructed mesh, transformed by our estimated pose. Bottom: reconstructed mesh. See Appendix for more examples.
  • ...and 8 more figures
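
The quantity plotted in Figure 4 can be sketched on toy data. The sketch assumes that each keyframe contributes additional rows to ${\bm F}({\bm Z})$ and, for simplicity, that the matrix has only two columns, so the minimum eigenvalue of the 2×2 Gram matrix has a closed form. The function names and the numbers are hypothetical and are not taken from the paper.

```python
import math


def gram_2x2(rows):
    """Entries (a, b, d) of the symmetric Gram matrix [[a, b], [b, d]] = F^T F."""
    a = sum(r[0] * r[0] for r in rows)
    b = sum(r[0] * r[1] for r in rows)
    d = sum(r[1] * r[1] for r in rows)
    return a, b, d


def min_eig_sym_2x2(a, b, d):
    """Smallest eigenvalue of the symmetric matrix [[a, b], [b, d]]."""
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    return mean - radius


def min_eig_per_keyframe(keyframes):
    """Minimum eigenvalue of F^T F after aggregating keyframes 1..N, for each N."""
    rows, values = [], []
    for frame in keyframes:
        rows.extend(frame)  # stack this keyframe's rows onto F
        values.append(min_eig_sym_2x2(*gram_2x2(rows)))
    return values


# Toy keyframes: the first view yields collinear rows (a degenerate problem);
# later views add rows in new directions.
keyframes = [
    [(1.0, 1.0), (2.0, 2.0)],
    [(1.0, -1.0)],
    [(0.5, 2.0)],
]
values = min_eig_per_keyframe(keyframes)
print(values)
```

Because each new row adds a positive semidefinite term to the Gram matrix, the minimum eigenvalue cannot decrease as keyframes are aggregated, which is why additional views can only improve the conditioning of the least squares problem.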

Theorems & Definitions (5)

  • Remark 1: Difference from NOCS Wang19-normalizedCoordinate
  • Remark 2: Correcting to Ground Truth
  • Remark 3: Multi-View Pose and Shape Corrector
  • Lemma 4
  • proof