NOPE: Novel Object Pose Estimation from a Single Image

Van Nguyen Nguyen; Thibault Groueix; Yinlin Hu; Mathieu Salzmann; Vincent Lepetit

TL;DR

NOPE estimates the relative 3D pose of an unseen object from a single reference image, without requiring a 3D model of the object or any retraining. A U-Net with attention, conditioned on the relative pose, learns to predict average embeddings of novel views; the pose is then recovered by fast template matching between the query image's embedding and the embeddings predicted for a fixed set of viewpoints. The resulting distribution over poses also reveals ambiguities caused by symmetry or occlusion. The approach generalizes to novel categories on ShapeNet, gives robust results on T-LESS, tolerates partial occlusions, and runs in around 1 s on a single GPU, which makes it a practical option for fast, model-free pose estimation in robotics and AR.
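
The template-matching step summarized above can be illustrated with a minimal sketch. It assumes each view is represented by a flat embedding vector and that similarity is measured by cosine similarity followed by a softmax; the function name `match_templates`, the `temperature` parameter, and the random embeddings are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def match_templates(query_emb, template_embs, temperature=0.1):
    """Return (best_index, probabilities) over candidate relative poses.

    query_emb:     (D,) embedding of the query image.
    template_embs: (N, D) predicted embeddings, one per candidate relative pose.
    temperature:   hypothetical softmax temperature (an assumption, not from the paper).
    """
    # Normalize so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    t = template_embs / np.linalg.norm(template_embs, axis=1, keepdims=True)
    sims = t @ q                      # one similarity score per candidate pose
    logits = sims / temperature
    logits = logits - logits.max()    # subtract the max for numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum()       # distribution over candidate poses
    return int(np.argmax(probs)), probs

# Toy data: 500 random "predicted" embeddings; the query is a noisy copy of one of them.
rng = np.random.default_rng(0)
templates = rng.normal(size=(500, 64))
query = templates[123] + 0.05 * rng.normal(size=64)
best, probs = match_templates(query, templates)
```

In this toy setup the index of the most probable candidate identifies the relative pose; in practice each index would map to one of the fixed viewpoints for which an embedding was predicted.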

Abstract

The practicality of 3D object pose estimation remains limited for many applications due to the need for prior knowledge of a 3D model and a training period for new objects. To address this limitation, we propose an approach that takes a single image of a new object as input and predicts the relative pose of this object in new images without prior knowledge of the object's 3D model and without requiring training time for new objects and categories. We achieve this by training a model to directly predict discriminative embeddings for viewpoints surrounding the object. This prediction is done using a simple U-Net architecture with attention and conditioned on the desired pose, which yields extremely fast inference. We compare our approach to state-of-the-art methods and show it outperforms them both in terms of accuracy and robustness. Our source code is publicly available at https://github.com/nv-nguyen/nope

Paper Structure (25 sections, 4 equations, 8 figures, 4 tables)

Introduction
Related Work
Novel view synthesis from a single image
Generalizable object pose estimation
Method
Formalization
Framework
Training
Pose prediction
Template matching
Detecting pose ambiguities
Experiments
Experimental setup
Synthetic dataset
Real-world dataset
...and 10 more sections

Figures (8)

Figure 1: Given as input a single reference view of a novel object, our method predicts the relative 3D pose (rotation) of a query view and its ambiguities. We visualize the predicted pose by rendering the object from this pose, but the 3D model is only used for visualization purposes, not as input to our method. Our method works by estimating a probability distribution over the space of 3D poses, visualized here on a sphere centered on the object. We use the canonical pose of the 3D model to visualize this distribution, but not as input to our method. From this distribution, we can also identify the pose ambiguities: For example, in the case of the bottle, any pose with the same pitch and roll is possible; in the case of the mug, a range of poses are possible as the handle is not visible in the query image. Our method is also robust to partial occlusions, as shown on the clock hidden in part by a rectangle in the query image.
Figure 2: The limit of novel view synthesis for pose prediction. While the images generated by Wonder3D (Long et al., 2023) look very realistic, they have to invent unseen parts, impairing the similarity computation between the query image and the generated view, and hence the pose estimation: The probability distributions computed by template matching do not peak on the right pose but show many wrong local maxima. This is not a limitation of Wonder3D but of view synthesis from a single view in general.
Figure 3: Overview. During training, we train a U-Net to predict the embedding of a novel view of an object, given a reference image of the object and a relative pose. The U-Net is conditioned on an embedding of the relative pose computed using an MLP, which we train jointly with the U-Net. At inference, our method first takes as input a reference image of a new object and predicts the embeddings of views of the object under many relative poses. This inference takes around 1 second on a single V100 GPU. Then, given a query image of the object, we first compute its embedding and match it against the set of predicted embeddings. This gives us a distribution over the possible relative poses between the reference and query images, where the maximum corresponds to the predicted pose.
Figure 4: Object symmetries and the pose ambiguities they may generate, as estimated by our method given a pair of reference and query images.
Figure 5: Visualization of training and test sets from the ShapeNet dataset (Chang et al., 2015). The shapes and appearances in the training and test sets are very different and thus constitute a good test bed for generalization to unseen categories.
...and 3 more figures
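
The captions of Figures 1 and 4 describe reading pose ambiguities off the estimated distribution over poses. One simple way to do this, sketched here as an assumption rather than as the paper's actual criterion, is to collect every candidate pose whose probability is within a fixed fraction of the maximum. The names `ambiguous_poses`, `is_ambiguous`, and `rel_threshold` are hypothetical.

```python
import numpy as np

def ambiguous_poses(probs, rel_threshold=0.5):
    """Indices of candidate poses whose probability is at least
    rel_threshold times the maximum (rel_threshold is an assumed parameter)."""
    probs = np.asarray(probs, dtype=float)
    return np.flatnonzero(probs >= rel_threshold * probs.max())

def is_ambiguous(probs, rel_threshold=0.5):
    """True when more than one candidate pose passes the threshold."""
    return ambiguous_poses(probs, rel_threshold).size > 1

# A sharply peaked distribution (single plausible pose) versus a
# spread-out one (several near-equally plausible poses, as with a symmetric object).
peaked = np.array([0.01, 0.02, 0.94, 0.03])
spread = np.array([0.30, 0.32, 0.05, 0.33])
peaked_flag = is_ambiguous(peaked)
spread_flag = is_ambiguous(spread)
spread_candidates = ambiguous_poses(spread).tolist()
```

With densely sampled viewpoints, the neighbours of a single correct pose would also pass such a threshold, so a practical version would likely need to group nearby viewpoints before counting modes.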
