Finding NeMO: A Geometry-Aware Representation of Template Views for Few-Shot Perception

Sebastian Jung, Leonard Klüpfel, Rudolph Triebel, Maximilian Durner

TL;DR

The paper addresses reliable perception of unseen objects from a few RGB template views, without camera calibration or retraining. It introduces NeMO, a geometry-aware, object-centric representation that encodes object geometry as a sparse 3D point cloud derived from a learned unsigned distance function (UDF). Because object information is decoupled from the network weights, a single network can perform detection, segmentation, and 6DoF pose estimation in both model-free and model-based settings. The encoder–decoder architecture, trained with a combination of geometric and dense-prediction losses, supports offline precomputation and scalable multi-view fusion. It achieves competitive or state-of-the-art results on BOP benchmarks and provides qualitative surface reconstruction. The approach is complemented by a synthetic, object-centric dataset and extensive supplementary analyses, which highlight its potential for quick object onboarding and generalization to novel instances without retraining.

Abstract

We present Neural Memory Object (NeMO), a novel object-centric representation that can be used to detect, segment and estimate the 6DoF pose of objects unseen during training using RGB images. Our method consists of an encoder that requires only a few RGB template views depicting an object to generate a sparse object-like point cloud using a learned UDF containing semantic and geometric information. Next, a decoder takes the object encoding together with a query image to generate a variety of dense predictions. Through extensive experiments, we show that our method can be used for few-shot object perception without requiring any camera-specific parameters or retraining on target data. Our proposed concept of outsourcing object information in a NeMO and using a single network for multiple perception tasks enhances interaction with novel objects, improving scalability and efficiency by enabling quick object onboarding without retraining or extensive pre-processing. We report competitive and state-of-the-art results on various datasets and perception tasks of the BOP benchmark, demonstrating the versatility of our approach. https://github.com/DLR-RM/nemo
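To make the role of an unsigned distance function (UDF) concrete, here is a minimal sketch of how randomly sampled points can be pulled onto a surface using only unsigned distances. It is not the authors' encoder: the learned UDF is replaced by an analytic sphere (`sphere_udf`), and the projection step (`project_to_surface`, moving each point against a finite-difference gradient by its distance) is a generic UDF technique assumed here for illustration. All names and parameters are hypothetical.

```python
import numpy as np


def sphere_udf(points, radius=0.5):
    """Unsigned distance from each point to a sphere centred at the origin.

    Stand-in (assumption) for a learned UDF; the paper predicts distances
    with a network conditioned on template-view features instead.
    """
    return np.abs(np.linalg.norm(points, axis=-1) - radius)


def project_to_surface(points, udf, eps=1e-4):
    """Move each point against the UDF gradient by its unsigned distance."""
    grads = np.zeros_like(points)
    for axis in range(points.shape[1]):
        offset = np.zeros(points.shape[1])
        offset[axis] = eps
        # Central finite difference approximates the UDF gradient per axis.
        grads[:, axis] = (udf(points + offset) - udf(points - offset)) / (2 * eps)
    norms = np.linalg.norm(grads, axis=-1, keepdims=True)
    directions = grads / np.maximum(norms, 1e-12)
    return points - udf(points)[:, None] * directions


rng = np.random.default_rng(0)
# Randomly sampled initial point cloud inside the cube [-1, 1]^3.
initial_points = rng.uniform(-1.0, 1.0, size=(256, 3))
surface_points = project_to_surface(initial_points, sphere_udf)
# Remaining unsigned distance after projection (close to zero).
residual = sphere_udf(surface_points)
```

With a learned UDF in place of `sphere_udf`, the same idea turns a random point cloud into a sparse, object-like one; how NeMO attaches features to those points is not covered by this sketch.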

Paper Structure

This paper contains 18 sections, 8 equations, 11 figures, 9 tables, 1 algorithm.

Figures (11)

  • Figure 1: Overview. Our method uses a multi-view encoder to generate an object-centric geometric encoding, called Neural Memory Object (NeMO), with its own coordinate system from a set of RGB images depicting an object unseen during training. A decoder uses the NeMO to retrieve dense predictions, allowing us to detect and segment the object, estimate its surface, and determine the camera-to-object position on an RGB query image. Even in cluttered scenes, our method is able to find the object, which we can use to crop the corresponding region of interest, demonstrating that our method can be used in multi-stage perception pipelines. Images were captured using a standard smartphone.
  • Figure 2: Overview of the approach. RGB template views are first processed by a ViT [dosovitskiy2020image] and a multi-view encoder [jiang_leap_2023], producing updated image features. To incorporate spatial information, a randomly sampled point cloud is processed by an MLP and attends to the image features in our proposed Geometric Mapping block, yielding feature-enhanced 3D points that form the NeMO. A decoder attends a query image with the NeMO using multiple Cross- and Self-Attention blocks to generate multiple dense predictions. We represent the NeMO as a point cloud and reduce the higher-dimensional NeMO features to RGB using PCA [sarker2021ml]. The PCA reduction shows a relation between the learned features and the object's geometry and semantics.
  • Figure 3: Geometric Mapping Block. We fuse the updated image features (key-value pairs) with the pre-processed geometric features (queries) in multiple transformer decoder blocks. The features are then forwarded to a module that estimates the unsigned distance from the initial point cloud to the estimated object surface. After these points are processed further, we combine them with the updated geometric features, resulting in the NeMO.
  • Figure 4: Qualitative Example of Model-Free Few-Shot Detection and Pose Estimation on HOPEv2. Left: the scene without annotations. Right: detections in green and refined pose estimates as rendered overlays. Even in the underexposed scene, the model predicts reasonable results.
  • Figure 5: Object Surface Reconstruction and Camera Pose Estimation on Unseen Objects. We show object surface points and camera poses as predicted by the decoder based on four images of randomly chosen objects in different scenarios: (Left) A static coffee machine standing on a table, captured by a dynamic camera. (Middle) A label machine in different environments, including occlusions. (Right) An espresso mug manipulated in hand, captured by a static camera. In all three scenarios, our model is able to predict object-centric camera poses and surface points. We map each RGB pixel color to its corresponding 3D point to show the correct 2D-3D mapping. The anchor image is shown in blue.
  • ...and 6 more figures
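
The Figure 2 caption mentions reducing the higher-dimensional NeMO features to RGB with PCA for visualization. A minimal sketch of that kind of reduction follows, assuming per-point features stored as an `(N, D)` array. The function name `features_to_rgb`, the random stand-in features, and the min-max scaling to [0, 1] are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def features_to_rgb(features):
    """Project (N, D) point features onto their top 3 principal components
    and rescale each component to [0, 1] so it can be used as a colour."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:3].T
    lo = projected.min(axis=0, keepdims=True)
    hi = projected.max(axis=0, keepdims=True)
    return (projected - lo) / np.maximum(hi - lo, 1e-12)


rng = np.random.default_rng(0)
# Hypothetical stand-in for learned per-point features (512 points, 64 dims).
point_features = rng.normal(size=(512, 64))
colors = features_to_rgb(point_features)
```

Each row of `colors` can then be assigned to the corresponding 3D point when plotting, so that points with similar features receive similar colors.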