Enhancing Embodied Object Detection through Language-Image Pre-training and Implicit Object Memory

Nicolas Harvey Chapman; Feras Dayoub; Will Browne; Chris Lehnert

TL;DR

The paper addresses embodied object detection in robotics by integrating detectors trained on language-image data with a novel implicit memory that aggregates object features across long temporal and spatial horizons using projective geometry. A memory-augmented pipeline reads and writes to a ground-plane memory grid, enhancing pixel features to improve 2D detections while maintaining open-vocabulary capabilities. Across Habitat Matterport3D and Replica datasets, the approach yields substantial gains over vanilla and other external-memory baselines, including improvements of up to 3.09 mAP and strong open-vocabulary performance, with demonstrated robustness to depth/pose noise and domain shift. Real-world robot experiments validate practicality, showing tangible improvements in recall and precision and highlighting avenues for handling dynamic objects and unseen classes in future work.

Abstract

Deep learning and large-scale language-image training have produced image object detectors that generalise well to diverse environments and semantic classes. However, single-image object detectors trained on internet data are not optimally tailored for the embodied conditions inherent in robotics. Instead, robots must detect objects from complex multi-modal data streams involving depth, localisation and temporal correlation, a task termed embodied object detection. Paradigms such as Video Object Detection (VOD) and Semantic Mapping have been proposed to leverage such embodied data streams, but existing work fails to enhance performance using language-image training. In response, we investigate how an image object detector pre-trained using language-image data can be extended to perform embodied object detection. We propose a novel implicit object memory that uses projective geometry to aggregate the features of detected objects across long temporal horizons. The spatial and temporal information accumulated in memory is then used to enhance the image features of the base detector. When tested on embodied data streams sampled from diverse indoor scenes, our approach improves the base object detector by 3.09 mAP, outperforming alternative external memories designed for VOD and Semantic Mapping. Our method also shows a significant improvement of 16.90 mAP relative to baselines that perform embodied object detection without first training on language-image data, and is robust to the sensor noise and domain shift experienced in real-world deployment.

Paper Structure (28 sections, 14 equations, 7 figures, 4 tables)

Introduction
Related Work
Embodied Object Detection
Language-Image Pre-training for Object Detection
Video Object Detection
Semantic Mapping and 3D Object Detection
Preliminaries
Problem Formulation
Object Detection with Language-Image Embeddings
Method
Spatial Structure of Implicit Object Memory
Writing to Implicit Object Memory
Pixel Feature Enhancement with Implicit Object Memory
Baseline External Memories
Dataset Experiments
...and 13 more sections

Figures (7)

Figure 1: Our proposed method for enhancing embodied object detection with language-image training and a novel external memory (top). Our implicit object memory uses projective geometry to aggregate the features of detected objects across long temporal horizons. The spatial and temporal information stored in this external memory is then used to enhance the image features of the base detector. We evaluate our method on embodied data streams sampled from two datasets of indoor scenes (bottom). Relative to performing vanilla object detection (blue), the inclusion of language-image pre-training (red) leads to an increase of 7.40 mAP on Matterport3D and 16.90 mAP on the Replica test sets. Adding our implicit object memory (green) results in a further 2.46 mAP and 3.09 mAP improvement respectively.
Figure 2: Our implicit object memory for enhancing the feature space of a base object detector with spatial and temporal information. Our memory write operation (orange) first selects all objects predicted by the base detector with class-specific likelihood score $s^{t}$ greater than some threshold $\tau_{s}$. Projective geometry is then used to map the object features $z_{o}^{t}$ from their location in the image-frame to the corresponding location in spatial memory. The projected object features $F^{t}$ are then summed at each location with those already in implicit object memory $M^{t}$ to generate the updated memory matrix $M^{t+1}$. A count $V^{t}$ of the number of times each memory location has been viewed by the robot is also incremented. The read operation involves normalising each feature in implicit object memory $M^{t}$ based on how many times the location has been viewed. The normalised features $\left|M^{t}\right|$ are projected back into the image-frame using the reverse image projection. The resulting egocentric memory features $z_{m}^{t}$ are passed through a linear projection layer $W_{m}$ to align them with the original pixel features $z^{t}_{p}$. To produce the final enhanced pixel features $z^{t}_{e}$, the egocentric memory features are weighted by a coefficient $\lambda$ and summed with the original pixel features.
Figure 3: Performance of alternative external memories across different sequence lengths on the Matterport3D test set. All external memories are trained with fine-tuned DETIC as the base model. To generate results across 100 episodes, the 50 episodes from each scene are processed twice in sequential order, with the external memory allowed to persist across all 100 episodes. Performance is calculated separately for different stages to investigate whether the external memory becomes more useful as the number of episodes increases. Performance across a single episode is also reported, whereby the memory is instead reset at the end of every episode.
Figure 4: Performance of alternative external memories when subjected to noise in depth and localisation. All external memories are trained with fine-tuned DETIC as the base model. Gaussian noise with a standard deviation of 0.1 m is added to the depth and position readings, and 0.01 radians to the robot heading. The standard deviation is scaled by factors of 2, 5 and 10 until significant performance degradation is observed.
Figure 5: Sensitivity of proposed external memories to key hyper-parameters on the Matterport3D test set. In each test, the isolated parameter is swept while keeping all other aspects of implementation constant and fine-tuned DETIC is used as the base model. The dotted line represents the performance of the fine-tuned DETIC model.
...and 2 more figures
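
The write and read operations described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`world_to_cell`, `write_memory`, `read_memory`), the `origin` and `cell_size` grid parameters, and the choice to normalise by dividing by the view count are all assumptions made for illustration. The sketch also assumes that the projective geometry step has already been applied, so objects and pixels arrive with ground-plane coordinates derived from depth and pose.

```python
import numpy as np

def world_to_cell(xy, origin, cell_size, grid_shape):
    """Map ground-plane coordinates (N, 2) to integer memory-grid indices (N, 2)."""
    idx = np.floor((xy - origin) / cell_size).astype(int)
    return np.clip(idx, 0, np.array(grid_shape) - 1)

def write_memory(M, V, obj_feats, obj_scores, obj_xy, viewed_xy,
                 tau_s, origin, cell_size):
    """Sum features of confident objects into M and increment view counts V."""
    M, V = M.copy(), V.copy()
    keep = obj_scores > tau_s                      # select objects with s^t > tau_s
    cells = world_to_cell(obj_xy[keep], origin, cell_size, V.shape)
    # np.add.at accumulates correctly when several objects share one cell.
    np.add.at(M, (cells[:, 0], cells[:, 1]), obj_feats[keep])
    # Each cell observed in this frame is counted once, however many points hit it.
    viewed = np.unique(world_to_cell(viewed_xy, origin, cell_size, V.shape), axis=0)
    V[viewed[:, 0], viewed[:, 1]] += 1
    return M, V

def read_memory(M, V, pixel_xy, z_p, W_m, lam, origin, cell_size):
    """Return enhanced pixel features z_e = z_p + lam * (z_m @ W_m)."""
    # Assumed normalisation: divide by view count, leaving unseen cells at zero.
    M_norm = M / np.maximum(V, 1)[..., None]
    cells = world_to_cell(pixel_xy, origin, cell_size, V.shape)
    z_m = M_norm[cells[:, 0], cells[:, 1]]         # egocentric memory features
    return z_p + lam * (z_m @ W_m)

# Toy example: a 4x4 grid with 2-dimensional features and 1 m cells.
origin, cell_size = np.zeros(2), 1.0
M, V = np.zeros((4, 4, 2)), np.zeros((4, 4), dtype=int)
feats = np.array([[1.0, 2.0], [5.0, 5.0]])
scores = np.array([0.9, 0.1])                      # second object is below tau_s
xy = np.array([[1.5, 2.5], [0.5, 0.5]])
for _ in range(2):                                 # observe the same scene twice
    M, V = write_memory(M, V, feats, scores, xy, xy, 0.5, origin, cell_size)
z_e = read_memory(M, V, np.array([[1.2, 2.9]]), np.array([[10.0, 10.0]]),
                  np.eye(2), 0.5, origin, cell_size)
```

In this toy run the confident object is written twice into the same cell and that cell is viewed twice, so the normalised memory feature equals the object's original feature. The low-scoring object never enters memory, although its cell is still counted as viewed.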
