Enhancing Embodied Object Detection through Language-Image Pre-training and Implicit Object Memory
Nicolas Harvey Chapman, Feras Dayoub, Will Browne, Chris Lehnert
TL;DR
The paper addresses embodied object detection in robotics by integrating detectors trained on language-image data with a novel implicit memory that aggregates object features across long temporal and spatial horizons using projective geometry. A memory-augmented pipeline reads from and writes to a ground-plane memory grid, enhancing pixel features to improve 2D detections while maintaining open-vocabulary capabilities. Across the Habitat Matterport3D and Replica datasets, the approach outperforms the vanilla detector and other external-memory baselines, with improvements of up to 3.09 mAP and strong open-vocabulary performance, and it is robust to depth noise, pose noise and domain shift. Real-world robot experiments support its practicality, showing improvements in recall and precision, and point to handling dynamic objects and unseen classes as future work.
Abstract
Deep learning and large-scale language-image training have produced image object detectors that generalise well to diverse environments and semantic classes. However, single-image object detectors trained on internet data are not optimally tailored for the embodied conditions inherent in robotics. Instead, robots must detect objects from complex multi-modal data streams involving depth, localisation and temporal correlation, a task termed embodied object detection. Paradigms such as Video Object Detection (VOD) and Semantic Mapping have been proposed to leverage such embodied data streams, but existing work fails to enhance performance using language-image training. In response, we investigate how an image object detector pre-trained using language-image data can be extended to perform embodied object detection. We propose a novel implicit object memory that uses projective geometry to aggregate the features of detected objects across long temporal horizons. The spatial and temporal information accumulated in memory is then used to enhance the image features of the base detector. When tested on embodied data streams sampled from diverse indoor scenes, our approach improves the base object detector by 3.09 mAP, outperforming alternative external memories designed for VOD and Semantic Mapping. Our method also shows a significant improvement of 16.90 mAP relative to baselines that perform embodied object detection without first training on language-image data, and is robust to the sensor noise and domain shift experienced in real-world deployment.
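The write and read steps of a ground-plane memory can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the paper's memory is implicit and its update and fusion are learned, whereas here a plain running mean is swapped in. The function names (`backproject`, `to_cells`, `write_memory`, `read_memory`), the pinhole intrinsics `K`, the axis convention (camera z forward, world x and z spanning the ground plane), the grid size and the cell size are all assumptions made for illustration.

```python
import numpy as np

def backproject(depth, K, cam_to_world):
    """Lift every pixel to a 3D world point from depth, intrinsics K and a 4x4 pose."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]                       # pixel rows (v) and columns (u)
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1).reshape(-1, 4)
    return (pts_cam @ cam_to_world.T)[:, :3]

def to_cells(points, grid_size, cell_m):
    """Map world points to ground-plane cells (assumed: x and z are horizontal)."""
    ij = np.floor(points[:, [0, 2]] / cell_m).astype(int) + grid_size // 2
    valid = np.all((ij >= 0) & (ij < grid_size), axis=1)
    return ij, valid

def write_memory(sums, counts, feats, ij, valid):
    """Accumulate per-pixel features into their cells (running-mean stand-in)."""
    i, j = ij[valid, 0], ij[valid, 1]
    np.add.at(sums, (i, j), feats[valid])           # unbuffered add handles repeated cells
    np.add.at(counts, (i, j), 1.0)

def read_memory(sums, counts, ij, valid):
    """Return, for each pixel, the mean feature stored in the cell it projects to."""
    out = np.zeros((ij.shape[0], sums.shape[-1]))
    i, j = ij[valid, 0], ij[valid, 1]
    out[valid] = sums[i, j] / np.maximum(counts[i, j], 1.0)[:, None]
    return out

# Toy example: two observations of the same 4x4 view with different feature values.
G, C = 16, 3                                        # hypothetical grid size and channels
sums, counts = np.zeros((G, G, C)), np.zeros((G, G))
K = np.array([[4.0, 0.0, 2.0], [0.0, 4.0, 2.0], [0.0, 0.0, 1.0]])
depth = np.full((4, 4), 2.0)
ij, valid = to_cells(backproject(depth, K, np.eye(4)), G, 0.5)
for value in (1.0, 3.0):
    write_memory(sums, counts, np.full((16, C), value), ij, valid)
fused = read_memory(sums, counts, ij, valid)        # per-pixel features read from memory
```

In this toy setup the features read back for each pixel are the average of the two observations. In the approach described above, the features read from memory are instead used to enhance the base detector's image features, which this sketch does not attempt to reproduce.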
