LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation

Tianrui Guan, Yurou Yang, Harry Cheng, Muyuan Lin, Richard Kim, Rajasimman Madhivanan, Arnie Sen, Dinesh Manocha

TL;DR

LOC-ZSON tackles zero-shot object navigation by decoupling retrieval from navigation and introducing a language-driven, object-centric image representation. It combines a slot-attention-based object encoder with a text encoder, trained with multi-label and matching losses, and uses LLM-driven data augmentation and prompting to stabilize VLM fine-tuning. The approach improves text-to-image recall by $1.38$–$13.38\%$ and navigation success rate by $5\%$ in simulation and $16.67\%$ in the real world, evaluated on indoor datasets and on simulated and real robot setups. This retrieval-first paradigm demonstrates strong object grounding in memory for open-world navigation and points to future integration with end-to-end RL and broader prompting strategies.

Abstract

In this paper, we present LOC-ZSON, a novel Language-driven Object-Centric image representation for the object navigation task within complex scenes. We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning, which can handle complex object-level queries. In addition, we design novel LLM-based augmentation and prompt templates for stability during training and zero-shot inference. We implement our method on the Astro robot and deploy it in both simulated and real-world environments for zero-shot object navigation. We show that our proposed method achieves an improvement of 1.38–13.38% in text-to-image recall across different benchmark settings for the retrieval task. For object navigation, we show the benefit of our approach in simulation and in the real world, with improvements of 5% and 16.67% in navigation success rate, respectively.

Paper Structure

This paper contains 17 sections, 11 equations, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: Overview of the proposed LOC-ZSON: Our method performs Language-driven Zero-Shot Object Navigation (L-ZSON) in three steps: 1) Language Reasoning: we use LLM-based augmentation and prompt engineering to parse the user query. 2) Zero-Shot Retrieval: we propose a novel object-centric image representation to localize the object in a database of images collected during exploration. 3) Navigation: the ground robot navigates to the top-N candidate locations based on image-pose pairs.
  • Figure 2: Architecture of our proposed Object-Centric Image Encoder in LOC-ZSON: (A) We introduce a novel object-centric image representation with multi-label training. During training, we feed one image with multiple object captions into our object-centric image encoder and text encoder. We use the Hungarian matching algorithm to match local image features with text annotations. (B) For inference, we send the query with our prompt template into the text encoder and retrieve the top image-pose pairs from previous explorations. We pass the corresponding poses to the navigation stack for the object navigation task. The image embeddings can be pre-computed.
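
The two steps described in the Figure 2 caption, matching local image features to captions during training and retrieving image-pose pairs at inference, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The function names, the toy 2-D embeddings, the use of cosine similarity, and the rule of scoring an image by its best-matching slot are all assumptions made for the example. The matching step uses a brute-force search over assignments as a stand-in for the Hungarian algorithm: it returns the same optimal assignment on small inputs but does not scale the way the Hungarian algorithm does.

```python
import math
from itertools import permutations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_slots_to_captions(slot_feats, caption_feats):
    """Assign each caption to a distinct slot, maximizing total similarity.

    Brute-force stand-in for Hungarian matching (hypothetical helper).
    Assumes there are at least as many slots as captions.
    Returns a dict {caption_index: slot_index}.
    """
    sim = [[cosine(c, s) for s in slot_feats] for c in caption_feats]
    best_score, best_perm = -math.inf, None
    # Enumerate every way to pick one distinct slot per caption.
    for perm in permutations(range(len(slot_feats)), len(caption_feats)):
        score = sum(sim[i][j] for i, j in enumerate(perm))
        if score > best_score:
            best_score, best_perm = score, perm
    return dict(enumerate(best_perm))


def retrieve_top_n(query_feat, database, n):
    """Return the poses of the n best-scoring images for a text query.

    database: list of (slot_feats, pose) pairs with pre-computed slot features.
    Assumption: an image's score is the similarity of its best-matching slot.
    """
    scored = [
        (max(cosine(query_feat, s) for s in slots), pose)
        for slots, pose in database
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pose for _, pose in scored[:n]]


# Toy training-time example: three slots, two captions.
slots = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
captions = [[0.0, 1.0], [1.0, 0.0]]
assignment = match_slots_to_captions(slots, captions)

# Toy inference-time example: poses are (x, y, yaw) tuples.
database = [
    ([[1.0, 0.0], [0.9, 0.1]], (0.0, 0.0, 0.0)),
    ([[0.0, 1.0], [0.1, 0.9]], (2.0, 1.0, 1.57)),
    ([[0.7, 0.7]], (4.0, 3.0, 0.0)),
]
top_poses = retrieve_top_n([0.0, 1.0], database, n=2)
print(assignment, top_poses)
```

Because the image-side slot features do not depend on the query, they can be computed once after exploration and reused for every query, which is the pre-computation the caption mentions. The returned poses are what a navigation stack would receive as candidate goals.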