Leveraging Unknown Objects to Construct Labeled-Unlabeled Meta-Relationships for Zero-Shot Object Navigation

Yanwei Zheng, Changrui Li, Chuanlin Lan, Yaling Li, Xiao Zhang, Yifei Zou, Dongxiao Yu, Zhipeng Cai

TL;DR

This work tackles zero-shot object navigation by introducing unlabeled objects into training and proposing the Label-Wise Meta-Correlation Module (LWMCM) to exploit relationships between labeled and unlabeled targets. The framework integrates a Target Feature Generator, Unlabeled Object Identifier, Meta Contrastive Feature Modifier, and Meta Object-Graph Learner to produce a relationships embedding that, together with observation and target embeddings, guides a learned navigation policy. Empirical results on AI2THOR and RoboTHOR demonstrate substantial gains in unknown/unseen object navigation, with some trade-offs on known targets due to cross-domain knowledge transfer. Overall, the approach advances zero-shot perception and contextual reasoning in embodied navigation, enabling more robust handling of objects not seen during training and improving generalization to novel scenes.

Abstract

Zero-shot object navigation (ZSON) addresses the situation where an agent navigates to an unseen object that is not present in the training set. Previous works mainly train the agent using seen objects with known labels and ignore seen objects without labels. In this paper, we introduce seen objects without labels, herein termed "unknown objects", into the training procedure to enrich the agent's knowledge base with distinguishable but previously overlooked information. Furthermore, we propose the label-wise meta-correlation module (LWMCM) to harness relationships among objects with and without labels and obtain enhanced object information. Specifically, we propose the target feature generator (TFG) to generate the feature representation of the unlabeled target objects. Subsequently, the unlabeled object identifier (UOI) module assesses whether the unlabeled target object appears in the current observation frame captured by the camera and produces an adapted target feature representation specific to the observed context. In the meta contrastive feature modifier (MCFM), the target features are modified by moving them toward the features of objects within the observation frame and away from the features of unobserved objects. Finally, the meta object-graph learner (MOGL) module computes the relationships among objects based on these features. Experiments conducted on the AI2THOR and RoboTHOR platforms demonstrate the effectiveness of our proposed method.
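
The contrastive idea behind MCFM, pulling the target features toward observed-object features and pushing them away from unobserved ones, can be illustrated with a small sketch. The abstract does not give the exact objective, so everything here is an assumption made for illustration: cosine similarity as the similarity measure, an InfoNCE-style loss, the temperature `tau`, and the function names `cosine` and `mcfm_loss`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def mcfm_loss(f_t, observed, unobserved, tau=0.1):
    """InfoNCE-style loss (assumed form, not taken from the paper).

    The value is small when f_t is similar to the features of objects in the
    observation frame and dissimilar to the features of unobserved objects.
    """
    if not observed:
        raise ValueError("at least one observed feature is required")
    pos = [math.exp(cosine(f_t, f) / tau) for f in observed]
    neg = [math.exp(cosine(f_t, f) / tau) for f in unobserved]
    denom = sum(pos) + sum(neg)
    # Average negative log-probability of the observed (positive) features.
    return -sum(math.log(p / denom) for p in pos) / len(pos)


# Toy 2-D features: one observed object, one unobserved object.
observed = [[0.9, 0.1]]
unobserved = [[0.0, 1.0]]
aligned = mcfm_loss([1.0, 0.0], observed, unobserved)     # target near the observed object
misaligned = mcfm_loss([0.0, 1.0], observed, unobserved)  # target near the unobserved object
```

In this toy setup `aligned` is smaller than `misaligned`, so minimizing such a loss with respect to the target features would move them toward the objects that are actually in view.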

Paper Structure

This paper contains 26 sections, 12 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: (a) Details of the split of target objects. (b) In the scenario on the left, the agent is trained solely on the limited information from known objects and ultimately navigates to the wrong object. In the scenario on the right, unknown objects are introduced into the training phase to give the agent more information and explicit guidance.
  • Figure 2: Model Overview: $TI$: target indicator, $f_o$: observation features, $f_k$: known object features from DETR, $g_i$: generative images of unlabeled targets, $g_t$: generative features, $f_t$: intermediate features of UOI, $f_t'$: output of MCFM, $z_o$: observation embedding, $z_t$: target embedding, $z_r$: relationships embedding. Our LWMCM network consists of four parts that produce object features: TFG, UOI, MCFM, and MOGL. The joint features of the target embedding, observation embedding, and relationships embedding are then fed into an LSTM network to predict the next action.
  • Figure 3: Details of the unlabeled object identifier (UOI). The agent uses UOI to identify whether unknown or unseen objects exist in the current field of view based on the target generative features $g_t$ and the observation features $f_o$.
  • Figure 4: Details of the meta contrastive feature modifier (MCFM). The intermediate features $f_t$ are brought closer to the features of objects that co-occur with the target in the observation frame and pushed away from the features of objects that are not present in the frame, using the function $S$ and $l_{mcfm}$.
  • Figure 5: Details of the meta object-graph learner (MOGL). The known object features from DETR and the modified $f_t'$ from MCFM jointly serve as the node features of the Object-Graph. The CCA-SSG method is then used to capture more informative features among these node features.
  • ...and 2 more figures
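
The last step in the Figure 2 caption, where the observation, target, and relationships embeddings are joined and passed to an LSTM that predicts the next action, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the embedding sizes, the hidden size, the number of actions, the random weights, and the helper names (`init_params`, `lstm_step`, `policy_step`) are all hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def lstm_step(x, h, c, W, U, b):
    """One standard LSTM cell step; gates are stacked in the order i, f, g, o."""
    n = len(h)
    pre = [wx + uh + bi for wx, uh, bi in zip(matvec(W, x), matvec(U, h), b)]
    i = [sigmoid(v) for v in pre[0:n]]
    f = [sigmoid(v) for v in pre[n:2 * n]]
    g = [math.tanh(v) for v in pre[2 * n:3 * n]]
    o = [sigmoid(v) for v in pre[3 * n:4 * n]]
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def init_params(in_dim, hidden, n_actions, seed=0):
    """Random weights, for illustration only (hypothetical sizes)."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {"W": rand(4 * hidden, in_dim), "U": rand(4 * hidden, hidden),
            "b": [0.0] * (4 * hidden), "W_pi": rand(n_actions, hidden)}


def policy_step(z_o, z_t, z_r, h, c, params):
    """Concatenate the three embeddings, run one LSTM step, return action probabilities."""
    x = z_o + z_t + z_r  # joint features (list concatenation)
    h, c = lstm_step(x, h, c, params["W"], params["U"], params["b"])
    logits = matvec(params["W_pi"], h)
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps], h, c


# Toy usage: 2-D embeddings, hidden size 4, and an assumed set of 6 actions.
params = init_params(in_dim=6, hidden=4, n_actions=6)
probs, h, c = policy_step([0.1, 0.2], [0.3, 0.4], [0.5, 0.6],
                          [0.0] * 4, [0.0] * 4, params)
```

Here `probs` is a distribution over the hypothetical action set, and `h` and `c` would be carried over to the next time step so the policy can use its navigation history.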