Visual Semantic Navigation using Scene Priors

Wei Yang, Xiaolong Wang, Ali Farhadi, Abhinav Gupta, Roozbeh Mottaghi

TL;DR

The paper addresses robust visual navigation to target object categories in unseen scenes by grounding decisions in semantic priors. It proposes a Graph Convolutional Network that operates on a knowledge graph constructed from Visual Genome, encoding object relationships, and integrates the resulting semantic vector into a deep reinforcement learning policy (A3C) for navigation. Key contributions include: (1) coupling a knowledge-graph representation with RL to encode semantic priors, (2) showing improved navigation performance and generalization to novel objects and scenes, and (3) demonstrating the approach with a scalable graph (|V|=53) and modest computation overhead. The work advances practical scene understanding by enabling agents to leverage functional/semantic priors to search efficiently in unfamiliar environments.

Abstract

How do humans navigate to target objects in novel scenes? Do we use the semantic/functional priors we have built over years to efficiently search and navigate? For example, to search for mugs, we search cabinets near the coffee machine and for fruits we try the fridge. In this work, we focus on incorporating semantic priors in the task of semantic navigation. We propose to use Graph Convolutional Networks for incorporating the prior knowledge into a deep reinforcement learning framework. The agent uses the features from the knowledge graph to predict the actions. For evaluation, we use the AI2-THOR framework. Our experiments show how semantic knowledge improves performance significantly. More importantly, we show improvement in generalization to unseen scenes and/or objects. The supplementary video can be accessed at the following link: https://youtu.be/otKjuO805dE .

Paper Structure

This paper contains 18 sections, 2 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Our goal is to use scene priors to improve navigation in unseen scenes and towards novel objects. (a) There is no mug in the field of view of the agent, but the likely location for finding a mug is the cabinet near the coffee machine. (b) The agent has not seen a mango before, but it infers that the most likely location for finding a mango is the fridge since similar objects such as apple appear there as well. The most likely locations are shown with the orange box.
  • Figure 2: Overview of the architecture. Our model incorporates semantic knowledge into semantic navigation. Specifically, we learn a policy network that decides on an action based on the visual features of the current state, the semantic target category feature, and the features extracted from the knowledge graph. We extract features from the parts of the knowledge graph that are activated.
  • Figure 3: Scene priors. We extract relationships between objects from the Visual Genome dataset (Krishna et al., 2017). The relationships for two example object categories are illustrated.
  • Figure 4: Graph Convolutional Networks. Each node denotes an object category and is initialized based on the current state (image) and the word vector. We use three layers of GCN to perform information propagation. The first two layers output 1024-d latent features, and the last layer generates a single value for each node, which results in a $|V|$-dimensional semantic knowledge vector that is passed to the policy model.
  • Figure 5: Learning curves. The top row shows success rate and the bottom row shows SPL.
  • ...and 1 more figure
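
The three-layer GCN described in the Figure 4 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the commonly used symmetric normalization of the adjacency matrix with self-loops, ReLU after the first two layers, and no activation on the last layer; none of these choices is stated on this page. The input feature size (IN_DIM), the random graph, and the random weights are hypothetical placeholders standing in for the Visual Genome graph, the image and word-vector node features, and the learned parameters. Only the layer widths (1024, 1024, 1) and the |V|-dimensional output come from the caption, and |V| = 53 is the value quoted in the TL;DR.

```python
import numpy as np

NUM_NODES = 53     # |V|, as quoted in the TL;DR
IN_DIM = 16        # hypothetical size of the per-node input feature
HIDDEN_DIM = 1024  # width of the first two GCN layers (Figure 4 caption)


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}; this normalization is an assumption."""
    a_hat = adj + np.eye(adj.shape[0])  # self-loops keep every degree >= 1
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_semantic_vector(adj, node_feats, weights):
    """Propagate node features through the GCN layers; one scalar per node."""
    a_norm = normalize_adjacency(adj)
    h = node_feats
    for i, w in enumerate(weights):
        h = a_norm @ h @ w              # aggregate neighbours, then project
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)      # ReLU on hidden layers (assumed)
    return h.squeeze(-1)                # shape (|V|,)


rng = np.random.default_rng(0)

# Placeholder undirected graph standing in for the object-relationship graph.
upper = np.triu(rng.random((NUM_NODES, NUM_NODES)) < 0.1, k=1)
adj = (upper | upper.T).astype(float)

# Placeholder node features and randomly initialized (untrained) weights.
node_feats = rng.standard_normal((NUM_NODES, IN_DIM))
dims = [IN_DIM, HIDDEN_DIM, HIDDEN_DIM, 1]
weights = [rng.standard_normal((a, b)) * 0.05 for a, b in zip(dims[:-1], dims[1:])]

semantic_vector = gcn_semantic_vector(adj, node_feats, weights)
print(semantic_vector.shape)  # (53,)
```

In the pipeline described in the Figure 2 caption, a vector like semantic_vector would be combined with the visual features and the target category feature before the policy network chooses an action; how those inputs are fused is not specified here, so the sketch stops at the GCN output.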