Value Iteration Networks

Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, Pieter Abbeel

TL;DR

This paper addresses the gap between reactive neural policies and explicit planning in sequential decision tasks. It introduces the Value Iteration Network (VIN), a differentiable module that implements an approximate value-iteration planner inside a CNN and can be trained end-to-end. The authors demonstrate VIN-based policies across grid-world navigation, Mars terrain navigation, continuous control, and WebNav, showing improved generalization to unseen task instances. They discuss extensions such as hierarchical VI networks (HVIN) to speed planning and combine multiple planning computations. The work provides a general planning primitive that can be integrated with perception and control in RL/IL systems.

Abstract

We introduce the value iteration network (VIN): a fully differentiable neural network with a 'planning module' embedded within. VINs can learn to plan, and are suitable for predicting outcomes that involve planning-based reasoning, such as policies for reinforcement learning. Key to our approach is a novel differentiable approximation of the value-iteration algorithm, which can be represented as a convolutional neural network, and trained end-to-end using standard backpropagation. We evaluate VIN-based policies on discrete and continuous path-planning domains, and on a natural-language-based search task. We show that by learning an explicit planning computation, VIN policies generalize better to new, unseen domains.
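
To illustrate why value iteration can be written as a convolution-like computation, here is a minimal NumPy sketch under simplifying assumptions; it is not the authors' implementation. It assumes a deterministic 4-connected grid, a hand-specified reward map, and fixed one-cell shifts in place of the learned convolution kernels a VIN would train. The names `ACTIONS`, `shift`, and `vi_module` are hypothetical and chosen for this example only. Each round builds one Q "channel" per action from the reward and the shifted value map, then takes a max over the action channels.

```python
import numpy as np

# Hypothetical 4-connected moves as (d_row, d_col): up, down, left, right.
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def shift(v, dr, dc, fill):
    """Return an array whose entry (i, j) is v[i + dr, j + dc], or `fill` off-grid."""
    h, w = v.shape
    padded = np.full((h + 2, w + 2), fill, dtype=float)
    padded[1:-1, 1:-1] = v
    return padded[1 + dr : 1 + dr + h, 1 + dc : 1 + dc + w]


def vi_module(reward, gamma=0.9, k=20):
    """Run k rounds of Q_a = reward + gamma * V(next cell under a); V = max_a Q_a.

    Assumption: moves are deterministic, and leaving the grid is forbidden
    by giving off-grid cells a value of -inf.
    """
    v = np.zeros_like(reward, dtype=float)
    q = np.zeros((len(ACTIONS),) + reward.shape)
    for _ in range(k):
        # One Q channel per action, analogous to the output channels of a conv layer.
        q = np.stack(
            [reward + gamma * shift(v, dr, dc, fill=-np.inf) for dr, dc in ACTIONS]
        )
        # Max over the action channel gives the next value map.
        v = q.max(axis=0)
    return v, q


# Toy 5x5 grid with a single rewarding goal cell in the bottom-right corner.
reward = np.zeros((5, 5))
reward[4, 4] = 1.0
v, q = vi_module(reward)

# Greedy action index per cell (indices refer to ACTIONS).
greedy = q.argmax(axis=0)
```

In this toy setting the value map peaks at the goal and decays with distance from it, so following `greedy` from any cell moves toward the goal. In a VIN, the reward map and the per-action kernels would instead be learned end-to-end through backpropagation, as the abstract describes.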

Paper Structure

This paper contains 23 sections, 1 equation, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Two instances of a grid-world domain. Task is to move to the goal between the obstacles.
  • Figure 2: Planning-based NN models. Left: a general policy representation that adds value function features from a planner to a reactive policy. Right: VI module -- a CNN representation of VI algorithm.
  • Figure 3: Grid-world domains (best viewed in color). A,B: Two random instances of the $28 \times 28$ synthetic gridworld, with the VIN-predicted trajectories and ground-truth shortest paths between random start and goal positions. C: An image of the Mars domain, with points of elevation sharper than $10^{\circ}$ colored in red. These points were calculated from a matching image of elevation data (not shown), and were not available to the learning algorithm. Note the difficulty of distinguishing between obstacles and non-obstacles. D: The VIN-predicted (purple line with cross markers), and the shortest-path ground truth (blue line) trajectories between random start and goal positions.
  • Figure 4: Continuous control domain. Top: average distance to goal on training and test domains for VIN and CNN policies. Bottom: trajectories predicted by VIN and CNN on test domains.
  • Figure 5: Visualization of learned reward and value function. Left: a sample domain. Center: learned reward $f_R$ for this domain. Right: resulting value function (in VI block) for this domain.
  • ...and 2 more figures