Diverse Policies Recovering via Pointwise Mutual Information Weighted Imitation Learning

Hanlin Yang, Jian Yao, Weiming Liu, Qing Wang, Hanmin Qin, Hansheng Kong, Kirk Tang, Jiechao Xiong, Chao Yu, Kai Li, Junliang Xing, Hongwu Chen, Juchao Zhuo, Qiang Fu, Yang Wei, Haobo Fu

TL;DR

This paper enhances the vanilla behavioral cloning learning objective after inferring or assigning a latent style for a trajectory, and incorporates a weighting mechanism based on pointwise mutual information that reflects the significance of each state-action pair's contribution to learning the style.

Abstract

Recovering a spectrum of diverse policies from a set of expert trajectories is an important research topic in imitation learning. After determining a latent style for a trajectory, previous methods for recovering diverse policies usually employ a vanilla behavioral cloning learning objective conditioned on the latent style, treating each state-action pair in the trajectory with equal importance. Based on the observation that, in many scenarios, behavioral styles are highly relevant to only a subset of state-action pairs, this paper presents a new principled method for diverse policy recovery. In particular, after inferring or assigning a latent style for a trajectory, we enhance vanilla behavioral cloning by incorporating a weighting mechanism based on pointwise mutual information. This additional weighting reflects the significance of each state-action pair's contribution to learning the style, thus allowing our method to focus on the state-action pairs most representative of that style. We provide theoretical justifications for our new objective, and extensive empirical evaluations confirm the effectiveness of our method in recovering diverse policies from expert data.
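
The weighting idea in the abstract can be illustrated with a small sketch. It assumes the per-sample weight is the exponentiated pointwise mutual information, $p(z|s,a)/p(z)$; this page does not state the exact form, so treat that choice, the function names `pmi_weight` and `weighted_bc_loss`, and the toy numbers as hypothetical. How the posterior $p(z|s,a)$ is estimated is also left open here.

```python
import math

def pmi_weight(p_z_given_sa, p_z):
    """Assumed weight: exp(PMI(z; s, a)) = p(z | s, a) / p(z)."""
    return p_z_given_sa / p_z

def weighted_bc_loss(log_pi, weights):
    """Weighted negative log-likelihood of expert actions under a
    style-conditioned policy; weights of 1.0 give vanilla behavioral cloning."""
    return -sum(w * lp for w, lp in zip(weights, log_pi)) / len(log_pi)

# Toy batch of three state-action pairs from one trajectory with style z.
log_pi = [math.log(p) for p in (0.5, 0.25, 0.8)]  # log pi(a | s, z), made-up values
posterior = [0.9, 0.5, 0.1]                       # p(z | s, a), made-up values
prior = 0.5                                       # p(z), made-up value

weights = [pmi_weight(p, prior) for p in posterior]
pmi_loss = weighted_bc_loss(log_pi, weights)
vanilla_loss = weighted_bc_loss(log_pi, [1.0] * len(log_pi))
```

Under this assumption, a pair whose posterior exceeds the prior (the first one) is up-weighted, a pair that carries no style information (the second) keeps weight 1, and a pair that is atypical for the style (the third) is down-weighted.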

Paper Structure

This paper contains 25 sections, 2 theorems, 19 equations, 6 figures, 5 tables, 1 algorithm.

Key Result

Proposition 1

(a). When the mutual information $I(Z;S,A)$ equals $0$, it indicates that there is no distinction in the trajectory style corresponding to all the state-action pairs. In this case, the BC-PMI objective degenerates to the vanilla behavioral cloning objective.
(b). When the conditional entropy $H(Z|S,A)$ equals $0$, it indicates that there is a significant distinction in the trajectory style corre
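
The two regimes named in Proposition 1 can be checked numerically on toy distributions. The sketch below assumes, as an illustration only, that the BC-PMI weight is the ratio $p(z|s,a)/p(z)$; the helper `pmi_weights_and_mi` and both joint tables are hypothetical. It shows only how the weights behave in each regime and makes no claim about the truncated statement of case (b).

```python
import math

def pmi_weights_and_mi(joint):
    """joint[z][x] = p(z, x), where x indexes state-action pairs.
    Returns the ratios p(z | x) / p(z) and the mutual information I(Z; X) in nats."""
    p_z = [sum(row) for row in joint]
    p_x = [sum(col) for col in zip(*joint)]
    weights = [[joint[z][x] / (p_z[z] * p_x[x]) for x in range(len(p_x))]
               for z in range(len(p_z))]
    mi = sum(joint[z][x] * math.log(weights[z][x])
             for z in range(len(p_z)) for x in range(len(p_x))
             if joint[z][x] > 0)
    return weights, mi

# I(Z; S, A) = 0: style is independent of the state-action pair.
independent = [[0.5 * q for q in (0.2, 0.3, 0.5)] for _ in range(2)]
w_indep, mi_indep = pmi_weights_and_mi(independent)

# H(Z | S, A) = 0: every state-action pair occurs under exactly one style.
deterministic = [[0.25, 0.25, 0.0, 0.0],
                 [0.0, 0.0, 0.25, 0.25]]
w_det, mi_det = pmi_weights_and_mi(deterministic)
```

In the independent table every ratio is 1, so a weighted objective of this form coincides with the unweighted one. In the deterministic table each ratio is either $1/p(z)$ (here 2) or 0, so each state-action pair contributes to exactly one style.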

Figures (6)

  • Figure 1: Visualized comparison of trajectories generated by different policies.
  • Figure 2: MI between state-action pairs and styles. FR: Fire Rate style; AR: Movement Area style; RG: Movement Range style.
  • Figure 3: The PMI weight values related to style for each frame along a trajectory, with the corresponding style being the bottom-left area style (green color) in the game frame. The agent in the game frame is indicated by a white circle, and the corresponding actions are indicated by white arrows. No arrow indicates that the action is NOOP.
  • Figure 4: Visualization of different destination styles.
  • Figure 5: Visualization of different curvature styles.
  • ...and 1 more figure

Theorems & Definitions (3)

  • Proposition 1
  • proof
  • Lemma 1: Gibbs' inequality