Student-Informed Teacher Training

Nico Messikommer, Jiaxu Xing, Elie Aljalbout, Davide Scaramuzza

TL;DR

The paper addresses the problem of teacher-student asymmetry in privileged imitation learning, where a teacher armed with privileged information must be imitated by a student with partial observability. It introduces a joint-training framework that optimizes the teacher not only for task reward but also for imitability, by adding a penalty on states with large teacher-student action mismatch and by aligning the teacher to the student via a KL-divergence term, using a proxy student and a shared action decoder. The approach is instantiated within a PPO-based training loop and includes three phases: roll-out, policy update, and alignment, enabling simultaneous refinement of both policies. Empirical results across maze navigation, vision-based quadrotor obstacle avoidance, and vision-based manipulation demonstrate that the method substantially improves student performance and perception-aware behavior, reducing the teacher-student performance gap and enhancing real-world applicability of privileged imitation learning.

Abstract

Imitation learning with a privileged teacher has proven effective for learning complex control behaviors from high-dimensional inputs, such as images. In this framework, a teacher is trained with privileged task information, while a student tries to predict the actions of the teacher from more limited observations. For example, in a robot navigation task, the teacher might have access to distances to nearby obstacles, while the student only receives visual observations of the scene. However, privileged imitation learning faces a key challenge: the student might be unable to imitate the teacher's behavior due to partial observability. This problem arises because the teacher is trained without considering whether the student is capable of imitating the learned behavior. To address this teacher-student asymmetry, we propose a framework for joint training of the teacher and student policies, encouraging the teacher to learn behaviors that can be imitated by the student despite the latter's limited access to information and its partial observability. Based on the performance bound in imitation learning, we add (i) the approximated action difference between teacher and student as a penalty term to the reward function of the teacher, and (ii) a supervised teacher-student alignment step. We motivate our method with a maze navigation task and demonstrate its effectiveness on complex vision-based quadrotor flight and manipulation tasks.
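As a rough illustration of item (i), the sketch below subtracts a KL-divergence penalty between the teacher's and the proxy student's action distributions from the task reward. It is a minimal sketch, not the authors' implementation: the diagonal-Gaussian action model, the direction of the KL term, the function names `gaussian_kl` and `shaped_reward`, and the weight `penalty_weight` are all assumptions made for illustration.

```python
import math


def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(p || q) between two diagonal Gaussians given as lists of means and stds."""
    kl = 0.0
    for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q):
        kl += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return kl


def shaped_reward(task_reward, teacher_dist, proxy_dist, penalty_weight=0.1):
    """Task reward minus a weighted teacher/proxy-student mismatch penalty.

    teacher_dist and proxy_dist are (means, stds) tuples (assumed action model).
    The KL direction KL(teacher || proxy) is an assumption of this sketch.
    """
    mu_t, std_t = teacher_dist
    mu_s, std_s = proxy_dist
    return task_reward - penalty_weight * gaussian_kl(mu_t, std_t, mu_s, std_s)


# Teacher and proxy student disagree by one unit on a single action dimension.
teacher = ([0.0], [1.0])
proxy = ([1.0], [1.0])
print(shaped_reward(1.0, teacher, proxy))
```

With identical distributions the penalty vanishes and the teacher receives the plain task reward; the larger the mismatch in a state, the less attractive that state becomes to the teacher.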

Paper Structure

This paper contains 18 sections, 9 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Method Overview. (a) We train three networks by freezing weights (grey box) and changing gradient flows (dashed arrow) in alternating phases. (b) In the roll-out phase, the KL-Divergence between the proxy student $\hat{F}_S$ and teacher $F_T$ is used as a penalty term. (c) In addition to the policy gradient, the teacher encoder is updated by backpropagating through the KL-Divergence between the action distributions of the teacher and the proxy student. (d) Using student observations, the proxy student is aligned to the student $F_S$ and the student to the teacher network.
  • Figure 1: Obstacle Avoidance Success Rates. The mean and standard deviation of the success rate for vision-based quadrotor flight obtained from three training runs.
  • Figure 2: The goal of the agent is to navigate from the start (grey point) to the goal (green cell). The environment consists of four types of cells: empty (white), lava (red), and path (blue). The teacher can see all cell types, while the student cannot distinguish between lava and path. A teacher trained without alignment finds its optimal path through the maze (a), which cannot be imitated by the student trained without alignment (e), Behavior Cloning (b), or DAgger (c). In contrast, a teacher trained with alignment navigates around the maze (d), which can be easily copied by the student (f).
  • Figure 3: Vision-Based Obstacle Avoidance. On the left, the teacher trajectory rollouts of the vision-based quadrotor obstacle avoidance tasks are visualized. Our approach results in a policy behavior where the quadrotor adjusts the camera's viewing direction to capture sufficient environmental information for the student policy.
  • Figure 3: Manipulation Success Rates. The mean and standard deviation of the success rate for the task of opening the drawer obtained from five training runs.
  • ...and 3 more figures