Driving Beyond Privilege: Distilling Dense-Reward Knowledge into Sparse-Reward Policies

Feeza Khan Khanzada, Jaerock Kwon

TL;DR

The paper addresses how to exploit dense, simulator-defined rewards to learn robust world models for autonomous driving while keeping the deployed policy optimized for sparse, deployment-aligned objectives. It introduces reward-privileged world-model distillation, a two-stage teacher-student framework: a teacher trained on a dense privileged reward shapes a latent dynamics model, and a student trained only on sparse reward is regularized to match the teacher's latent dynamics, without imitating the teacher's actions or value estimates. Across lane-following and overtaking tasks in CARLA, sparse-reward students outperform both dense-reward teachers and sparse-from-scratch baselines, particularly on unseen routes (about 23 percent higher lane-following success relative to the dense teacher, and up to 27x higher overtaking success), with comparable or better safety. The work offers a practical design principle: use privileged rewards to improve the learned dynamics model, but keep the final policy focused on deployment-relevant sparse metrics.

Abstract

We study how to exploit dense simulator-defined rewards in vision-based autonomous driving without inheriting their misalignment with deployment metrics. In realistic simulators such as CARLA, privileged state (e.g., lane geometry, infractions, time-to-collision) can be converted into dense rewards that stabilize and accelerate model-based reinforcement learning, but policies trained directly on these signals often overfit and fail to generalize when evaluated on sparse objectives such as route completion and collision-free overtaking. We propose reward-privileged world model distillation, a two-stage framework in which a teacher DreamerV3-style agent is first trained with a dense privileged reward, and only its latent dynamics are distilled into a student trained solely on sparse task rewards. Teacher and student share the same observation space (semantic bird's-eye-view images); privileged information enters only through the teacher's reward, and the student does not imitate the teacher's actions or value estimates. Instead, the student's world model is regularized to match the teacher's latent dynamics while its policy is learned from scratch on sparse success/failure signals. In CARLA lane-following and overtaking benchmarks, sparse-reward students outperform both dense-reward teachers and sparse-from-scratch baselines. On unseen lane-following routes, reward-privileged distillation improves success by about 23 percent relative to the dense teacher while maintaining comparable or better safety. On overtaking, students retain near-perfect performance on training routes and achieve up to a 27x improvement in success on unseen routes, with improved lane keeping. These results show that dense rewards can be leveraged to learn richer dynamics models while keeping the deployed policy optimized strictly for sparse, deployment-aligned objectives.
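The latent-matching idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact form of the distillation objective, so the mean-squared-error penalty, the `distill_weight` coefficient, and both function names below are assumptions chosen for illustration, not the authors' implementation. The teacher latents are treated as fixed targets, mirroring the frozen teacher world model.

```python
def latent_distillation_loss(student_latents, teacher_latents):
    """Mean squared error between student and teacher latent vectors.

    Both arguments are lists of equal-length lists of floats, one inner
    list per time step. The teacher latents are treated as constants
    (the teacher world model is frozen). MSE is an assumed choice here.
    """
    if len(student_latents) != len(teacher_latents):
        raise ValueError("student and teacher must cover the same time steps")
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_latents, teacher_latents):
        if len(s_vec) != len(t_vec):
            raise ValueError("latent dimensions must match")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("latents must be non-empty")
    return total / count


def student_world_model_loss(world_model_loss, student_latents,
                             teacher_latents, distill_weight=1.0):
    """Student objective: its own world-model loss plus a latent-matching term.

    `distill_weight` is a hypothetical hyperparameter that trades off the
    student's own modelling loss against agreement with the teacher.
    """
    penalty = latent_distillation_loss(student_latents, teacher_latents)
    return world_model_loss + distill_weight * penalty
```

In this sketch only the student's world model receives the extra term; the student's policy would still be trained purely on sparse rewards computed from the student's own latents, which is the separation the abstract emphasizes.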

Paper Structure

This paper contains 56 sections, 15 equations, 1 figure, 6 tables.

Figures (1)

  • Figure 1: Two-stage training pipeline. Left: Reward-privileged teacher. An actor-critic interacts with the CARLA simulator using actions $a_t$, receives images $o_t$ and dense reward $r_t^{\text{dense}}$, and stores transitions in a replay buffer. A world model is trained on this buffer and provides latent states $(h_t, z_t)$ to the actor-critic. Right: Reward-sparse student. A second actor-critic interacts with CARLA using only sparse reward $r_t^{\text{sparse}}$. The replay buffer is shared between a student world model and a frozen copy of the teacher world model; the student is regularized to match the teacher's latents, while its own latents drive policy learning.
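
To make the distinction between $r_t^{\text{dense}}$ and $r_t^{\text{sparse}}$ concrete, the sketch below contrasts the two kinds of signal. It is illustrative only: the `StepInfo` fields, the penalty weights, the time-to-collision threshold, and the reward magnitudes are all hypothetical, since this section does not give the paper's actual reward definitions. The point is only that the dense reward reads privileged simulator state at every step, while the sparse reward reacts only to terminal success or failure.

```python
from dataclasses import dataclass


@dataclass
class StepInfo:
    """Hypothetical per-step simulator readout (field names are assumptions)."""
    lane_offset_m: float        # privileged: lateral distance from lane center
    time_to_collision_s: float  # privileged: estimated time to collision
    collided: bool              # terminal failure event
    route_completed: bool       # terminal success event


def dense_reward(info, w_lane=1.0, w_ttc=1.0, ttc_threshold_s=2.0):
    """Dense shaping signal computed from privileged state at every step.

    Weights and threshold are illustrative placeholders.
    """
    lane_penalty = w_lane * abs(info.lane_offset_m)
    # Penalize only when time-to-collision drops below the threshold.
    ttc_penalty = w_ttc * max(0.0, ttc_threshold_s - info.time_to_collision_s)
    return -(lane_penalty + ttc_penalty)


def sparse_reward(info):
    """Sparse signal: nonzero only on collision or route completion."""
    if info.collided:
        return -1.0
    if info.route_completed:
        return 1.0
    return 0.0
```

For an ordinary mid-route step, `sparse_reward` returns zero regardless of how well the vehicle is driving, whereas `dense_reward` varies continuously with lane offset and proximity to other vehicles. In the pipeline of Figure 1, only the teacher would see a signal of the dense kind.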