Driving Beyond Privilege: Distilling Dense-Reward Knowledge into Sparse-Reward Policies
Feeza Khan Khanzada, Jaerock Kwon
TL;DR
The paper addresses how to exploit dense, simulator-defined rewards to learn robust world models for autonomous driving while keeping the deployed policy optimized for sparse, deployment-aligned objectives. It introduces reward-privileged world-model distillation, a two-stage teacher-student framework in which a dense-reward teacher shapes a latent dynamics model and a sparse-reward student is regularized to match the teacher's latent dynamics without imitating its actions or value estimates. On CARLA lane-following and overtaking tasks, the sparse-reward students outperform both dense teachers and sparse-from-scratch baselines, particularly on unseen routes: success improves by about 23 percent relative to the dense teacher on unseen lane-following routes and by up to 27x on unseen overtaking routes, with comparable or better safety. The work offers a practical design principle: use privileged rewards to improve the learned dynamics representation, but keep the final policy focused on deployment-relevant sparse metrics.
Abstract
We study how to exploit dense simulator-defined rewards in vision-based autonomous driving without inheriting their misalignment with deployment metrics. In realistic simulators such as CARLA, privileged state (e.g., lane geometry, infractions, time-to-collision) can be converted into dense rewards that stabilize and accelerate model-based reinforcement learning, but policies trained directly on these signals often overfit and fail to generalize when evaluated on sparse objectives such as route completion and collision-free overtaking. We propose reward-privileged world model distillation, a two-stage framework in which a teacher DreamerV3-style agent is first trained with a dense privileged reward, and only its latent dynamics are distilled into a student trained solely on sparse task rewards. Teacher and student share the same observation space (semantic bird's-eye-view images); privileged information enters only through the teacher's reward, and the student does not imitate the teacher's actions or value estimates. Instead, the student's world model is regularized to match the teacher's latent dynamics while its policy is learned from scratch on sparse success/failure signals. In CARLA lane-following and overtaking benchmarks, sparse-reward students outperform both dense-reward teachers and sparse-from-scratch baselines. On unseen lane-following routes, reward-privileged distillation improves success by about 23 percent relative to the dense teacher while maintaining comparable or better safety. On overtaking, students retain near-perfect performance on training routes and achieve up to a 27x improvement in success on unseen routes, with improved lane keeping. These results show that dense rewards can be leveraged to learn richer dynamics models while keeping the deployed policy optimized strictly for sparse, deployment-aligned objectives.
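The student objective described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. The abstract does not specify the distance used to match latents, the weighting of the distillation term, or the exact form of the sparse reward, so the mean-squared-error distance, the `distill_weight` parameter, the reward values, and all function names below are assumptions made for illustration.

```python
from typing import Sequence


def latent_distillation_loss(
    student_latents: Sequence[Sequence[float]],
    teacher_latents: Sequence[Sequence[float]],
) -> float:
    """Mean squared error between student and teacher latent vectors.

    Assumption: MSE is used as the matching distance; the teacher latents
    are treated as fixed targets (the teacher is not updated).
    """
    if len(student_latents) != len(teacher_latents):
        raise ValueError("student and teacher must cover the same time steps")
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_latents, teacher_latents):
        if len(s_vec) != len(t_vec):
            raise ValueError("latent vectors must have the same dimension")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("no latent values to compare")
    return total / count


def student_world_model_loss(
    own_loss: float,
    student_latents: Sequence[Sequence[float]],
    teacher_latents: Sequence[Sequence[float]],
    distill_weight: float = 1.0,  # hypothetical hyperparameter
) -> float:
    """Student world-model loss plus a weighted latent-matching regularizer."""
    distill = latent_distillation_loss(student_latents, teacher_latents)
    return own_loss + distill_weight * distill


def sparse_reward(route_completed: bool, collided: bool) -> float:
    """Hypothetical sparse task reward: -1 on collision, +1 on success, else 0.

    The student's policy sees only this kind of signal; no teacher actions
    or value estimates enter the policy objective.
    """
    if collided:
        return -1.0
    return 1.0 if route_completed else 0.0
```

The point of the split is that privileged information reaches the student only through the latent-matching term in the world-model loss, while the policy objective stays tied to the sparse success and failure signal.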
