Offline Reinforcement Learning using Human-Aligned Reward Labeling for Autonomous Emergency Braking in Occluded Pedestrian Crossing
Vinal Asodia, Zhenhua Feng, Saber Fallah
TL;DR
This work tackles the lack of reward labels in real-world offline reinforcement learning (Offline RL) for autonomous driving by introducing a human-aligned reward-labeling pipeline that uses semantic maps to compute an adaptive safety component $c_t$. The two-phase approach first generates reward labels from real data and semantic cues, then trains a Behavior Proximal Policy Optimisation (BPPO) agent with spatial attention derived from semantic maps for longitudinal control. Key contributions include the adaptive safety mechanism with three risk factors, the integration of semantic maps for both reward labeling and attention, and an evaluation on CARLA occluded-pedestrian scenarios with real-world A2D2 data, showing competitive performance and insights into human-aligned labeling discrepancies. The findings indicate that meaningful reward signals can be harvested from real datasets to enable end-to-end Offline RL, potentially reducing the sim-to-real gap and improving safety-conscious decision-making in autonomous vehicles.
Abstract
Effectively leveraging real-world driving datasets is crucial for improving the training of autonomous driving systems. While Offline Reinforcement Learning enables the training of autonomous vehicles using such data, most available datasets lack meaningful reward labels. Reward labeling is essential because it provides the feedback a learning algorithm needs to distinguish between desirable and undesirable behaviors, thereby improving policy performance. This paper presents a novel pipeline for generating human-aligned reward labels. The proposed approach addresses the absence of reward signals in real-world datasets by generating labels that reflect human judgment and safety considerations. The pipeline incorporates an adaptive safety component, activated by analyzing semantic segmentation maps, which allows the autonomous vehicle to prioritize safety over efficiency in potential collision scenarios. The pipeline is applied to an occluded pedestrian crossing scenario with varying levels of pedestrian traffic, using synthetic and simulation data. The results indicate that the generated reward labels closely match the simulation reward labels. A driving policy trained on these labels with Behavior Proximal Policy Optimisation performs competitively with the baselines. This demonstrates the effectiveness of the method in producing reliable, human-aligned reward signals, facilitating the training of autonomous driving systems through Reinforcement Learning outside of simulation environments and in alignment with human values.
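To make the idea of a safety component that overrides efficiency concrete, the following is a minimal illustrative sketch in Python, not the authors' formulation. It assumes a binary gate `c_t` in place of the paper's adaptive component, a hypothetical pedestrian class id, a hypothetical region of interest covering the lower-central part of the semantic map, and simple hypothetical efficiency and safety terms based on speed.

```python
import numpy as np

PEDESTRIAN_ID = 4  # hypothetical class id; depends on the segmentation palette


def safety_active(semantic_map, min_pixels=1):
    """Return True if pedestrian pixels appear in an assumed region ahead of the vehicle."""
    h, w = semantic_map.shape
    # Assumed region of interest: lower half, central half of the image.
    roi = semantic_map[h // 2:, w // 4: 3 * w // 4]
    return int(np.count_nonzero(roi == PEDESTRIAN_ID)) >= min_pixels


def label_reward(semantic_map, speed, target_speed=8.0):
    """Assign a reward label to one logged frame (illustrative terms only)."""
    # Binary stand-in for the adaptive safety component c_t.
    c_t = 1.0 if safety_active(semantic_map) else 0.0
    # Efficiency term: highest when driving at the target speed.
    efficiency = 1.0 - abs(speed - target_speed) / target_speed
    # Safety term: penalises speed while a pedestrian is in the region.
    safety = -speed / target_speed
    return (1.0 - c_t) * efficiency + c_t * safety


clear_road = np.zeros((8, 8), dtype=int)
pedestrian_ahead = clear_road.copy()
pedestrian_ahead[6, 4] = PEDESTRIAN_ID

print(label_reward(clear_road, speed=8.0))        # efficiency dominates
print(label_reward(pedestrian_ahead, speed=8.0))  # safety dominates
```

With these assumed terms, a frame with a clear road rewards driving at the target speed, while the same speed with a pedestrian in the region is penalised and braking to a stop is preferred. A graded `c_t` built from several risk factors, as the paper describes, would replace the binary gate used here.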
