Offline Reinforcement Learning using Human-Aligned Reward Labeling for Autonomous Emergency Braking in Occluded Pedestrian Crossing

Vinal Asodia, Zhenhua Feng, Saber Fallah

TL;DR

This work tackles the lack of reward labels in real-world offline reinforcement learning for autonomous driving by introducing a human-aligned reward-labeling pipeline that uses semantic maps to compute an adaptive safety component $c_t$. The two-phase approach first generates reward labels from real data and semantic cues, then trains a BPPO agent with spatial attention derived from semantic maps for longitudinal control. Key contributions include the adaptive safety mechanism with three risk factors, the integration of semantic maps for both reward labeling and attention, and a thorough evaluation on CARLA occluded-pedestrian scenarios with real-world A2D2 data, showing competitive performance and insights into human-aligned labeling discrepancies. The findings demonstrate that meaningful reward signals can be harvested from real datasets to enable end-to-end Offline RL, potentially reducing the sim-to-real gap and improving safety-conscious decision-making in autonomous vehicles.

Abstract

Effective leveraging of real-world driving datasets is crucial for enhancing the training of autonomous driving systems. While Offline Reinforcement Learning enables the training of autonomous vehicles using such data, most available datasets lack meaningful reward labels. Reward labeling is essential as it provides feedback for the learning algorithm to distinguish between desirable and undesirable behaviors, thereby improving policy performance. This paper presents a novel pipeline for generating human-aligned reward labels. The proposed approach addresses the challenge of absent reward signals in real-world datasets by generating labels that reflect human judgment and safety considerations. The pipeline incorporates an adaptive safety component, activated by analyzing semantic segmentation maps, allowing the autonomous vehicle to prioritize safety over efficiency in potential collision scenarios. The proposed pipeline is applied to an occluded pedestrian crossing scenario with varying levels of pedestrian traffic, using synthetic and simulation data. The results indicate that the generated reward labels closely match the simulation reward labels. When used to train the driving policy using Behavior Proximal Policy Optimisation, the results are competitive with other baselines. This demonstrates the effectiveness of our method in producing reliable and human-aligned reward signals, facilitating the training of autonomous driving systems through Reinforcement Learning outside of simulation environments and in alignment with human values.

Paper Structure

This paper contains 30 sections, 6 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: An overview of the proposed two-phase pipeline. Phase 1 generates reward labels for the offline dataset: the dashboard camera images are passed through a ResNet18 UNet model to generate semantic maps, which are used alongside vehicle sensor data to determine the adaptive safety component ($c_t$) of the reward function. The reward label is then generated from $c_t$ and the vehicle sensor data. Phase 2 is based on our previous work [asodia2024human]: the camera images are passed through an autoencoder (the decoder, highlighted in red, is removed during inference) and a ResNet18 UNet model to generate semantic maps, which provide spatial attention over the latent features. The resulting features are then passed to the BPPO agent, which outputs the final longitudinal control.
  • Figure 2: An overview of the three risk factors used to determine the adaptive safety component $c_t$, with example dashboard camera images of each. The three risk factors are: 1) Pedestrian Presence, which is high if a pedestrian is on the road and low otherwise; 2) Crossing Presence, which is medium if the crossing is occluded and low if it is clear; and 3) Pedestrian History, which is medium if a pedestrian is detected and then suddenly disappears. The total risk is the sum of the risk factors, and this value is passed through a sigmoid function. If the result is above the threshold $\psi$, then $c_t = 1$; otherwise $c_t = 0$.
  • Figure 3: Occluded pedestrian crossing scenario setup within CARLA. The ego vehicle, highlighted in green, is spawned at one end of the road and must navigate through a crossing occluded by a vehicle, highlighted in black. The ego vehicle must yield for a crossing pedestrian, highlighted in red, to successfully reach the goal point at the other end of the road.
  • Figure 4: Plot showcasing the distribution of reward labels taken from the simulation platform, CARLA (in Blue), and the distribution of reward labels generated by the proposed pipeline (in Orange).
  • Figure 5: Graph depicting the mean reward over training for the BPPO Sim Reward setup (in Blue) and the BPPO Gen Reward setup (in Orange).
  • ...and 4 more figures
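
The thresholding rule described in the Figure 2 caption (sum the three risk factors, apply a sigmoid, compare against $\psi$) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The numeric values for the "low", "medium" and "high" levels, the default value of `psi`, the `SceneCues` container, and the choice of "low" for Pedestrian History when no pedestrian has vanished are all assumptions made for illustration, since the captions do not specify them.

```python
import math
from dataclasses import dataclass

# Assumed numeric risk levels; the caption only names them low, medium and high.
LOW, MEDIUM, HIGH = 0.0, 1.0, 2.0


@dataclass
class SceneCues:
    """Hypothetical per-frame cues, assumed to be extracted from the semantic map."""
    pedestrian_on_road: bool    # a pedestrian is currently on the road
    crossing_occluded: bool     # the crossing is hidden by another object
    pedestrian_vanished: bool   # a pedestrian was detected, then suddenly disappeared


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_safety_component(cues: SceneCues, psi: float = 0.8) -> int:
    """Return c_t = 1 if the squashed total risk exceeds psi, else 0.

    psi = 0.8 is an assumed threshold, not a value taken from the paper.
    """
    pedestrian_presence = HIGH if cues.pedestrian_on_road else LOW
    crossing_presence = MEDIUM if cues.crossing_occluded else LOW
    # Assumption: the history factor is low when no pedestrian has vanished.
    pedestrian_history = MEDIUM if cues.pedestrian_vanished else LOW

    # Total risk is the sum of the three factors, passed through a sigmoid.
    total_risk = pedestrian_presence + crossing_presence + pedestrian_history
    return 1 if sigmoid(total_risk) > psi else 0


if __name__ == "__main__":
    clear_road = SceneCues(False, False, False)
    occluded_after_sighting = SceneCues(False, True, True)
    print(adaptive_safety_component(clear_road))
    print(adaptive_safety_component(occluded_after_sighting))
```

Under these assumed values, an occluded crossing alone stays below the threshold, while an occluded crossing combined with a recently vanished pedestrian raises $c_t$ to 1. Different level values or a different $\psi$ would shift that boundary.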