Multi-Type Preference Learning: Empowering Preference-Based Reinforcement Learning with Equal Preferences

Ziang Liu; Junjie Xu; Xingjiao Wu; Jing Yang; Liang He

TL;DR

This paper proposes Multi-Type Preference Learning (MTPL), a novel PBRL method that learns from equal preferences while leveraging existing methods for learning from explicit preferences. The results indicate that learning from both equal and explicit preferences simultaneously enables the PBRL method to understand teacher feedback more comprehensively, thereby enhancing feedback efficiency.

Abstract

Preference-based reinforcement learning (PBRL) learns directly from the preferences of human teachers regarding agent behaviors, without needing meticulously designed reward functions. However, existing PBRL methods often learn primarily from explicit preferences, neglecting the possibility that teachers may choose equal preferences. This neglect may hinder the agent's understanding of the teacher's perspective on the task, leading to the loss of important information. To address this issue, we introduce the Equal Preference Learning Task, which optimizes the neural network by promoting similar reward predictions when the behaviors of two agents are labeled as equal preferences. Building on this task, we propose a novel PBRL method, Multi-Type Preference Learning (MTPL), which allows simultaneous learning from equal preferences while leveraging existing methods for learning from explicit preferences. To validate our approach, we design experiments applying MTPL to four existing state-of-the-art baselines across ten locomotion and robotic manipulation tasks in the DeepMind Control Suite. The experimental results indicate that simultaneous learning from both equal and explicit preferences enables the PBRL method to more comprehensively understand the feedback from teachers, thereby enhancing feedback efficiency. Project page: https://github.com/FeiCuiLengMMbb/paper_MTPL
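The idea of combining the two learning tasks can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the explicit-preference term uses the standard Bradley-Terry cross-entropy common in PBRL, while the equal-preference term is assumed here to be a squared difference between the predicted returns of the two segments. The function names, the `alpha_equal` weight (named after the $\alpha^{Equal}$ hyperparameter mentioned in the figure captions, though its exact role in the paper is an assumption), and the array shapes are all hypothetical.

```python
import numpy as np

def segment_returns(rewards):
    """Sum predicted per-step rewards over time: (batch, T) -> (batch,)."""
    return np.asarray(rewards, dtype=float).sum(axis=1)

def explicit_preference_loss(r0, r1, labels):
    """Bradley-Terry cross-entropy for explicit preferences.

    labels[i] is 1 if segment 1 is preferred, 0 if segment 0 is preferred.
    """
    s0, s1 = segment_returns(r0), segment_returns(r1)
    # log P(segment 1 preferred) = -log(1 + exp(s0 - s1)), computed stably.
    log_p1 = -np.logaddexp(0.0, s0 - s1)
    log_p0 = -np.logaddexp(0.0, s1 - s0)
    labels = np.asarray(labels, dtype=float)
    return float(np.mean(-(labels * log_p1 + (1.0 - labels) * log_p0)))

def equal_preference_loss(r0, r1):
    """Assumed form: penalize the gap between predicted returns of
    two segments that the teacher labeled as equally preferred."""
    s0, s1 = segment_returns(r0), segment_returns(r1)
    return float(np.mean((s0 - s1) ** 2))

def combined_loss(explicit_batch, equal_batch, alpha_equal=1.0):
    """Weighted sum of both terms; alpha_equal is a hypothetical weight."""
    r0, r1, labels = explicit_batch
    e0, e1 = equal_batch
    return (explicit_preference_loss(r0, r1, labels)
            + alpha_equal * equal_preference_loss(e0, e1))
```

In this sketch, pairs labeled as equal contribute zero loss only when the reward model assigns both segments the same predicted return, which mirrors the abstract's description of "promoting similar reward predictions". In practice the reward model would be a neural network trained by gradient descent on such a loss.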

Paper Structure (10 sections, 4 equations, 6 figures, 2 tables)

RELATED WORK
PRELIMINARIES
Multi-Type Preference Learning
Explicit Preference Learning Task
Equal Preference Learning Task
EXPERIMENTS
SETUPS
MAIN RESULTS
ABLATION STUDIES
CONCLUSIONS

Figures (6)

Figure 1: Illustration of MTPL. The agent interacts with the environment and simultaneously learns the reward function $\hat{r}_\psi$ from both equal and explicit preferences. State-action sequences are sampled by interacting with the environment, where rewards are labeled by $\hat{r}_\psi$, and transitions are sampled from the replay buffer to optimize the policy.
Figure 2: Examples of the behaviors of four agents in the $Point\_mass\_easy$ task. The objective is for the agent to guide the yellow ball to the central red area. The yellow dashed line represents the agent's path, while the green dot indicates the agent's final stopping position. TR denotes the score given by the true reward function of the simulation environment for the agent's actions, and LR denotes the score from the fitted reward function $\hat{r}_\psi$.
Figure 3: Comparison of Pearson Correlation Coefficients between Learned and True Reward Functions across 10 Tasks. This figure compares the Pearson correlation coefficients between the learned reward function outputs and the true reward function outputs for four baseline methods and the proposed method using MTPL. Each bar represents the average correlation coefficient over five independent runs, and the gray lines indicate the standard deviation.
Figure 4: Spearman Correlation Analysis of Performance Improvement and the Proportion of Equal Preference Feedback. This figure presents the Spearman correlation analysis between performance improvement and the proportion of equal preference feedback after applying MTPL to four baseline methods across 10 tasks. To display extreme values, the y-axis has been logarithmically scaled.
Figure 5: Hyperparameter Analysis of MTPL. Panels (a) and (b) analyze performance under different values of the parameter $\alpha^{Equal}$ on the tasks $Walker\_walk$ and $Pendulum\_swingup$, while panel (c) examines the impact of the SimTeacher parameter $\alpha$ on performance.
...and 1 more figure
