Personalized Reinforcement Learning with a Budget of Policies

Dmitry Ivanov, Omer Ben-Porat

TL;DR

This work addresses personalization under stringent regulatory constraints by introducing represented MDPs (r-MDPs), which constrain the number of deployable policies to a budget $k<n$. It proposes two deep RL algorithms—an EM-like hard-assignment method and a soft, end-to-end assignment approach—grounded in a factorized optimization that alternates between assigning agents to representatives and training representative policies, with convergence guarantees to a local optimum. The methods are evaluated on Resource Gathering and MuJoCo tasks, showing that meaningful personalization is achievable even with small policy budgets and that the approaches outperform clustering baselines and random assignments. Overall, the paper offers a practical framework to balance personalization benefits with regulatory review costs, enabling scalable deployment of personalized decision-making in high-stakes domains.

Abstract

Personalization in machine learning (ML) tailors models' decisions to the individual characteristics of users. While this approach has seen success in areas like recommender systems, its expansion into high-stakes fields such as healthcare and autonomous driving is hindered by the extensive regulatory approval processes involved. To address this challenge, we propose a novel framework termed represented Markov Decision Processes (r-MDPs) that is designed to balance the need for personalization with the regulatory constraints. In an r-MDP, we cater to a diverse user population, each with unique preferences, through interaction with a small set of representative policies. Our objective is twofold: efficiently match each user to an appropriate representative policy and simultaneously optimize these policies to maximize overall social welfare. We develop two deep reinforcement learning algorithms that efficiently solve r-MDPs. These algorithms draw inspiration from the principles of classic K-means clustering and are underpinned by robust theoretical foundations. Our empirical investigations, conducted across a variety of simulated environments, showcase the algorithms' ability to facilitate meaningful personalization even under constrained policy budgets. Furthermore, they demonstrate scalability, efficiently adapting to larger policy budgets.
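The K-means analogy can be made concrete with a toy sketch. This is not the paper's algorithm: it assumes each agent is described by a single scalar target, each representative "policy" is a single scalar, and an agent's utility is the negative squared distance between the two. The names `welfare` and `em_like` are hypothetical. In the actual framework, the policy step would be deep RL training rather than a closed-form mean.

```python
import random


def welfare(targets, reps, assign):
    """Utilitarian welfare: sum of agents' utilities under the assignment."""
    return sum(-(t - reps[a]) ** 2 for t, a in zip(targets, assign))


def em_like(targets, k, iters=20, seed=0):
    """Alternate hard assignments and representative updates (toy stand-in)."""
    rng = random.Random(seed)
    # Assumption: initialise the k representatives at randomly chosen agents.
    reps = rng.sample(targets, k)
    assign = [0] * len(targets)
    history = []
    for _ in range(iters):
        # Assignment step: each agent moves to the representative it values most.
        assign = [
            max(range(k), key=lambda j: -(t - reps[j]) ** 2) for t in targets
        ]
        # Policy step: each representative maximizes the summed utility of its
        # assigned agents. For squared-distance utility this is their mean.
        for j in range(k):
            members = [t for t, a in zip(targets, assign) if a == j]
            if members:  # an unused representative keeps its old value
                reps[j] = sum(members) / len(members)
        history.append(welfare(targets, reps, assign))
    return reps, assign, history
```

Calling `em_like([i / 24 for i in range(25)], k=5)` returns the five representative values, the assignment of each of the 25 toy agents, and the welfare after every iteration. In this toy setting neither step can lower the welfare, so `history` never decreases.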

Paper Structure

This paper contains 30 sections, 3 theorems, 21 equations, 5 figures, 1 algorithm.

Key Result

Theorem 1

Given an r-MDP, the EM-like algorithm converges to a local maximum of utilitarian social welfare.
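One way to read this statement is through the standard coordinate-ascent argument, sketched here in assumed notation that is not taken from the paper: $a(i) \in \{1,\dots,k\}$ is the representative assigned to agent $i$, $\pi_j$ is the $j$-th representative policy, and $V_i(\pi)$ is agent $i$'s expected return under $\pi$. Utilitarian social welfare is then the sum of the agents' values:

$$W(a, \pi) = \sum_{i=1}^{n} V_i\big(\pi_{a(i)}\big).$$

If the assignment step picks $a^{t+1}(i) \in \arg\max_{j} V_i(\pi_j^{t})$ and the policy step does not decrease welfare under the new assignment, then

$$W(a^{t+1}, \pi^{t+1}) \;\ge\; W(a^{t+1}, \pi^{t}) \;\ge\; W(a^{t}, \pi^{t}).$$

The welfare sequence is therefore non-decreasing, and it converges whenever returns are bounded. This sketch only shows convergence of the welfare values; the paper's own proof should be consulted for the precise conditions under which the limit is a local maximum.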

Figures (5)

  • Figure 1: Paths that representatives learn in Resource Gathering after being trained with our EM algorithm for different $k$ and $n=25$ ($0$-th random seed). The representatives divide the map such that 1) each tile is visited by some policy and 2) policies jointly minimize the average episode length.
  • Figure 2: Performance of our algorithms and the baselines in Resource Gathering for different $k$ and $n=25$. The black dashed line represents the optimum for $k=25$. For each $k$, all algorithms are trained for 1 million transitions per policy. For $k=1$, all algorithms reduce to solving an MDP with a single policy. Confidence intervals represent standard errors.
  • Figure 3: Performance of our algorithms and the baselines in MuJoCo environments. For each $k$, all algorithms are trained for 2 million transitions per policy. The number of agents is $n=1000$ for $k=50$ and $n=100$ for smaller $k$. For $k=1$, all algorithms reduce to solving an MDP with a single policy. Confidence intervals represent standard errors.
  • Figure 4: Histograms of agent assignments learned by our algorithms and the baselines for $n=100$, $k=5$ in HalfCheetah ($0$-th random seed). Each color denotes one of the five representatives, and bars of that color show the target velocities of the agents assigned to it. The expected behavior is a division of the agents' velocities into five intervals of similar size, one per representative. Histograms for other environments are reported in the Appendix.
  • Figure 5: Histograms of agent assignments learned by our algorithms and the baselines for $n=100$, $k=5$ in Ant, Hopper, and Walker2d ($0$-th random seed). Each color denotes one of the five representatives, and bars of that color show the target velocities of the agents assigned to it. The expected behavior is a division of the agents' velocities into five intervals of similar size, one per representative.

Theorems & Definitions (5)

  • Theorem 1
  • Lemma 1
  • proof
  • Theorem 2
  • proof