Social Interpretable Reinforcement Learning
Leonardo Lucio Custode, Giovanni Iacca
TL;DR
This work tackles the high training cost of interpretable reinforcement learning by introducing Social Interpretable RL (SIRL), a two-phase, socially inspired learning framework. In the collaborative phase, a population of decision tree (DT) policies acts in parallel on a shared environment and votes on the single action that is deployed, after which the leaves are updated via Q-learning. In the subsequent individual phase, each DT learns on its own environment to refine its performance. An outer loop uses Grammatical Evolution to optimize the DT structures. Empirical results on six Gymnasium benchmarks show that SIRL reduces the number of environment interactions by $43\%$ to $76\%$, accelerates convergence, and often matches or surpasses the baseline interpretable method (ELDT), while preserving interpretability. These findings indicate that socially guided learning can substantially raise sample efficiency and policy quality in interpretable RL, narrowing the gap with non-interpretable approaches while maintaining transparency.
Abstract
Reinforcement Learning (RL) bears the promise of being a game-changer in many applications. However, since most of the literature in the field is currently focused on opaque models, the use of RL in high-stakes scenarios, where interpretability is crucial, is still limited. Recently, some approaches to interpretable RL, e.g., based on Decision Trees, have been proposed, but one of the main limitations of these techniques is their training cost. To overcome this limitation, we propose a new method, called Social Interpretable RL (SIRL), that can substantially reduce the number of episodes needed for training. Our method mimics a social learning process, where each agent in a group learns to solve a given task based on both its own individual experience and the experience acquired together with its peers. Our approach is divided into the following two phases. (1) In the collaborative phase, all the agents in the population interact with a shared instance of the environment, where each agent observes the state and independently proposes an action. Then, voting is performed to choose the action that will actually be deployed in the environment. (2) In the individual phase, each agent refines its individual performance by interacting with its own instance of the environment. This mechanism lets the agents experience a larger number of episodes with little impact on the computational cost of the process. Our results (on 6 widely-known RL benchmarks) show that SIRL not only reduces the computational cost by between 43% and 76%, but also increases the convergence speed and, often, improves the quality of the solutions.
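To make the collaborative phase concrete, the following is a minimal Python sketch of one shared-environment step, not the authors' implementation. It assumes plurality voting with ties broken toward the smallest action index, greedy proposals, and that every agent applies a Q-learning update to its active leaf using the deployed action; none of these details are specified above. The names `LeafAgent`, `plurality_vote`, `collaborative_step`, and `env_step` are hypothetical, and the one-split "tree" is a stand-in for an evolved decision tree.

```python
from collections import Counter


def plurality_vote(proposals):
    """Return the most proposed action (assumed tie-break: smallest action index)."""
    counts = Counter(proposals)
    best = max(counts.values())
    return min(action for action, count in counts.items() if count == best)


class LeafAgent:
    """Hypothetical stand-in for a DT policy: one threshold split, Q-values at two leaves."""

    def __init__(self, threshold, n_actions):
        self.threshold = threshold
        self.q = [[0.0] * n_actions for _ in range(2)]

    def leaf(self, state):
        # Route the (scalar) state to the left or right leaf.
        return 0 if state < self.threshold else 1

    def propose(self, state):
        # Greedy proposal from the active leaf (exploration omitted for brevity).
        q_values = self.q[self.leaf(state)]
        return q_values.index(max(q_values))

    def update(self, state, action, reward, next_state, alpha=0.1, gamma=0.9):
        # Standard Q-learning update on the leaf that was active for `state`.
        leaf, next_leaf = self.leaf(state), self.leaf(next_state)
        target = reward + gamma * max(self.q[next_leaf])
        self.q[leaf][action] += alpha * (target - self.q[leaf][action])


def collaborative_step(agents, state, env_step):
    """One step on the shared environment: propose, vote, deploy, update all agents."""
    proposals = [agent.propose(state) for agent in agents]
    action = plurality_vote(proposals)
    next_state, reward = env_step(state, action)  # env_step is a hypothetical callable
    for agent in agents:
        # Assumption: every agent learns from the deployed action, even if it voted otherwise.
        agent.update(state, action, reward, next_state)
    return action, next_state, reward
```

In this sketch, a single environment transition produces one update per agent, which is one way to read the claim that agents experience more episodes at little extra computational cost.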
