Exploiting Expertise of Non-Expert and Diverse Agents in Social Bandit Learning: A Free Energy Approach

Erfan Mirzaei, Seyed Pooya Shariatpanahi, Alireza Tavakoli, Reshad Hosseini, Majid Nili Ahmadabadi

Abstract

Personalized AI-based services involve a population of individual reinforcement learning agents. However, most reinforcement learning algorithms focus on harnessing individual learning and fail to leverage the social learning capabilities commonly exhibited by humans and animals. Social learning integrates individual experience with observing others' behavior, presenting opportunities for improved learning outcomes. In this study, we focus on a social bandit learning scenario where a social agent observes other agents' actions without knowledge of their rewards. The agents independently pursue their own policy without explicit motivation to teach each other. We propose a free energy-based social bandit learning algorithm over the policy space, where the social agent evaluates others' expertise levels without resorting to any oracle or social norms. Accordingly, the social agent integrates its direct experiences in the environment and others' estimated policies. The theoretical convergence of our algorithm to the optimal policy is proven. Empirical evaluations validate the superiority of our social learning method over alternative approaches in various scenarios. Our algorithm strategically identifies the relevant agents, even in the presence of random or suboptimal agents, and skillfully exploits their behavioral information. In addition to societies including expert agents, in the presence of relevant but non-expert agents, our algorithm significantly enhances individual learning performance, where most related methods fail. Importantly, it also maintains logarithmic regret.

Paper Structure (23 sections, 17 equations, 9 figures, 1 algorithm)

Figures (9)

  • Figure 1: Social bandit learning problem setting
  • Figure 2: Informational flow diagram of the proposed method (SBL-FE)
  • Figure 3: Cumulative regret performance of three social learning agents (OUCB, TUCB, SBL-FE) along with UCB and TS as baseline methods in societies consisting of one social learner and one non-learner. The experiments were conducted over 200 and 2000 trials for a 10-armed Bernoulli bandit problem with an optimality gap of $\Delta = 0.2$. The zoomed-in view highlights that the TS method and our method perform similarly in some scenarios.
  • Figure 4: Cumulative regret performance of three social learning agents (OUCB, TUCB, SBL-FE) along with UCB and TS as baseline methods in societies consisting of one social learner and one individual learner. The experiments were conducted over 200 and 2000 trials for a 10-armed Bernoulli bandit problem with an optimality gap of $\Delta = 0.2$.
  • Figure 5: Per-trial free energy and selection probability of our social agent, SBL-FE, in different societal setups. 2000 trials were conducted for a 10-armed Bernoulli bandit problem with $\Delta = 0.2$.
  • ...and 4 more figures
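
The captions above describe a 10-armed Bernoulli bandit with an optimality gap of $\Delta = 0.2$, with UCB as one of the individual-learning baselines and cumulative regret as the metric. The sketch below illustrates only that baseline setting, using the standard UCB1 index; it is not the SBL-FE algorithm and not the authors' code. The arm means (0.7 for the best arm, 0.5 for the rest), the function names, and the seed are assumptions chosen for illustration.

```python
import math
import random


def make_arm_means(n_arms=10, best=0.7, gap=0.2):
    """Hypothetical arm means: one optimal arm, all others `gap` below it."""
    return [best] + [best - gap] * (n_arms - 1)


def run_ucb1(means, n_trials=2000, seed=0):
    """Run standard UCB1 on a Bernoulli bandit; return (cumulative regret, pull counts)."""
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms   # number of pulls per arm
    sums = [0.0] * n_arms   # total observed reward per arm
    best_mean = max(means)
    regret = 0.0
    cumulative_regret = []

    for t in range(1, n_trials + 1):
        if t <= n_arms:
            arm = t - 1  # initialization: pull every arm once
        else:
            # UCB1 index: empirical mean plus exploration bonus
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        # Pseudo-regret: expected reward lost relative to the optimal arm
        regret += best_mean - means[arm]
        cumulative_regret.append(regret)

    return cumulative_regret, counts


if __name__ == "__main__":
    means = make_arm_means()
    regret, counts = run_ucb1(means, n_trials=2000)
    print("final cumulative regret:", round(regret[-1], 2))
    print("pulls per arm:", counts)
```

A social learner in the paper's setting would additionally observe another agent's actions (but not its rewards); that part is deliberately left out here because the listing above does not specify it.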