Decentralized Blockchain-based Robust Multi-agent Multi-armed Bandit

Mengfan Xu, Diego Klabjan

TL;DR

The paper tackles the robust multi-agent multi-armed bandit (MAB) problem in a fully decentralized, blockchain-based setting with malicious participants. It introduces a two-phase framework (burn-in and learning) that combines UCB-like arm selection, blockchain-driven validator and commander selection, privacy via secure multi-party computation (MPC), and a novel cost mechanism that incentivizes participation and deters manipulation. The core contributions are a fully specified algorithmic framework, a Byzantine-resilient consensus protocol, a reputation-based validator selection system, and a rigorous regret analysis showing $O(\log T)$ regret under various adversarial configurations, in line with classical bounds in robust multi-agent MAB. The work shows that decentralization and security can be tied directly to learning optimality, with practical implications for secure online decision making in distributed systems and for future integration of mechanism design with learning.

Abstract

We study a robust multi-agent multi-armed bandit problem, i.e., one in the presence of malicious participants, where multiple participants are distributed on a fully decentralized blockchain and some of them may be malicious. The rewards of arms are homogeneous among the honest participants and follow time-invariant stochastic distributions; they are revealed to the participants only when certain conditions are met, which ensures that the coordination mechanism is secure enough. The objective of the coordination mechanism is to efficiently maximize the cumulative rewards gained by the honest participants. To this end, we are the first to incorporate advanced techniques from blockchains, as well as novel mechanisms, into such a cooperative decision-making framework to design optimal strategies for honest participants. This framework accommodates various malicious behaviors while maintaining security and participant privacy. More specifically, we select a pool of validators who communicate with all participants, design a new consensus mechanism based on digital signatures for these validators, invent a UCB-based strategy that requires less information from participants through secure multi-party computation, and design the chain-participant interaction and an incentive mechanism to encourage participation. Notably, we are the first to prove the theoretical regret of the proposed algorithm and claim its optimality. Unlike existing work that integrates blockchains with learning problems such as federated learning, which mainly establishes optimality via computational experiments, we demonstrate that the regret of honest participants is upper bounded by a quantity of order $\log{T}$ under certain assumptions. The regret bound is consistent with that of the multi-agent multi-armed bandit problem, both without malicious participants and with purely Byzantine attacks that do not affect the entire system.
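
To make the UCB-based ingredient concrete, the following is a minimal sketch of the textbook single-agent UCB1 rule on Bernoulli arms. It is not the paper's algorithm: it has no blockchain, validators, malicious participants, or secure multi-party computation, and the function name `ucb1`, the arm means, and the exploration bonus $\sqrt{2\ln t / n_i}$ are illustrative assumptions. It only shows the kind of index-based arm selection whose pseudo-regret grows slowly with the horizon.

```python
import math
import random


def ucb1(means, horizon, seed=0):
    """Run textbook UCB1 on Bernoulli arms (illustrative, single agent).

    means   -- assumed true success probability of each arm (hypothetical values)
    horizon -- number of rounds T
    Returns (pull counts per arm, cumulative pseudo-regret).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # number of pulls of each arm
    sums = [0.0] * k      # total observed reward of each arm
    best = max(means)
    regret = 0.0

    for t in range(1, horizon + 1):
        if t <= k:
            # Initialization: pull every arm once.
            arm = t - 1
        else:
            # Pick the arm with the largest upper confidence bound.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        # Pseudo-regret: gap between the best mean and the chosen arm's mean.
        regret += best - means[arm]

    return counts, regret


if __name__ == "__main__":
    counts, regret = ucb1([0.2, 0.5, 0.9], horizon=5000)
    print("pulls per arm:", counts)
    print("pseudo-regret:", round(regret, 2))
```

In this sketch the suboptimal arms are pulled only often enough to rule them out, which is the behavior behind logarithmic regret in the classical setting; the paper's contribution concerns preserving that order of regret for honest participants when coordination runs over a blockchain with malicious participants.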

Paper Structure

This paper contains 38 sections, 7 theorems, 160 equations, 1 figure, 5 algorithms.

Key Result

Theorem 1

Assume that the total number of honest participants is at least $\frac{2}{3}M$ and that the validator set contains at least one honest participant. Assume further that the malicious participants perform existential forgery on the signatures of honest participants with ... where $L$ is the length of the burn-in period, which is of order $\log{T}$, $c > 0$ is the cost, and $C_1$ meets ...

Figures (1)

  • Figure 1: The flow of the algorithm

Theorems & Definitions (32)

  • Theorem 1
  • proof : Proof sketch
  • Remark
  • Remark
  • Remark
  • Remark
  • Remark
  • Remark
  • Theorem 2
  • proof : Proof Sketch
  • ...and 22 more