
Structured Reinforcement Learning for Incentivized Stochastic Covert Optimization

Adit Jain, Vikram Krishnamurthy

TL;DR

This work addresses how a learner can hide its stochastic gradient (SG) estimate from an eavesdropper by dynamically switching between a learning SG and an obfuscating SG, with the control problem formulated as a finite-horizon MDP. It proves that, under interval-dominance conditions on the cost and transition probability structure, the optimal policy has a monotone threshold structure in the number of remaining learning steps, and it provides two practical methods (SPSA and a multi-armed bandit approach) to estimate this policy. The approach is instantiated in a covert federated learning task for hate-speech classification, showing substantial gains in obfuscation effectiveness and reduced incentive expenditure compared to greedy or random policies. The framework has potential for broader distributed optimization settings where protecting the learning trajectory from adversaries is critical, with extensions to Bayesian social-learning scenarios suggested for future work.

Abstract

This paper studies how a stochastic gradient algorithm (SG) can be controlled to hide the estimate of the local stationary point from an eavesdropper. Such problems are of significant interest in distributed optimization settings like federated learning and inventory management. A learner queries a stochastic oracle and incentivizes the oracle to obtain noisy gradient measurements and perform SG. The oracle probabilistically returns either a noisy gradient of the function or a non-informative measurement, depending on the oracle state and incentive. The learner's query and incentive are visible to an eavesdropper who wishes to estimate the stationary point. This paper formulates the problem of the learner performing covert optimization by dynamically incentivizing the stochastic oracle and obfuscating the eavesdropper as a finite-horizon Markov decision process (MDP). Using conditions for interval-dominance on the cost and transition probability structure, we show that the optimal policy for the MDP has a monotone threshold structure. We propose searching for the optimal stationary policy with the threshold structure using a stochastic approximation algorithm and a multi-armed bandit approach. The effectiveness of our methods is numerically demonstrated on a covert federated learning hate-speech classification task.
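As a rough illustration of what a monotone threshold policy looks like, the Python sketch below parameterizes the policy by one threshold per oracle state. The names `threshold_policy`, `LEARN`, `OBFUSCATE`, and the example thresholds are hypothetical, and the direction of the threshold (learn once the number of remaining learning steps reaches the threshold) is an assumption made for illustration, not a statement of the paper's result.

```python
# Minimal sketch of a threshold policy. All names and values are
# illustrative assumptions, not taken from the paper.
LEARN, OBFUSCATE = "learn", "obfuscate"


def threshold_policy(remaining_steps, oracle_state, thresholds):
    """Pick an action from the remaining learning steps and oracle state.

    Assumed direction: learn once the number of remaining learning steps
    reaches the threshold for the current oracle state.
    """
    if remaining_steps >= thresholds[oracle_state]:
        return LEARN
    return OBFUSCATE


# Hypothetical thresholds for a two-state oracle (0 = poor, 1 = good).
example_thresholds = {0: 4, 1: 2}

# The whole policy is described by one number per oracle state, so a
# search over policies reduces to a search over these thresholds.
actions_in_good_state = [
    threshold_policy(b, 1, example_thresholds) for b in range(5)
]
```

Because the action switches at most once as `remaining_steps` grows, the policy is fully specified by the threshold vector, which is what makes a low-dimensional search (for instance by stochastic approximation or a bandit over candidate thresholds) plausible.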

Paper Structure

This paper contains 15 sections, 3 theorems, 19 equations, 1 figure, 1 table, and 1 algorithm.

Key Result

Theorem 1

For an oracle satisfying assumptions (O1-O3), to obtain an estimate $\hat{x}$ that achieves the objective \ref{eq:learnerobjective}, the learner needs to perform $M$ successful gradient steps (Def. \ref{def:succgradientstep}) with step size $\mu = \min\left(\frac{1}{\gamma},\frac{\epsilon}{2\sigma^2\gamma}\right)$, where $M
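The step-size expression in the theorem can be evaluated directly. The sketch below is a small Python helper under the assumption that $\gamma$, $\sigma$, and $\epsilon$ are positive scalars; their precise meaning comes from the paper's assumptions (O1-O3), which are not reproduced here. The function name `step_size` is hypothetical.

```python
def step_size(gamma: float, sigma: float, epsilon: float) -> float:
    """Evaluate mu = min(1/gamma, epsilon / (2 * sigma**2 * gamma)).

    Assumes gamma, sigma, and epsilon are positive scalars; their exact
    interpretation is defined by the paper's assumptions, not by this code.
    """
    if gamma <= 0 or sigma <= 0 or epsilon <= 0:
        raise ValueError("gamma, sigma, and epsilon must be positive")
    return min(1.0 / gamma, epsilon / (2.0 * sigma**2 * gamma))


# With large noise the second term is the binding one ...
mu_noisy = step_size(gamma=2.0, sigma=1.0, epsilon=0.5)
# ... while with small noise the step size is capped at 1/gamma.
mu_clean = step_size(gamma=2.0, sigma=0.1, epsilon=0.5)
```

The two calls show the two regimes of the minimum: the noise-limited term $\frac{\epsilon}{2\sigma^2\gamma}$ and the cap $\frac{1}{\gamma}$.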

Figures (1)

  • Figure 1: Dynamic Covert Optimization: The learner sends query $q_k$ and incentive $i_k$ to the oracle in state $o_k$. The oracle evaluates a noisy gradient of $f$ at $q_k$ and returns $r_k$ according to \ref{eq:oraclereply}. An eavesdropper observes $q_k$ and $i_k$ and aims to approximate the learner's estimate. The learner needs to control the incentive $i_k$ and the type of SG ($a_k$) used to form the query via \ref{eq:query} in order to achieve the learning objective of \ref{eq:learnerobjective} and obfuscate the eavesdropper, whose belief is given by \ref{eq:eavesestimate}.
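The interaction loop in Figure 1 can be mimicked with a toy simulation. The Python sketch below is only a stand-in: the objective $f(x) = \frac{1}{2}x^2$, the oracle's success probability, the two-state oracle chain, the decoy-driven obfuscating iterate, and the placeholder control rule are all assumptions made for illustration and are not the paper's model or its optimal policy. The eavesdropper's belief update is not modeled; only what it observes is recorded.

```python
import random


def oracle_reply(q, incentive, oracle_state, rng, sigma=0.1):
    """Return a noisy gradient of the toy f(x) = 0.5 * x**2 at q, or None."""
    # Assumed success model: a good oracle state (1) and a higher incentive
    # both raise the chance of an informative reply.
    p_success = min(1.0, 0.3 + 0.3 * oracle_state + 0.2 * incentive)
    if rng.random() < p_success:
        return q + rng.gauss(0.0, sigma)  # noisy gradient r_k
    return None  # non-informative measurement


def simulate(horizon=200, steps_needed=30, mu=0.2, decoy=-3.0, seed=0):
    rng = random.Random(seed)
    x_learn, x_obf = 5.0, 5.0  # learning and obfuscating iterates
    remaining = steps_needed   # successful learning steps still required
    oracle_state = 0
    observed = []              # the eavesdropper sees only (q_k, i_k)
    for _ in range(horizon):
        # Placeholder control rule (an assumption, not the optimal policy):
        # learn only while steps remain and the oracle is in the good state.
        learn = remaining > 0 and oracle_state == 1
        incentive = 1 if learn else 0
        q = x_learn if learn else x_obf
        observed.append((q, incentive))
        r = oracle_reply(q, incentive, oracle_state, rng)
        if learn and r is not None:
            x_learn -= mu * r  # successful gradient step
            remaining -= 1
        elif not learn:
            # Assumed obfuscation: a synthetic step toward a decoy point.
            x_obf -= mu * (x_obf - decoy)
        # Assumed two-state Markov chain for the oracle state.
        if rng.random() > 0.8:
            oracle_state = 1 - oracle_state
    return x_learn, remaining, observed


x_final, steps_left, trace = simulate()
```

The returned `trace` is the sequence of queries and incentives visible to the eavesdropper, which is the quantity a covert policy would try to make uninformative about `x_final`.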

Theorems & Definitions (5)

  • Definition 1
  • Theorem 1
  • Theorem 2
  • proof
  • Theorem 3