A Reinforcement Learning Approach in Multi-Phase Second-Price Auction Design

Rui Ai; Boxiang Lyu; Zhaoran Wang; Zhuoran Yang; Michael I. Jordan

TL;DR

A combination of a new technique named "buffer periods" and ideas from Reinforcement Learning with low switching cost is used to limit bidders' surplus from untruthful bidding, thereby incentivizing approximately truthful bidding.

Abstract

We study reserve price optimization in multi-phase second-price auctions, where the seller's prior actions affect the bidders' later valuations through a Markov Decision Process (MDP). Compared to the bandit setting in existing works, our setting involves three challenges. First, from the seller's perspective, we need to efficiently explore the environment in the presence of potentially untruthful bidders who aim to manipulate the seller's policy. Second, we want to minimize the seller's revenue regret when the market noise distribution is unknown. Third, the seller's per-step revenue is an unknown, nonlinear random variable that cannot be observed directly from the environment; only its realized values are available. We propose a mechanism addressing all three challenges. To address the first challenge, we use a combination of a new technique named "buffer periods" and ideas from Reinforcement Learning (RL) with low switching cost to limit bidders' surplus from untruthful bidding, thereby incentivizing approximately truthful bidding. The second challenge is tackled by a novel algorithm that removes the need for pure exploration when the market noise distribution is unknown. The third challenge is resolved by an extension of LSVI-UCB, where we use the auction's underlying structure to control the uncertainty of the revenue function. The three techniques culminate in the Contextual-LSVI-UCB-Buffer (CLUB) algorithm, which achieves $\tilde{O}(H^{5/2}\sqrt{K})$ revenue regret when the market noise is known and $\tilde{O}(H^{3}\sqrt{K})$ revenue regret when the noise is unknown, where $K$ is the number of episodes and $H$ is the length of each episode. Neither bound requires any assumption on bidders' truthfulness.
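To make the setting concrete, here is a minimal sketch of a single round of a single-item second-price auction with a reserve price, using the standard textbook payment rule. It is not the paper's mechanism: the function name `second_price_revenue`, the use of one common reserve for all bidders, and the sample bids are illustrative assumptions, and the MDP dynamics are ignored.

```python
from typing import Sequence


def second_price_revenue(bids: Sequence[float], reserve: float) -> float:
    """Seller revenue for one single-item second-price auction.

    Standard rule (assumed here, with a single common reserve):
    the item sells only if the highest bid meets the reserve, and the
    winner pays the larger of the reserve and the second-highest bid.
    """
    if not bids:
        return 0.0
    ranked = sorted(bids, reverse=True)
    highest = ranked[0]
    if highest < reserve:
        return 0.0  # no bid clears the reserve, so there is no sale
    second = ranked[1] if len(ranked) > 1 else 0.0
    return max(reserve, second)


if __name__ == "__main__":
    sample_bids = [0.9, 0.6, 0.3]  # hypothetical bids
    for reserve in (0.5, 0.7, 1.0):
        print(reserve, second_price_revenue(sample_bids, reserve))
```

With these hypothetical bids, raising the reserve from 0.5 to 0.7 raises revenue from 0.6 to 0.7, while raising it to 1.0 loses the sale entirely. This is one way to see why per-step revenue is a nonlinear function of the reserve price.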

Paper Structure (47 sections, 41 theorems, 134 equations, 4 figures, 2 tables, 8 algorithms)

Introduction
Related Works
Preliminaries
Known Market Noise Distribution
CLUB Algorithm When $F$ is Known
Regret Bound When $F$ is Known
Unknown Market Noise Distribution
CLUB Algorithm When $F$ is Unknown
Regret Bound of CLUB Algorithm When $F$ is Unknown
Proof Sketch
Regret Decomposition
Proof Techniques
Step 1: Limit the magnitude of untruthful reporting.
Step 2: Control the number of times $q_{ih}^{k}$ changes due to untruthfulness.
Step 3: Prove the estimates of personal parameters and noise distribution are good.
...and 32 more sections

Key Result

Theorem 6

Under the assumptions labeled assumption:linearmdp, assumptionf, assumptionfdiff, and logconcave, for any fixed failure probability $\delta\in(0,1)$, with probability at least $1 - \delta$, the algorithm labeled algo:KnownF achieves at most $\tilde{\mathcal{O}} (\sqrt{H^5K})$ revenue regret, where $\tilde{\mathcal{O}} (\cdot)$ hides only absolut

Figures (4)

Figure 1: Learning periods and buffer periods: $\mathtt{buffer.s(\cdot)}$ and $\mathtt{buffer.e(\cdot)}$ represent the start point and the end point of a buffer, respectively. Episode $k$ lies between $\mathtt{buffer.e(\tilde{k})}$ and $\mathtt{buffer.s(\tilde{k}+1)}$, and the length of each buffer is $\frac{3\log K}{\log(1/\gamma)}$.
Figure 2: Experiment results for the contextual bandit setting. The first panel compares the revenue achieved by CLUB with the benchmark (the maximum revenue when everything is common knowledge), showing that CLUB obtains more than 98% of the benchmark revenue. The second panel shows the sublinear regret of CLUB, as the curve grows more slowly than linearly. The third panel shows that CLUB is comparable with NPAC-S and clearly outperforms SCORP.
Figure 3: Experiment results for the MDP setting. The first panel compares the revenue achieved by CLUB with the benchmark (the maximum revenue when everything is common knowledge), showing that CLUB obtains more than 98% of the benchmark revenue. The second panel shows the sublinear regret of CLUB, as the curve grows more slowly than linearly. The third panel shows that CLUB has less regret than NPAC-S, which supports its optimality.
Figure 4: Experiment results for the contextual bandit setting under truncated Gaussian noise distribution.
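
The buffer length given in the Figure 1 caption, $\frac{3\log K}{\log(1/\gamma)}$, can be evaluated numerically. The sketch below is only an illustration: the function name `buffer_length`, the rounding up to an integer, and the sample values of $K$ and $\gamma$ are assumptions, and $\gamma\in(0,1)$ is read here as the bidders' discount factor, which the caption itself does not state.

```python
import math


def buffer_length(num_episodes: int, gamma: float) -> int:
    """Evaluate 3 * log(K) / log(1 / gamma), rounded up to an integer.

    Rounding up is an assumption of this sketch; the caption gives only
    the unrounded expression.
    """
    if num_episodes < 1:
        raise ValueError("num_episodes must be at least 1")
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    return math.ceil(3.0 * math.log(num_episodes) / math.log(1.0 / gamma))


if __name__ == "__main__":
    # Hypothetical values of K and gamma, chosen only for illustration.
    for k, g in ((1000, 0.5), (1000, 0.9), (100000, 0.5)):
        print(k, g, buffer_length(k, g))
```

Because the expression grows only logarithmically in $K$, a buffer of this length stays short relative to the total number of episodes, and it becomes longer as $\gamma$ approaches 1.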

Theorems & Definitions (44)

Remark 5
Theorem 6
Theorem 8
Proposition 9
Lemma 10
Lemma 11
Lemma 12
Lemma 13
Lemma 14
Lemma 15
...and 34 more
