Inverse Contextual Bandits without Rewards: Learning from a Non-Stationary Learner via Suffix Imitation

Yuqi Kong; Xiao Zhang; Weiran Shen

TL;DR

A reward-free observer can achieve a convergence rate of $\tilde O(1/\sqrt{N})$, matching the asymptotic efficiency of a fully reward-aware learner and attaining performance comparable to that of the learner itself.

Abstract

We study the Inverse Contextual Bandit (ICB) problem, in which a learner seeks to optimize a policy while an observer, who cannot access the learner's rewards and only observes actions, aims to recover the underlying problem parameters. During the learning process, the learner's behavior naturally transitions from exploration to exploitation, resulting in non-stationary action data that poses significant challenges for the observer. To address this issue, we propose a simple and effective framework called Two-Phase Suffix Imitation. The framework discards data from an initial burn-in phase and performs empirical risk minimization using only data from a subsequent imitation phase. We derive a predictive decision loss bound that explicitly characterizes the bias-variance trade-off induced by the choice of burn-in length. Despite the severe information deficit, we show that a reward-free observer can achieve a convergence rate of $\tilde O(1/\sqrt{N})$, matching the asymptotic efficiency of a fully reward-aware learner. This result demonstrates that a passive observer can effectively uncover the optimal policy from actions alone, attaining performance comparable to that of the learner itself.
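The two-phase idea described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's algorithm: the function names (`fit_suffix_imitation`, `run_demo`), the softmax cross-entropy surrogate used for empirical risk minimization, and the toy learner (uniform exploration followed by greedy play) are all hypothetical choices made for this example.

```python
import numpy as np

def fit_suffix_imitation(contexts, actions, burn_in, epochs=300, lr=0.5):
    """Fit a linear scoring vector on the suffix only.

    contexts: array (N, K, d) of arm features; actions: int array (N,).
    Rounds before `burn_in` are discarded. The softmax cross-entropy loss is
    an assumption of this sketch, standing in for the paper's ERM objective.
    """
    X = contexts[burn_in:]              # imitation-phase contexts, (n, K, d)
    a = actions[burn_in:]               # imitation-phase actions, (n,)
    n, _, d = X.shape
    rows = np.arange(n)
    theta = np.zeros(d)
    for _ in range(epochs):
        scores = X @ theta                                   # (n, K)
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(scores)
        probs /= probs.sum(axis=1, keepdims=True)
        expected = np.einsum("nk,nkd->nd", probs, X)         # model-average feature
        grad = (expected - X[rows, a]).mean(axis=0)          # gradient of mean NLL
        theta -= lr * grad
    return theta

def run_demo(seed=0, N=600, K=4, d=5, burn_in=200):
    """Toy check: the observer sees actions only, never rewards."""
    rng = np.random.default_rng(seed)
    theta_star = rng.normal(size=d)
    theta_star /= np.linalg.norm(theta_star)

    def sample_contexts(m):
        x = rng.normal(size=(m, K, d))
        return x / np.linalg.norm(x, axis=2, keepdims=True)  # unit-norm features

    contexts = sample_contexts(N)
    greedy = np.argmax(contexts @ theta_star, axis=1)
    # Hypothetical non-stationary learner: random early, greedy afterwards.
    explore = rng.integers(K, size=N)
    actions = np.where(np.arange(N) < burn_in, explore, greedy)

    theta_hat = fit_suffix_imitation(contexts, actions, burn_in)
    test = sample_contexts(2000)
    agreement = np.mean(
        np.argmax(test @ theta_hat, axis=1) == np.argmax(test @ theta_star, axis=1)
    )
    return theta_hat, float(agreement)
```

Calling `run_demo()` returns the fitted vector and the fraction of fresh contexts on which the observer's greedy choice matches the optimal one. In this toy setting the burn-in length equals the learner's exploration length; in general it is a tuning choice, since a short burn-in keeps biased early data and a long one leaves fewer samples.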

Paper Structure

This paper contains 25 sections, 7 theorems, 41 equations, 10 figures, and 1 algorithm.

Introduction
Related Work
Preliminary
Linear Contextual Bandits
Inverse Contextual Bandits
Two-Phase Suffix Imitation
Learner's Algorithm
Observer's Algorithm
Provable Guarantees for the Observer’s Learned Policy
Experiments
Experimental Setup
Results
Impact of Burn-in Length
Asymptotic Convergence and Interpretability
Conclusion
...and 10 more sections

Key Result

Lemma 1

Assume $\|x_a\|_2\le 1$ for all $a,t$ and $\|\theta^\star\|_2\le 1$. Then for any policy $\pi$, the predictive regret is bounded by the clean risk:
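As a sanity check on the role of the norm assumptions (not the lemma's exact statement), the sketch below verifies numerically a standard Cauchy-Schwarz fact: when all features and $\theta^\star$ have norm at most 1, any two arms differ in expected reward by at most 2, so the average reward gap of a policy is at most twice the rate at which it misses the optimal arm. The helper names and the use of an arbitrary linear policy are assumptions of this example.

```python
import numpy as np

def unit_rows(rng, shape):
    """Sample Gaussian vectors and scale each to unit Euclidean norm."""
    x = rng.normal(size=shape)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def reward_gap_and_mismatch(theta_star, theta_hat, contexts):
    """Average reward gap and mismatch rate of the greedy policy under theta_hat.

    contexts: array (n, K, d); rewards are the linear means contexts @ theta_star.
    """
    rewards = contexts @ theta_star                  # (n, K)
    best = rewards.argmax(axis=1)                    # optimal arm per round
    chosen = (contexts @ theta_hat).argmax(axis=1)   # policy's arm per round
    rows = np.arange(len(contexts))
    gap = rewards[rows, best] - rewards[rows, chosen]
    return float(gap.mean()), float(np.mean(best != chosen))

def check_bound(seed=0, n=5000, K=10, d=8):
    rng = np.random.default_rng(seed)
    theta_star = unit_rows(rng, (d,))
    theta_hat = unit_rows(rng, (d,))     # arbitrary, possibly poor, policy parameter
    contexts = unit_rows(rng, (n, K, d))
    return reward_gap_and_mismatch(theta_star, theta_hat, contexts)
```

For any inputs satisfying the norm bounds, the returned pair `(gap, mismatch)` obeys `gap <= 2 * mismatch`, because the gap is zero on rounds where the policy picks the optimal arm and at most 2 otherwise.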

Figures (10)

Figure 1: Performance comparison with full horizon layout ($d=50$, $K=200$). The left block illustrates LinTS performance, while the right block displays LinUCB results.
Figure 2: Performance comparison of Observer strategies against the Learner baseline ($d=50$, $K=200$). Panel (a) shows the results for LinTS, and panel (b) for LinUCB. Both metrics demonstrate that the Observer (Best Achieved) outperforms the online Learner.
Figure 3: Diagnostic verification of the Massart noise assumption: (a) learner actions become increasingly predictable over time; (b) a late-trained observer agrees more with the learner on late test windows than on early ones (95% CIs over 20 seeds).
Figure 4: Performance comparison with full horizon layout ($d=20$, $K=50$).
Figure 5: Performance comparison with full horizon layout ($d=20$, $K=100$).
...and 5 more figures

Theorems & Definitions (20)

Definition 1
Definition 2
Definition 3: Predictive Regret
Example 2: LinUCB
Example 3: LinTS
Lemma 1
Proof
Lemma 2
Proof
Definition 4: Natarajan Dimension
...and 10 more
