Kernel Single-Index Bandits: Estimation, Inference, and Learning

Sakshi Arya; Satarupa Bhattacharjee; Bharath K. Sriperumbudur

Abstract

We study contextual bandits with finitely many actions in which the reward of each arm follows a single-index model with an arm-specific index parameter and an unknown nonparametric link function. We consider a regime in which arms correspond to stable decision options and covariates evolve adaptively under the bandit policy. This setting creates significant statistical challenges: the sampling distribution depends on the allocation rule, observations are dependent over time, and inverse-propensity weighting induces variance inflation. We propose a kernelized $\varepsilon$-greedy algorithm that combines Stein-based estimation of the index parameters with inverse-propensity-weighted kernel ridge regression for the reward functions. This approach enables flexible semiparametric learning while retaining interpretability. Our analysis develops new tools for inference with adaptively collected data. We establish asymptotic normality for the single-index estimator under adaptive sampling, yielding valid confidence regions, and derive a directional functional central limit theorem for the RKHS estimator, which provides asymptotically valid pointwise confidence intervals. The analysis relies on concentration bounds for inverse-weighted Gram matrices together with martingale central limit theorems. We further obtain finite-time regret guarantees, including $\tilde{O}(\sqrt{T})$ rates under common-link Lipschitz conditions, showing that semiparametric structure can be exploited without sacrificing statistical efficiency. These results provide a unified framework for simultaneous learning and inference in single-index contextual bandits.
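The Stein-based, inverse-propensity-weighted index estimate described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration, not the paper's construction: covariates are standard Gaussian (so the score function is $S(x) = x$), there are two arms with hypothetical directions and links, and the "greedy" arm is a fixed stand-in rule rather than a plug-in estimate, so the sketch does not exercise the adaptive dependence analyzed in the paper. The function names `ipw_stein_direction` and `simulate` are likewise hypothetical.

```python
import numpy as np

def ipw_stein_direction(X, y, arms, prop, arm):
    """IPW Stein estimate of one arm's index direction.

    Assumes X ~ N(0, I), so the score is S(x) = x and
    E[1{A=a}/pi * Y * X] is proportional to the arm's index vector.
    """
    w = (arms == arm) / prop            # inverse-propensity weights, 0 for other arms
    raw = (w * y) @ X / len(y)          # (1/T) * sum_t w_t * y_t * S(X_t)
    return raw / np.linalg.norm(raw)    # only the direction is identified

def simulate(T=20000, eps=0.2, seed=0):
    """Two-arm epsilon-greedy data with known propensities (illustrative setup)."""
    rng = np.random.default_rng(seed)
    betas = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])   # hypothetical true directions
    X = rng.standard_normal((T, 3))
    # Stand-in for the greedy rule: a fixed function of the covariates.
    greedy = (X[:, 1] > X[:, 0]).astype(int)
    explore = rng.random(T) < eps
    arms = np.where(explore, rng.integers(0, 2, size=T), greedy)
    # Probability of the arm actually pulled: greedy arm gets 1 - eps/2, the other eps/2.
    prop = np.where(arms == greedy, 1 - eps / 2, eps / 2)
    z = np.einsum("ij,ij->i", X, betas[arms])               # single index beta_a^T x
    # Hypothetical link functions: tanh for arm 0, z + 0.5 sin(z) for arm 1.
    mean = np.where(arms == 0, np.tanh(z), z + 0.5 * np.sin(z))
    y = mean + 0.1 * rng.standard_normal(T)
    return X, y, arms, prop, betas

if __name__ == "__main__":
    X, y, arms, prop, betas = simulate()
    for a in (0, 1):
        est = ipw_stein_direction(X, y, arms, prop, a)
        print(f"arm {a}: cosine with true direction = {est @ betas[a]:.3f}")
```

The weights are bounded by $2/\varepsilon$ in this two-arm setup, which is the variance inflation the abstract refers to: a smaller exploration rate gives larger weights and noisier direction estimates.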

Paper Structure (31 sections, 46 theorems, 455 equations, 11 figures, 7 tables, 3 algorithms)

Introduction
Problem Setup
Algorithm
Estimation and Inference for the Index Parameter
Asymptotic Inference for the Index Parameter
Delta Method for Directional Inference
Finite-Sample Control of the Index Estimator
Estimation and Inference for the Link Function
Studentized CLT for Finite Linear Projection
Controlling Regularization Bias
Directional FCLT for the Unknown Link Function
Construction of Pointwise Confidence Intervals via RKHS CLT
Comparison with Uniform RKHS Bounds
Finite-time Regret Analysis
Regret Decomposition
...and 16 more sections

Key Result

Proposition 1

Let $X\in {\mathbb R}^d$ be a real-valued random vector with a differentiable density $p$, and let $g:{\mathbb R}^d\to {\mathbb R}$ be a continuously differentiable function such that $\mathbb{E}[\nabla g(X)]$ exists. Then it holds that,
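For reference, the first-order Stein identity for a general density is usually written as below. The symbol $S_p$ for the score function is notation assumed here rather than taken from the source, and the usual regularity conditions (for example, $g(x)p(x)$ vanishing at infinity) are left implicit.

```latex
\[
  \mathbb{E}\bigl[g(X)\, S_p(X)\bigr] = \mathbb{E}\bigl[\nabla g(X)\bigr],
  \qquad
  S_p(x) := -\nabla \log p(x) = -\frac{\nabla p(x)}{p(x)} .
\]
```

For $X \sim N(0, I_d)$ the score is $S_p(x) = x$, which recovers the classical Gaussian identity $\mathbb{E}[g(X)\,X] = \mathbb{E}[\nabla g(X)]$.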

Figures (11)

Figure 1: Empirical joint $95\%$ coverage for the single-index direction for both arms.
Figure 2: Nonparametric pointwise inference for $(d,\sigma)=(2,0.05)$. Left: empirical $95\%$ coverage of K-SIEGE and A&S intervals at the selected inference times. Right: corresponding average interval lengths for each arm.
Figure 3: Directional inference summary on the Rice Classification dataset.
Figure 4: Nonparametric inference summary on the Rice Classification dataset. (a) Estimated success curves with pointwise $95\%$ confidence intervals at $t=900$ for a representative run. (b) Distribution of pointwise CI widths across replications, by arm and time.
Figure 5: Link functions $g_1$ and $g_2$ as functions of the single-index $z$.
...and 6 more figures

Theorems & Definitions (100)

Remark 1
Proposition 1: First-order Non-Gaussian Stein's Identity [bala_PMLR]
Remark 2
Remark 3
Theorem 1: Asymptotic inference for the index parameter
Theorem 2
Corollary 1: Feasible studentization
Remark 4
Theorem 3: Studentized vector martingale CLT in $\mathcal{H}_K$
Corollary 2: Scalar standardized CLT
...and 90 more
