Provably Efficient Interactive-Grounded Learning with Personalized Reward

Mengxiao Zhang; Yuheng Zhang; Haipeng Luo; Paul Mineiro

TL;DR

The paper advances Interactive-Grounded Learning by addressing personalized, context-dependent rewards and establishing sublinear regret guarantees under realizability. It introduces a Lipschitz reward estimator, built from uniformly explored data via inverse kinematics and empirical risk minimization (ERM), that supports two regret guarantees: an off-policy explore-then-exploit algorithm with $\mathcal{O}\left(T^{2/3}\right)$ regret, and an on-policy inverse-gap weighting algorithm with a similar bound plus a regression term. The methods are evaluated on image (MNIST) and text (conversational) feedback tasks, where the Lipschitz estimator outperforms a binary alternative and the learned policies closely approximate the optimal policy under the constructed rewards. The work connects personalized contextual feedback with provably efficient learning, which is relevant to interactive systems that never observe an explicit reward.
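
To give some intuition for where a $T^{2/3}$ rate typically comes from in explore-then-exploit schemes, here is a generic back-of-the-envelope balance. It is only a sketch: the number of exploration rounds $N$, the per-round exploitation error $\varepsilon(N)$, and the assumed $N^{-1/2}$ estimation rate are illustrative assumptions, with constants and the dependence on the number of actions and the function classes suppressed. The paper's actual analysis may differ.

$$
\mathrm{Reg}(T) \;\lesssim\; \underbrace{N}_{\text{exploration}} \;+\; \underbrace{T\,\varepsilon(N)}_{\text{exploitation}},
\qquad
\varepsilon(N) \asymp N^{-1/2}
\;\;\Longrightarrow\;\;
N \asymp T^{2/3},
\quad
\mathrm{Reg}(T) \;\lesssim\; T^{2/3}.
$$

Under these assumptions, exploring for fewer rounds leaves the estimation error too large, while exploring for more rounds wastes too many rounds on uniformly random actions; the two terms are equal at $N \asymp T^{2/3}$.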

Abstract

Interactive-Grounded Learning (IGL) [Xie et al., 2021] is a powerful framework in which a learner aims to maximize unobservable rewards by interacting with an environment and observing reward-dependent feedback on the actions it takes. To deal with personalized rewards, which are ubiquitous in applications such as recommendation systems, Maghakian et al. [2022] study a version of IGL with context-dependent feedback, but their algorithm does not come with theoretical guarantees. In this work, we consider the same problem and provide the first provably efficient algorithms with sublinear regret under realizability. Our analysis reveals that the step-function estimator of prior work can deviate uncontrollably due to finite-sample effects. Our solution is a novel Lipschitz reward estimator that underestimates the true reward and enjoys favorable generalization performance. Building on this estimator, we propose two algorithms, one based on explore-then-exploit and the other based on inverse-gap weighting. We apply IGL to learning from image feedback and learning from text feedback, which are reward-free settings that arise in practice. Experimental results showcase the importance of using our Lipschitz reward estimator and the overall effectiveness of our algorithms.
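
For readers unfamiliar with inverse-gap weighting, the following is a minimal sketch of the standard rule from the contextual bandit literature: each non-greedy action gets probability inversely proportional to its predicted reward gap, and the greedy action receives the remaining mass. The function name `inverse_gap_weighting`, the parameter `gamma`, and the example predictions are illustrative assumptions, and the paper's on-policy algorithm may use a different variant that plugs in its constructed rewards.

```python
import numpy as np


def inverse_gap_weighting(pred_rewards, gamma):
    """Standard inverse-gap weighting over K actions (illustrative sketch).

    pred_rewards: predicted rewards for each action in the current context.
    gamma: exploration parameter (hypothetical name); larger values
           concentrate more probability on the greedy action.
    """
    pred = np.asarray(pred_rewards, dtype=float)
    num_actions = pred.size
    best = int(np.argmax(pred))

    # Gap of every action to the greedy action (zero for the greedy one).
    gaps = pred[best] - pred

    # Non-greedy actions: probability 1 / (K + gamma * gap).
    probs = 1.0 / (num_actions + gamma * gaps)

    # Greedy action: whatever probability mass is left over.
    probs[best] = 0.0
    probs[best] = 1.0 - probs.sum()
    return probs


# Example usage with made-up predictions for three actions.
rng = np.random.default_rng(0)
probs = inverse_gap_weighting([0.2, 0.9, 0.5], gamma=10.0)
action = int(rng.choice(len(probs), p=probs))
```

Because each non-greedy probability is at most $1/K$, the greedy action always keeps at least $1/K$ of the mass, so the output is a valid distribution for any nonnegative `gamma`.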

Paper Structure (39 sections, 12 theorems, 43 equations, 1 figure, 1 table, 2 algorithms)

Introduction
Contributions.
Related Work
Interaction-Grounded Learning (IGL).
Contextual online learning with partial feedback.
Preliminary
Problem setup.
Feedback dependence assumption.
Realizability.
Identifiability.
Regret.
Other notations.
Methodology
Reward Estimator Construction via Uniform Exploration
Inverse Kinematics
...and 24 more sections

Key Result

Lemma 1

For any context $x\in{\mathcal{X}}$, suppose that the learner picks a uniformly random action $a\in[K]$. Let $r$ and $y$ be its realized reward and the corresponding feedback. Then, under the feedback dependence and realizability assumptions, the posterior distribution of $a$ given the context $x$ and feedback $y$ where $f^\star$ and $\phi^\star$ are the true expected reward and feedback decoder defined in assum

Figures (1)

Figure 1: Performance of the on-policy algorithm (alg:on_IGL) under true (unobserved) rewards and constructed rewards. Left: results on the MNIST dataset. Right: results on our conversational dataset.

Theorems & Definitions (19)

Lemma 1
Lemma 2
proof
Lemma 3
Lemma 4
Theorem 1
Theorem 2
Lemma 4
proof
Lemma 4
...and 9 more
