Interaction-Grounded Learning for Contextual Markov Decision Processes with Personalized Feedback

Mengxiao Zhang; Yuheng Zhang; Haipeng Luo; Paul Mineiro

TL;DR

This work extends Interaction-Grounded Learning (IGL) from single-step settings to multi-turn contextual episodic MDPs with personalized feedback. It introduces a two-stage framework: a three-step reward decoder (reachable-state identification, inverse kinematic learning, and Lipschitz estimation), followed by an Inverse-Gap-Weighting (IGW) online policy learner that uses the decoded rewards. The authors establish a sublinear regret guarantee of $\widetilde{\mathcal{O}}(T^{3/4})$ and validate the approach with experiments on synthetic MDPs and a real user-booking dataset, showing that the decoder provides a faithful lower bound on latent rewards and enables near-optimal policy learning. The framework targets learning from implicit feedback in sequential settings such as LLM-driven conversations, where the objective is personalized and preference-aware.

Abstract

In this paper, we study Interaction-Grounded Learning (IGL) [Xie et al., 2021], a paradigm designed for realistic scenarios where the learner receives indirect feedback generated by an unknown mechanism, rather than explicit numerical rewards. While prior work on IGL provides efficient algorithms with provable guarantees, those results are confined to single-step settings, restricting their applicability to modern sequential decision-making systems such as multi-turn Large Language Model (LLM) deployments. To bridge this gap, we propose a computationally efficient algorithm that achieves a sublinear regret guarantee for contextual episodic Markov Decision Processes (MDPs) with personalized feedback. Technically, we extend the reward-estimator construction of Zhang et al. [2024a] from the single-step to the multi-step setting, addressing the unique challenges of decoding latent rewards under MDPs. Building on this estimator, we design an Inverse-Gap-Weighting (IGW) algorithm for policy optimization. Finally, we demonstrate the effectiveness of our method in learning personalized objectives from multi-turn interactions through experiments on both a synthetic episodic MDP and a real-world user booking dataset.
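To make the Inverse-Gap-Weighting idea concrete, the sketch below shows the generic single-step IGW rule commonly used in contextual bandits: each non-greedy action is sampled with probability inversely proportional to its estimated reward gap, and the greedy action receives the remaining mass. This is only an illustration under assumptions, not the paper's multi-step algorithm; the function name `inverse_gap_weighting`, the exploration parameter `gamma`, and the input values are hypothetical.

```python
def inverse_gap_weighting(estimated_rewards, gamma):
    """Generic IGW distribution over K actions (illustrative sketch).

    estimated_rewards: list of estimated (e.g. decoded) rewards, one per action.
    gamma: non-negative exploration parameter; larger values exploit more.
    """
    k = len(estimated_rewards)
    # Greedy action under the current reward estimates.
    best = max(range(k), key=lambda a: estimated_rewards[a])
    probs = [0.0] * k
    for a in range(k):
        if a != best:
            gap = estimated_rewards[best] - estimated_rewards[a]
            # Probability shrinks as the estimated gap to the greedy action grows.
            probs[a] = 1.0 / (k + gamma * gap)
    # The greedy action takes whatever probability mass is left.
    probs[best] = 1.0 - sum(probs)
    return probs


probs = inverse_gap_weighting([0.9, 0.5, 0.1], gamma=10.0)
```

With `gamma = 0` the rule reduces to uniform exploration; as `gamma` grows, the distribution concentrates on the greedy action.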

Paper Structure (49 sections, 16 theorems, 72 equations, 2 figures, 4 algorithms)


Introduction
Our Contribution
Related Works
Interaction-Grounded Learning and Learning from Implicit Feedback
Contextual Bandits and MDPs
Preliminaries
Learning Protocol.
Feedback Structure.
Realizability.
Identifiability for States.
Goal.
Other Notations.
Reward Decoder Learning
Step 1: Reachable State Identification
Step 2: Inverse Kinematic Learning
...and 34 more sections

Key Result

Lemma 3.0

Applying the homing algorithm (label alg:homing in the paper) to each $s\in{\mathcal{S}}_H$ with $N=\frac{C\cdot SKH\log(SKH/\delta)}{\varepsilon^2}$ episodes guarantees that, with probability at least $1-\delta$, the output policy set $\{\widehat{\pi}_s\}_{s\in{\mathcal{S}}_H}$ satisfies $P^{\widehat{\pi}_s}(s)\geq p_s^\star - \varepsilon$.
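As a rough illustration of how the episode count $N$ in this lemma scales, the sketch below evaluates the formula numerically. It rests on assumptions: $S$, $K$, and $H$ are read as the number of states, the number of actions, and the horizon (this excerpt does not define them), and the unspecified constant $C$ is set to 1 purely for illustration. The helper name `homing_episodes` is hypothetical.

```python
import math


def homing_episodes(S, K, H, delta, epsilon, C=1.0):
    """Evaluate N = C * S*K*H * log(S*K*H / delta) / epsilon^2, rounded up.

    C is an unspecified absolute constant in the lemma; C=1.0 is a placeholder.
    """
    skh = S * K * H
    return math.ceil(C * skh * math.log(skh / delta) / epsilon ** 2)


# Halving epsilon roughly quadruples the number of episodes required.
n_coarse = homing_episodes(S=10, K=3, H=5, delta=0.05, epsilon=0.1)
n_fine = homing_episodes(S=10, K=3, H=5, delta=0.05, epsilon=0.05)
```

The dependence on $\delta$ is only logarithmic, so tightening the failure probability is much cheaper than tightening $\varepsilon$.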

Figures (2)

Figure 1: Illustration of Lipschitz reward decoder $J(v,1)$ when $v_1 \in [0, 2/3]$, $v_3=1/3$, $K=3$, $M=1$, $\theta=0.6$, $c=0.2$, and $\kappa=0.2$.
Figure 2: Average reward during policy learning on synthetic and real datasets.

Theorems & Definitions (24)

Lemma 3.0
Lemma 3.0
Lemma 3.0
Lemma 3.0
Lemma 3.0
Lemma 3.0: Lemma 3 of zhang2024provably
Lemma 3.0
Theorem 4.3
Lemma A.0
proof
...and 14 more
