Interaction-Grounded Learning for Contextual Markov Decision Processes with Personalized Feedback
Mengxiao Zhang, Yuheng Zhang, Haipeng Luo, Paul Mineiro
TL;DR
This work extends Interaction-Grounded Learning from the single-step setting of prior analyses to multi-turn contextual episodic MDPs with personalized feedback. It introduces a two-stage framework: a three-step reward decoder (reachable-state identification, inverse kinematics, Lipschitz estimation) and an IGW-based online policy learner that uses the decoded rewards. The authors establish a sublinear regret guarantee of $\widetilde{\mathcal{O}}(T^{3/4})$ and validate the approach with experiments on synthetic MDPs and a real user-booking dataset, showing that the decoder provides a faithful lower bound on latent rewards and enables near-optimal policy learning. The framework supports learning from implicit feedback in complex sequential settings such as LLM-driven conversations, with practical relevance to personalized, preference-aware decision processes.
Abstract
In this paper, we study Interaction-Grounded Learning (IGL) [Xie et al., 2021], a paradigm designed for realistic scenarios where the learner receives indirect feedback generated by an unknown mechanism, rather than explicit numerical rewards. While prior work on IGL provides efficient algorithms with provable guarantees, those results are confined to single-step settings, restricting their applicability to modern sequential decision-making systems such as multi-turn Large Language Model (LLM) deployments. To bridge this gap, we propose a computationally efficient algorithm that achieves a sublinear regret guarantee for contextual episodic Markov Decision Processes (MDPs) with personalized feedback. Technically, we extend the reward-estimator construction of Zhang et al. [2024a] from the single-step to the multi-step setting, addressing the unique challenges of decoding latent rewards under MDPs. Building on this estimator, we design an Inverse-Gap-Weighting (IGW) algorithm for policy optimization. Finally, we demonstrate the effectiveness of our method in learning personalized objectives from multi-turn interactions through experiments on both a synthetic episodic MDP and a real-world user booking dataset.
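For readers unfamiliar with Inverse-Gap-Weighting, the following is a minimal sketch of the standard IGW action-sampling rule from the contextual bandit literature, applied to a single decision step. It is an illustration only, not the paper's algorithm: the function name `inverse_gap_weighting`, the parameter `gamma`, and the example reward values are assumptions introduced here, and the paper's version operates on decoded rewards in an episodic MDP, with details that may differ.

```python
import numpy as np

def inverse_gap_weighting(est_rewards, gamma):
    """Return an action distribution from estimated rewards (hypothetical helper).

    Standard IGW rule: each non-greedy action a gets probability
    1 / (K + gamma * (max_reward - est_rewards[a])), and the greedy
    action receives the remaining probability mass.
    """
    est_rewards = np.asarray(est_rewards, dtype=float)
    k = est_rewards.size
    best = int(np.argmax(est_rewards))
    gaps = est_rewards[best] - est_rewards   # non-negative reward gaps
    probs = 1.0 / (k + gamma * gaps)         # inverse-gap weights
    probs[best] = 0.0                        # exclude greedy action for now
    probs[best] = 1.0 - probs.sum()          # greedy action takes the rest
    return probs

# Assumed example: three actions with made-up reward estimates.
probs = inverse_gap_weighting([0.9, 0.5, 0.1], gamma=10.0)
rng = np.random.default_rng(0)
action = rng.choice(len(probs), p=probs)
```

A larger `gamma` concentrates mass on the greedy action, while `gamma = 0` reduces to uniform exploration; how this parameter is tuned in the paper is not specified in this section.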
