Provable Interactive Learning with Hindsight Instruction Feedback

Dipendra Misra, Aldo Pacchiano, Robert E. Schapire

TL;DR

This work formalizes interactive learning from hindsight instruction (LHI), where a teacher labels an agent's response with a suitable instruction instead of providing explicit expert actions or rewards. The authors prove a general lower bound showing regret can scale with the size of the response space, motivating structural assumptions. They then introduce LORIL, a low-rank-aware algorithm that achieves sublinear regret of order $\tilde{O}(\sqrt{T})$ with dependence on the intrinsic rank $d$ rather than the size of the response space, and provide finite-sample regret guarantees under realizability and bounded-feature assumptions. Empirical results on synthetic and image-instruction tasks show LORIL outperforms baselines and remains effective even when the low-rank condition is violated, highlighting the potential of hindsight-label feedback for efficient instruction following. The work lays a foundation for further algorithmic development in hindsight-based supervision with practical implications for robotics and language-grounded systems.

Abstract

We study interactive learning in a setting where the agent has to generate a response (e.g., an action or trajectory) given a context and an instruction. In contrast to typical approaches that train the system using reward or expert supervision on the response, we study learning with hindsight instruction, where a teacher provides an instruction that is most suitable for the agent's generated response. This hindsight labeling of instructions is often easier to provide than expert supervision of the optimal response, which may require expert knowledge or can be impractical to elicit. We initiate the theoretical analysis of interactive learning with hindsight labeling. We first provide a lower bound showing that, in general, the regret of any algorithm must scale with the size of the agent's response space. We then study a specialized setting where the underlying instruction-response distribution can be decomposed as a low-rank matrix. We introduce an algorithm called LORIL for this setting and show that its regret scales as $\sqrt{T}$, where $T$ is the number of rounds, and depends on the intrinsic rank but not on the size of the agent's response space. We provide experiments in two domains showing that LORIL outperforms baselines even when the low-rank assumption is violated.
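To make the low-rank assumption concrete, here is a minimal sketch of one way an instruction-response distribution can factor through a small latent dimension. The mixture construction, the function names, and the sizes are all illustrative assumptions; the paper's exact parameterization may differ.

```python
import random

def random_simplex(n, rng):
    """Draw a random probability vector of length n."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def low_rank_teacher(num_instructions, num_responses, d, seed=0):
    """Build a toy teacher distribution P[y][x] = sum_k g[y][k] * f[k][x].

    Assumption (not from the paper): each response y mixes d latent
    factors, and each factor k has its own distribution over instructions.
    """
    rng = random.Random(seed)
    f = [random_simplex(num_instructions, rng) for _ in range(d)]   # d x |X|
    g = [random_simplex(d, rng) for _ in range(num_responses)]      # |Y| x d
    P = [
        [sum(g[y][k] * f[k][x] for k in range(d)) for x in range(num_instructions)]
        for y in range(num_responses)
    ]
    return f, g, P

f, g, P = low_rank_teacher(num_instructions=6, num_responses=50, d=2)
# Every row of P is a valid distribution over instructions.
row_sums = [sum(row) for row in P]
```

In this toy construction the 50 x 6 matrix `P` has rank at most `d = 2`, because every row is a combination of the same two factor distributions in `f`.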

Paper Structure (37 sections, 17 theorems, 71 equations, 3 figures, 2 tables, 2 algorithms)


Key Result

Theorem 1

Let $T \geq 256\log(2e)$ and $K \geq 8e$. For any algorithm, there is at least one stochastic world $W_{\hat{i}}$ such that $\textrm{Reg}(T) \geq \frac{\sqrt{KT}}{8}$ with probability at least $1/4e$.
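As a quick numeric reading of the bound, the sketch below checks the two preconditions and evaluates $\sqrt{KT}/8$. The excerpt does not define $K$; based on the abstract it is presumably the size of the response space, and that reading is an assumption here. The function names are hypothetical, and $\log$ is taken to be the natural logarithm.

```python
import math

def lower_bound_applies(T, K):
    """Check the preconditions T >= 256*log(2e) and K >= 8e."""
    return T >= 256 * math.log(2 * math.e) and K >= 8 * math.e

def regret_lower_bound(T, K):
    """Evaluate sqrt(K*T)/8 when the preconditions of the theorem hold."""
    if not lower_bound_applies(T, K):
        raise ValueError("Theorem 1 requires T >= 256*log(2e) and K >= 8e")
    return math.sqrt(K * T) / 8

bound = regret_lower_bound(T=10_000, K=100)
print(bound)  # 125.0
```

Under these assumptions the preconditions amount to roughly $T \geq 434$ and $K \geq 22$, and for fixed $T$ the bound grows as $\sqrt{K}$.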

Figures (3)

  • Figure 1: A sketch of our interactive Learning from Hindsight Instruction (LHI) setting. The agent interacts with the world iteratively. In each round (or time step), the agent is given an instruction $x_t$ and a context $s_t$. In our case, the context $s_t$ is the house layout. In response, the agent generates a trajectory (response) $y_t$, which is then labeled by a teacher model with an instruction $x'_t$ (hindsight instruction). The agent never receives any expert response or rewards.
  • Figure 2: Comparison of ${\tt LORIL}$ against baselines on the controlled task. We run each baseline 3 times and report the average. The shaded areas show the standard deviation.
  • Figure 3: Results on the image classification task. We run each baseline 5 times and report the average performance. The shaded areas show the standard deviation.
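The interaction protocol described in the Figure 1 caption can be sketched as a simple loop. This is only an illustration of the feedback structure: the replay policy below is a naive stand-in and is not LORIL, and the toy teacher, the instruction set, and all names are hypothetical.

```python
import random

def run_lhi(num_rounds, instructions, responses, teacher, seed=0):
    """Simulate the LHI protocol with a naive replay agent (not LORIL)."""
    rng = random.Random(seed)
    memory = {}    # hindsight instruction -> a response that earned that label
    history = []
    for _ in range(num_rounds):
        x_t = rng.choice(instructions)   # instruction given to the agent
        s_t = None                       # context; unused in this toy example
        if x_t in memory:
            y_t = memory[x_t]            # replay a response labeled with x_t
        else:
            y_t = rng.choice(responses)  # otherwise try a random response
        x_hind = teacher(s_t, y_t)       # teacher labels the response in hindsight
        memory[x_hind] = y_t             # the only feedback is the label itself
        history.append((x_t, y_t, x_hind))
    return history

def toy_teacher(context, response):
    """Hypothetical deterministic teacher: describes the parity of the response."""
    return "even" if response % 2 == 0 else "odd"

history = run_lhi(200, ["even", "odd"], list(range(6)), toy_teacher)
```

Each entry of `history` holds the given instruction, the agent's response, and the hindsight instruction; no reward or expert response appears anywhere in the loop, which matches the setting described in the caption.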

Theorems & Definitions (17)

  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Lemma 1
  • Theorem 2
  • Corollary 2
  • Proposition 3
  • Lemma 2
  • Corollary 4
  • Lemma 3
  • ...and 7 more