Causally Robust Reward Learning from Reason-Augmented Preference Feedback

Minjune Hwang, Yigit Korkmaz, Daniel Seita, Erdem Bıyık

TL;DR

ReCouPLe is a lightweight framework that uses natural language rationales to provide the causal signal missing from binary preference feedback. It outperforms baselines by up to 1.5x in reward accuracy under distribution shifts, and 2x in downstream policy performance in novel tasks.

Abstract

Preference-based reward learning is widely used for shaping agent behavior to match a user's preference, yet its sparse binary feedback makes it especially vulnerable to causal confusion. The learned reward often latches onto spurious features that merely co-occur with preferred trajectories during training, collapsing when those correlations disappear or reverse at test time. We introduce ReCouPLe, a lightweight framework that uses natural language rationales to provide the missing causal signal. Each rationale is treated as a guiding projection axis in an embedding space, training the model to score trajectories based on features aligned with that axis while de-emphasizing context that is unrelated to the stated reason. Because the same rationales (e.g., "avoids collisions", "completes the task faster") can appear across multiple tasks, ReCouPLe naturally reuses the same causal direction whenever tasks share semantics, and transfers preference knowledge to novel tasks without extra data or language-model fine-tuning. Our learned reward model can ground preferences on the articulated reason, aligning better with user intent and generalizing beyond spurious features. ReCouPLe outperforms baselines by up to 1.5x in reward accuracy under distribution shifts, and 2x in downstream policy performance in novel tasks. We have released our code at https://github.com/mj-hwang/ReCouPLe
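
As a rough illustration of the projection idea described in the abstract, the sketch below splits a trajectory embedding into a component aligned with a rationale embedding and a component orthogonal to it, then scores the trajectory using only the aligned part. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`decompose`, `reason_score`, `preference_prob`), the toy 3-dimensional embeddings, and the use of a Bradley-Terry preference model are all assumptions made for illustration; the released code may differ in every one of these choices.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def decompose(traj_emb, reason_emb):
    """Split traj_emb into parts parallel and orthogonal to reason_emb (hypothetical helper)."""
    norm_sq = dot(reason_emb, reason_emb)
    if norm_sq == 0.0:
        raise ValueError("reason embedding must be non-zero")
    scale = dot(traj_emb, reason_emb) / norm_sq
    aligned = [scale * r for r in reason_emb]
    orthogonal = [t - a for t, a in zip(traj_emb, aligned)]
    return aligned, orthogonal

def reason_score(traj_emb, reason_emb):
    """Signed length of the projection onto the unit-normalized reason axis."""
    return dot(traj_emb, reason_emb) / math.sqrt(dot(reason_emb, reason_emb))

def preference_prob(score_a, score_b):
    """Bradley-Terry probability that A is preferred over B (assumed preference model)."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))

# Toy embeddings (made up): axis 0 stands in for the feature named by the
# rationale, axis 1 for a distractor that merely co-occurs with preference.
reason = [1.0, 0.0, 0.0]
traj_a = [2.0, 5.0, 0.0]
traj_b = [1.0, -5.0, 0.0]

aligned_a, orth_a = decompose(traj_a, reason)
p_a_over_b = preference_prob(reason_score(traj_a, reason),
                             reason_score(traj_b, reason))
```

In this toy setup the large distractor values on axis 1 end up entirely in the orthogonal component, so the preference probability depends only on the difference along the rationale axis.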

Paper Structure (25 sections, 9 equations, 6 figures, 11 tables)

Figures (6)

  • Figure 1: Preference learning can be susceptible to causal confusion, especially in the presence of non-causal distractor features that merely co-occur with preferred trajectories. In the example above, the reward model struggles to identify which feature of a trajectory determined the user's preference. When a reason is provided, the agent can identify the causal feature.
  • Figure 2: ReCouPLe decomposes the task reward by orthogonally projecting the trajectory representation onto the reason language embedding, splitting the representation into reason-aligned and reason-orthogonal components. This allows the reward model to isolate the causal feature that the rationale gives as the explanation for the user's preference.
  • Figure 3: ManiSkill policy learning results, averaged over 3 seeds (mean $\pm$ std).
  • Figure 4: Meta-World policy learning on the held-out task, averaged over 3 seeds (mean $\pm$ std). Both ReCouPLe variants outperform the baselines, showing task transfer capability.
  • Figure 5: Terminal states for custom ManiSkill tasks. Each column represents a specific task, with the top row showing preferred trajectories (manipulating the larger cube) and the bottom row showing non-preferred trajectories (manipulating the smaller cube). Tasks and their respective color confounders are defined in \ref{tab:cube-color}.
  • ...and 1 more figures