PrefMMT: Modeling Human Preferences in Preference-based Reinforcement Learning with Multimodal Transformers

Dezhong Zhao, Ruiqi Wang, Dayoon Suh, Taehyeon Kim, Ziqin Yuan, Byung-Cheol Min, Guohua Chen

TL;DR

This work tackles the challenge of modeling human preferences in preference-based RL by addressing the multimodal nature of robot trajectories. It introduces PrefMMT, a hierarchical multimodal transformer that decouples the state and action modalities, applies an intra-modal encoder to each, and fuses them with an inter-modal cross-attention module to produce a sequence of non-Markovian rewards weighted by multimodal attention. The model is trained with a Bradley–Terry likelihood and a cross-entropy loss, and is used in offline RL (IQL) with a sliding window of transitions. Experiments on AntMaze, D4RL Gym locomotion, and Meta-World show PrefMMT outperforming state-of-the-art PM baselines and, in some cases, even surpassing an oracle. The results underscore the importance of explicitly modeling both intra- and inter-modal dynamics for accurate preference credit assignment and more sample-efficient learning from real human feedback.
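
As a rough illustration of the training objective mentioned above, the sketch below computes a Bradley–Terry preference probability from two trajectory scores and the matching cross-entropy loss. It is a minimal NumPy sketch, not the authors' implementation: the function names, the use of an attention-weighted sum of per-step rewards as the trajectory score, and the label convention (1.0 if the first trajectory is preferred, 0.0 if the second, 0.5 as a soft label for indecision) are assumptions made for illustration.

```python
import numpy as np


def segment_score(rewards, weights):
    """Attention-weighted sum of per-step rewards for one trajectory segment.

    Both arguments are 1-D sequences of equal length. The weights stand in
    for learned attention weights (an assumption for this sketch).
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * rewards))


def bt_probability(score_a, score_b):
    """Bradley-Terry probability that segment A is preferred over segment B.

    Equivalent to exp(score_a) / (exp(score_a) + exp(score_b)).
    """
    return 1.0 / (1.0 + np.exp(-(score_a - score_b)))


def preference_cross_entropy(score_a, score_b, label, eps=1e-12):
    """Cross-entropy between a preference label and the Bradley-Terry prediction.

    label is assumed to be the target probability that A is preferred:
    1.0 (A preferred), 0.0 (B preferred), or 0.5 (indecision).
    """
    p = bt_probability(score_a, score_b)
    return float(-(label * np.log(p + eps) + (1.0 - label) * np.log(1.0 - p + eps)))


# Toy usage with made-up rewards and weights.
score_a = segment_score([0.2, 0.9, 0.4], [0.2, 0.5, 0.3])
score_b = segment_score([0.1, 0.3, 0.2], [0.3, 0.3, 0.4])
loss = preference_cross_entropy(score_a, score_b, label=1.0)
```

With equal scores the predicted probability is 0.5 and the loss equals log 2; the loss shrinks as the score gap grows in the direction the label favors.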

Abstract

Preference-based reinforcement learning (PbRL) shows promise in aligning robot behaviors with human preferences, but its success depends heavily on the accurate modeling of human preferences through reward models. Most methods adopt Markovian assumptions for preference modeling (PM), which overlook the temporal dependencies within robot behavior trajectories that impact human evaluations. While recent works have utilized sequence modeling to mitigate this by learning sequential non-Markovian rewards, they ignore the multimodal nature of robot trajectories, which consist of elements from two distinctive modalities: state and action. As a result, they often struggle to capture the complex interplay between these modalities that significantly shapes human preferences. In this paper, we propose a multimodal sequence modeling approach for PM by disentangling state and action modalities. We introduce a multimodal transformer network, named PrefMMT, which hierarchically leverages intra-modal temporal dependencies and inter-modal state-action interactions to capture complex preference patterns. We demonstrate that PrefMMT consistently outperforms state-of-the-art PM baselines on locomotion tasks from the D4RL benchmark and manipulation tasks from the Meta-World benchmark.
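
To make the hierarchy described in the abstract more concrete, the sketch below separates a trajectory into state and action sequences, runs self-attention within each modality, and then lets the encoded states attend to the encoded actions before reading out one reward per time step. This is a toy NumPy sketch under strong assumptions, not the PrefMMT architecture: it uses a single attention head, random unlearned projections, a causal mask, and a hypothetical embedding width `d`, and the names `causal_attention` and `toy_multimodal_encoder` are invented for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Softmax along the given axis, shifted by the max for stability."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Returns (output, weights), where weights[t, s] is zero for s > t,
    so step t only attends to steps up to and including t.
    """
    num_steps, dim = queries.shape
    scores = queries @ keys.T / np.sqrt(dim)
    future = np.triu(np.ones((num_steps, num_steps), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights


def toy_multimodal_encoder(states, actions, rng, d=8):
    """Toy intra-modal then inter-modal attention over one trajectory.

    states: (T, state_dim), actions: (T, action_dim). All projection
    matrices are random placeholders for parameters that would be learned.
    """
    state_emb = states @ rng.normal(size=(states.shape[1], d))
    action_emb = actions @ rng.normal(size=(actions.shape[1], d))

    # Intra-modal step: temporal dependencies within each modality.
    state_enc, w_state = causal_attention(state_emb, state_emb, state_emb)
    action_enc, w_action = causal_attention(action_emb, action_emb, action_emb)

    # Inter-modal step: state queries attend to action keys and values.
    fused, w_inter = causal_attention(state_enc, action_enc, action_enc)

    # Linear readout to one scalar reward per time step.
    rewards = fused @ rng.normal(size=d)
    return rewards, w_state, w_action, w_inter


rng = np.random.default_rng(0)
states = rng.normal(size=(5, 3))   # 5 steps, 3-D states (made-up sizes)
actions = rng.normal(size=(5, 2))  # 5 steps, 2-D actions
rewards, w_state, w_action, w_inter = toy_multimodal_encoder(states, actions, rng)
```

Because each reward is read from a representation that mixes earlier time steps, it depends on more than the current state-action pair, which is the sense in which the rewards are non-Markovian.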

Paper Structure (21 sections, 10 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: Comparison of previous methods and our approach (PrefMMT) for preference modeling in PbRL. (a) Markovian Reward Modeling: Assumes that human preference for a trajectory $\sigma$ is based on the equal sum of individual evaluations at each time step. The goal is to learn a Markovian reward model that assigns rewards based solely on the immediate state-action pair. (b) Unimodal Sequence Modeling: Regards a trajectory as a sequence and learns a series of non-Markovian rewards that depend on all previously visited time steps. By learning unimodal attention weights $w^{uni}$ with unimodal transformer networks, this method aims to infer temporal dependencies within the trajectory and identify critical time steps that significantly influence human judgments. (c) Our Multimodal Sequence Modeling: Recognizes the multimodal nature of a trajectory, disentangling the state and action modalities. By learning multimodal attention weights $w^{mul}$ via a multimodal transformer architecture, our approach captures both temporal intra-modal dependencies and inter-modal interactions between states and actions within the trajectory, leading to more nuanced credit assignment for human preferences.
  • Figure 2: Illustration of the PrefMMT framework. Given a robot behavior trajectory as input, we first decouple the state and action modalities. Each unimodal sequence is then processed through an intra-modal encoder, where the temporal dependencies within the transitions of states and actions are explored. Subsequently, an inter-modal joint encoder captures the interactions between actions and states, outputting a series of non-Markovian rewards.
  • Figure 3: Confusion matrices and Pearson correlation (COR) between real human preference labels (y-axis) and the preference labels predicted by different PM models. Labels: 1 and 0 denote a preference for the first or second trajectory, respectively, while -1 indicates indecision.
  • Figure 4: Learned preference rewards (yellow) plotted alongside normalized intra-modal attention weights for states (cyan) and actions (purple) and state-action inter-modal attention weights (green) from PrefMMT, on successful and failed trajectories in the AntMaze-large-play-v2 and Window Close tasks. Stars mark the escape goals in AntMaze (the figure supports zooming in for more detailed information).
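
The comparison summarized in Figure 3 can be sketched as follows: count how often each human label co-occurs with each predicted label, then compute a Pearson correlation between the two label sequences. This is a minimal NumPy sketch with made-up labels. The captions do not say how the correlation is computed, so treating the label codes 1, 0, and -1 as numeric values is an assumption, as are the function names.

```python
import numpy as np

# Label codes as given in the Figure 3 caption.
LABELS = (1, 0, -1)  # first preferred, second preferred, indecision


def confusion_matrix(human, predicted, labels=LABELS):
    """Count label pairs; rows are human labels, columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = np.zeros((len(labels), len(labels)), dtype=int)
    for h, p in zip(human, predicted):
        matrix[index[h], index[p]] += 1
    return matrix


def pearson_correlation(human, predicted):
    """Pearson correlation between two label sequences, using the codes as numbers.

    The result is NaN if either sequence is constant.
    """
    return float(np.corrcoef(human, predicted)[0, 1])


# Made-up labels for five trajectory pairs.
human = [1, 0, -1, 1, 0]
predicted = [1, 0, 0, 1, 1]
matrix = confusion_matrix(human, predicted)
cor = pearson_correlation(human, predicted)
```

Entries on the diagonal of `matrix` are pairs where the model agrees with the human annotator; off-diagonal entries show which kinds of disagreement occur.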