MAVRL: Learning Reward Functions from Multiple Feedback Types with Amortized Variational Inference

Raphaël Baur, Yannick Metz, Maria Gkoulta, Mennatallah El-Assady, Giorgia Ramponi, Thomas Kleine Buening

TL;DR

This work introduces a scalable amortized variational inference approach that learns a shared reward encoder and feedback-specific likelihood decoders, trained by optimizing a single evidence lower bound. The approach avoids reducing feedback to a common intermediate representation and eliminates the need for manual loss balancing.

Abstract

Reward learning typically relies on a single feedback type or combines multiple feedback types using manually weighted loss terms. It remains unclear how to jointly learn reward functions from heterogeneous feedback types such as demonstrations, comparisons, ratings, and stops, which provide qualitatively different signals. We address this challenge by formulating reward learning from multiple feedback types as Bayesian inference over a shared latent reward function, where each feedback type contributes information through an explicit likelihood. We introduce a scalable amortized variational inference approach that learns a shared reward encoder and feedback-specific likelihood decoders and is trained by optimizing a single evidence lower bound. Our approach avoids reducing feedback to a common intermediate representation and eliminates the need for manual loss balancing. Across discrete and continuous-control benchmarks, we show that jointly inferred reward posteriors outperform single-type baselines, exploit complementary information across feedback types, and yield policies that are more robust to environment perturbations. The inferred reward uncertainty further provides interpretable signals for analyzing model confidence and consistency across feedback types.
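
To make the "single evidence lower bound" concrete, the following is a generic sketch of a multi-likelihood ELBO, not the paper's exact objective. It assumes that the $K$ feedback datasets $\mathcal{D}_1, \dots, \mathcal{D}_K$ are conditionally independent given the latent reward $r$. The symbols $q_\phi$ (shared reward encoder), $p_{\theta_k}$ (decoder for feedback type $k$), and $p(r)$ (prior) are notation introduced here for illustration. Any additional terms the method uses, such as the TD-error constraint mentioned in Figure 1, are omitted.

```latex
\log p(\mathcal{D}_1, \dots, \mathcal{D}_K)
  \;\ge\;
  \sum_{k=1}^{K} \mathbb{E}_{q_\phi(r)}\!\left[ \log p_{\theta_k}(\mathcal{D}_k \mid r) \right]
  \;-\; \mathrm{KL}\!\left( q_\phi(r) \,\|\, p(r) \right)
```

Under this assumption each feedback type enters only through its own log-likelihood term, so the terms are summed without hand-tuned weights, and adding a feedback type adds one term to the sum.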

Paper Structure

This paper contains 47 sections, 15 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Overview of MAVRL. A shared variational reward encoder predicts reward samples from state–action pairs for each feedback modality. A jointly optimized Q-value model estimates optimal values for the same transitions. Both reward samples and Q-values are passed to modality-specific decoders, which maximize their respective likelihoods. A KL-divergence regularizer enforces consistency with a standard normal prior, while a TD-error constraint aligns reward and Q-value estimates through one-step Bellman differences. MAVRL is naturally extensible since incorporating additional feedback types only requires adding a corresponding likelihood decoder.
  • Figure 2: Visualizations of inferred reward functions from $2$ demonstrations, $256$ pairwise comparisons, or $128$ ratings on a $10 \times 10$ grid_trap environment. The final column shows the result obtained when combining all feedback modalities.
  • Figure 3: Normalized mean returns ($n=10$) of policies trained on rewards inferred by each method under three dynamics perturbation scenarios. Reward models and baselines are trained in the unperturbed setting and remain fixed throughout the variations. Error bars denote standard error. (a) Increasing environmental stochasticity in grid_cliff. (b) Increasing ratio between pendulum handle lengths in Acrobot-v1. (c) Increasing gravity and wind-power in LunarLander-v3.
  • Figure 4: Normalized mean returns ($n=10$) of policies trained on rewards inferred by each method under three dynamics perturbation scenarios. Reward models and baselines are trained in the unperturbed setting and remain fixed throughout the variations. Returns are normalized such that $100.0$ corresponds to the performance of an approximately optimal policy trained on the ground-truth reward in the unperturbed setting and, analogously, $0.0$ corresponds to the performance of a uniformly random policy. Error bars denote standard error. (a) and (b) Increasing environmental stochasticity in grid_cliff and grid_trap. (c) Increasing gravity and wind-power in LunarLander-v3. (d) Increasing ratio between pendulum handle lengths in Acrobot-v1.
  • Figure 5: Visualizations of inferred reward functions from $2$ demonstrations, $256$ pairwise comparisons, or $128$ ratings on a $10 \times 10$ grid_sparse environment. (a) Joint visual encoding of mean and variance. (b) Separate visual encoding of mean and variance for the same data. The rightmost column shows the result obtained when combining all feedback modalities.
  • ...and 2 more figures
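
The Figure 1 caption names two regularizers: a KL divergence to a standard normal prior and a one-step TD-error constraint linking rewards and Q-values. The NumPy sketch below illustrates both under assumptions that do not come from the paper: the encoder outputs a diagonal Gaussian (mean and log-variance) per transition, and the Bellman target uses a hard max over next-state actions. All function and variable names are hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Per-transition KL( N(mu, exp(log_var)) || N(0, 1) ) in closed form."""
    return 0.5 * (np.exp(log_var) + mu**2 - 1.0 - log_var)

def sample_rewards(mu, log_var, rng):
    """Reparameterized reward sample: r = mu + sigma * eps, eps ~ N(0, 1)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def td_residual(reward, q_sa, q_next, gamma=0.99, done=None):
    """One-step Bellman difference r + gamma * max_a' Q(s', a') - Q(s, a).

    Assumption: hard max over next actions; terminal transitions (done = 1)
    drop the bootstrap term.
    """
    done = np.zeros_like(reward) if done is None else done
    target = reward + gamma * (1.0 - done) * q_next.max(axis=-1)
    return target - q_sa

# Toy batch of three transitions (values are made up for illustration).
rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0, -0.5])        # encoder means
log_var = np.array([0.0, 0.0, -1.0])   # encoder log-variances
q_sa = np.array([1.0, 2.0, 0.0])       # Q(s, a) for the taken actions
q_next = np.array([[0.5, 1.0], [2.0, 1.5], [0.0, 0.0]])  # Q(s', .)

kl = kl_to_standard_normal(mu, log_var)
td = td_residual(mu, q_sa, q_next, gamma=0.9)  # uses the mean reward
r_sample = sample_rewards(mu, log_var, rng)
```

In a training loop, the squared TD residual and the KL term would be added to the decoder log-likelihoods. How the paper weights or constrains these terms is not specified in this excerpt.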