MAVRL: Learning Reward Functions from Multiple Feedback Types with Amortized Variational Inference

Raphaël Baur; Yannick Metz; Maria Gkoulta; Mennatallah El-Assady; Giorgia Ramponi; Thomas Kleine Buening

TL;DR

This work introduces a scalable amortized variational inference approach that learns a shared reward encoder and feedback-specific likelihood decoders, trained by optimizing a single evidence lower bound. The approach avoids reducing feedback to a common intermediate representation and eliminates the need for manual loss balancing.

Abstract

Reward learning typically relies on a single feedback type or combines multiple feedback types using manually weighted loss terms. Currently, it remains unclear how to jointly learn reward functions from heterogeneous feedback types such as demonstrations, comparisons, ratings, and stops that provide qualitatively different signals. We address this challenge by formulating reward learning from multiple feedback types as Bayesian inference over a shared latent reward function, where each feedback type contributes information through an explicit likelihood. We introduce a scalable amortized variational inference approach that learns a shared reward encoder and feedback-specific likelihood decoders and is trained by optimizing a single evidence lower bound. Our approach avoids reducing feedback to a common intermediate representation and eliminates the need for manual loss balancing. Across discrete and continuous-control benchmarks, we show that jointly inferred reward posteriors outperform single-type baselines, exploit complementary information across feedback types, and yield policies that are more robust to environment perturbations. The inferred reward uncertainty further provides interpretable signals for analyzing model confidence and consistency across feedback types.
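
As a rough illustration of what optimizing a single evidence lower bound over several feedback types can look like, the following minimal sketch sums feedback-specific log-likelihoods under samples from a Gaussian reward posterior and subtracts a KL term to a standard normal prior. The Bradley-Terry preference likelihood, the Gaussian rating likelihood, the per-index reward parameterization, and all function names are assumptions made for illustration; they are not taken from the paper's decoders or architecture.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, exp(log_var)) || N(0, 1)), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def sample_rewards(mu, log_var, rng):
    """Reparameterized sample r = mu + sigma * eps, one value per state-action index."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_log_lik(r, comparisons):
    """Hypothetical Bradley-Terry decoder.

    Each comparison is (preferred_indices, other_indices); a segment's score
    is the sum of sampled rewards over its indices.
    """
    total = 0.0
    for preferred, other in comparisons:
        diff = sum(r[i] for i in preferred) - sum(r[i] for i in other)
        total += log_sigmoid(diff)
    return total


def rating_log_lik(r, ratings, noise_std=1.0):
    """Hypothetical Gaussian decoder: each rating is (indices, observed_score)."""
    total = 0.0
    for indices, score in ratings:
        ret = sum(r[i] for i in indices)
        total += -0.5 * ((score - ret) / noise_std) ** 2
        total -= math.log(noise_std * math.sqrt(2.0 * math.pi))
    return total


def elbo(mu, log_var, comparisons, ratings, n_samples=64, seed=0):
    """Monte Carlo estimate of E_q[sum of log-likelihoods] minus KL(q || prior)."""
    rng = random.Random(seed)
    expected_ll = 0.0
    for _ in range(n_samples):
        r = sample_rewards(mu, log_var, rng)
        # Every feedback type contributes a log-likelihood to the same objective.
        expected_ll += preference_log_lik(r, comparisons)
        expected_ll += rating_log_lik(r, ratings)
    return expected_ll / n_samples - kl_to_standard_normal(mu, log_var)
```

Because each feedback type enters as a log-likelihood term in one objective, no per-type loss weights appear in `elbo`; in this sketch, supporting another feedback type would mean adding one more log-likelihood function.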

Paper Structure (47 sections, 15 equations, 7 figures, 8 tables)

Introduction
Contributions
Related Work
Reward Learning.
Approximate Inference for Bayesian IRL.
Multi-Type Feedback.
Preliminaries
Bayesian Learning from Multi-Type Feedback.
Amortized Variational Inference.
Feedback-Specific Likelihood Models
Preferences.
Demonstrations.
Ratings.
Stops.
Additional Feedback Types.
...and 32 more sections

Figures (7)

Figure 1: Overview of MAVRL. A shared variational reward encoder predicts reward samples from state–action pairs for each feedback modality. A jointly optimized Q-value model estimates optimal values for the same transitions. Both reward samples and Q-values are passed to modality-specific decoders, which maximize their respective likelihoods. A KL-divergence regularizer enforces consistency with a standard normal prior, while a TD-error constraint aligns reward and Q-value estimates through one-step Bellman differences. MAVRL is naturally extensible since incorporating additional feedback types only requires adding a corresponding likelihood decoder.
Figure 2: Visualizations of inferred reward functions from $2$ demonstrations, $256$ pairwise comparisons, or $128$ ratings on a $10 \times 10$ grid_trap environment. The final column shows the result obtained when combining all feedback modalities.
Figure 3: Normalized mean returns ($n=10$) of policies trained on rewards inferred by each method under three dynamics perturbation scenarios. Reward models and baselines are trained in the unperturbed setting and remain fixed throughout the variations. Error bars denote standard error. (a) Increasing environmental stochasticity in grid_cliff. (b) Increasing ratio between pendulum handle lengths in Acrobot-v1. (c) Increasing gravity and wind-power in LunarLander-v3.
Figure 4: Normalized mean returns ($n=10$) of policies trained on rewards inferred by each method under three dynamics perturbation scenarios. Reward models and baselines are trained in the unperturbed setting and remain fixed throughout the variations. Returns are normalized such that $100.0$ corresponds to the performance of an approximately optimal policy trained on the ground-truth reward in the unperturbed setting and, analogously, $0.0$ corresponds to the performance of a uniformly random policy. Error bars denote standard error. (a) and (b) Increasing environmental stochasticity in grid_cliff and grid_trap. (c) Increasing gravity and wind-power in LunarLander-v3. (d) Increasing ratio between pendulum handle lengths in Acrobot-v1.
Figure 5: grid_sparse: Visualizations of inferred reward functions from $2$ demonstrations, $256$ pairwise comparisons, or $128$ ratings on a $10 \times 10$ grid_sparse environment. The final column shows the result obtained when combining all feedback modalities. (a) Joint visual encoding of mean and variance. (b) Separate visual encoding of mean and variance for the same data.
...and 2 more figures
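
The return normalization described in the Figure 4 caption (100 for an approximately optimal policy on the ground-truth reward, 0 for a uniformly random policy) can be written as an affine rescaling. The sketch below is an illustration under that reading of the caption; the function names and the use of the sample standard deviation for the standard error are assumptions, not details confirmed by the paper.

```python
import math


def normalize_return(raw, random_return, optimal_return):
    """Map a raw return to a scale where random = 0.0 and optimal = 100.0."""
    if optimal_return == random_return:
        raise ValueError("optimal and random reference returns must differ")
    return 100.0 * (raw - random_return) / (optimal_return - random_return)


def mean_and_standard_error(values):
    """Mean and standard error of the mean, using the sample variance (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate a standard error")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)
```

Since the rescaling is affine, normalizing each run and then averaging gives the same mean as averaging raw returns first and normalizing afterwards.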
