The Dark Side of Rich Rewards: Understanding and Mitigating Noise in VLM Rewards

Sukai Huang, Shu-Wei Liu, Nir Lipovetzky, Trevor Cohn

TL;DR

This work analyzes why vision-language reward models (VLM-RMs) introduce harmful noise for embodied agents, finding that false positives from cosine similarity scoring produce misleading rewards in long-horizon tasks. It formalizes a theoretical link between VLM rewards, HuRL heuristics, and pessimism, showing that false positive (FP) rewards inflate the optimality gap, whereas false negative (FN) rewards can be less harmful. The authors introduce BiMI, a noise-resilient reward that uses a binary signal with a conformal threshold and incorporates mutual information to curb overfitting. BiMI improves learning across multiple challenging environments and synergizes with intrinsic rewards. The results emphasize the importance of handling multimodal reward noise for practical instruction-following agents and provide concrete design tools for reducing FP rewards in VLM-based RL.

Abstract

While Vision-Language Models (VLMs) are increasingly used to generate reward signals for training embodied agents to follow instructions, our research reveals that agents guided by VLM rewards often underperform compared to those employing only intrinsic (exploration-driven) rewards, contradicting expectations set by recent work. We hypothesize that false positive rewards -- instances where unintended trajectories are incorrectly rewarded -- are more detrimental than false negatives. Our analysis confirms this hypothesis, revealing that the widely used cosine similarity metric is prone to false positive reward estimates. To address this, we introduce BiMI (Binary Mutual Information), a novel reward function designed to mitigate noise. BiMI significantly enhances learning efficiency across diverse and challenging embodied navigation environments. Our findings offer a nuanced understanding of how different types of reward noise impact agent learning and highlight the importance of addressing multimodal reward signal noise when training embodied agents.
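
The binary, threshold-gated reward described in the TL;DR can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the calibration on held-out matched trajectory-instruction pairs, and the miscoverage level `alpha` are all assumptions, and the mutual-information term is omitted because this summary does not specify it.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def calibrate_threshold(positive_scores, alpha=0.1):
    """Pick a similarity threshold from held-out matched pairs (assumed setup).

    Follows the usual split-conformal recipe: take the k-th smallest
    calibration score with k = floor(alpha * (n + 1)), so that about
    (1 - alpha) of genuinely matching pairs score at or above the threshold.
    """
    ranked = sorted(positive_scores)
    k = math.floor(alpha * (len(ranked) + 1))
    if k < 1:
        # Too few calibration points for this alpha: accept everything.
        return float("-inf")
    return ranked[k - 1]


def binary_reward(similarity, threshold):
    """Emit 1.0 only when the similarity clears the calibrated threshold."""
    return 1.0 if similarity >= threshold else 0.0


# Hypothetical calibration scores from matched pairs.
calibration = [0.9, 0.8, 0.85, 0.7, 0.95, 0.6, 0.75]
threshold = calibrate_threshold(calibration, alpha=0.25)
reward = binary_reward(cosine_similarity([1.0, 0.2], [0.9, 0.3]), threshold)
```

Under these assumptions, any score below the calibrated threshold yields no reward at all instead of a small positive value.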

Paper Structure (32 sections, 13 theorems, 22 equations, 19 figures, 8 tables, 2 algorithms)

Key Result

Proposition 4.0

The sum of expected time for a series of random walks, each covering the shorter distance of an individual sub-task, is less than the expected time to travel the entire distance $D$ in one long random walk: $\frac{1}{n-1}\mathbb E[T_D] \leq \mathbb E\left[ \sum_{i=1}^{n-1} T_{d_i} \right] < \mathbb
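
As a worked illustration of the proposition, the sketch below assumes a symmetric one-dimensional random walk started at a reflecting origin, for which the expected time to first reach distance d is d^2. That model, the function names, and the sub-task distances are assumptions chosen for illustration; this summary does not state the paper's exact setting.

```python
def expected_hitting_time(d):
    """Expected steps for a symmetric 1D random walk, reflected at 0 and
    started at 0, to first reach distance d (assumed model): d squared."""
    return d * d


def compare_subtasks(sub_distances):
    """Compare one long walk against a sequence of shorter sub-task walks.

    Returns (lower_bound, split_time, whole_time), where the number of
    sub-tasks plays the role of n - 1 in the proposition.
    """
    total_distance = sum(sub_distances)
    whole_time = expected_hitting_time(total_distance)
    split_time = sum(expected_hitting_time(d) for d in sub_distances)
    lower_bound = whole_time / len(sub_distances)
    return lower_bound, split_time, whole_time


# Hypothetical sub-task distances summing to D = 10.
lower_bound, split_time, whole_time = compare_subtasks([3, 5, 2])
print(lower_bound, split_time, whole_time)  # split_time is 38, whole_time is 100
```

With these distances the three shorter walks take 38 expected steps in total, against 100 for the single long walk and a lower bound of 100/3, so the ordering stated in the proposition holds in this toy case.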

Figures (19)

  • Figure 1: Illustration of embodied RL agents using a VLM reward model.
  • Figure 2: Schematic diagram of false positives in embedding space.
  • Figure 2: Model scores across various environments. $\star$ marks the baseline agents, which use a learned VLM-based reward model. BiMI significantly improves performance in Montezuma and Minigrid, while showing mixed results in Crafter.
  • Figure 3: Learned VLM models perform poorly on out-of-distribution (O.O.D.) examples. They incorrectly assign high scores to manipulated pairs, which should score low because the trajectories in those pairs fail the instruction.
  • Figure 4: The false positive vs. false negative oracle model. The false positive model suffers a more severe drop in the final training score.
  • ...and 14 more figures

Theorems & Definitions (22)

  • Proposition 4.0
  • Proposition 4.0
  • Definition 4.1: False Positive Rewards
  • Definition 4.2: False Negative Rewards
  • Proposition 4.2
  • proof
  • Definition 4.3: Bellman-consistent Pessimistic $h$
  • Proposition 4.3
  • Theorem 4.4
  • Lemma D.0
  • ...and 12 more