VIRAL: Vision-grounded Integration for Reward design And Learning

Valentin Cuzin-Rambaud, Emilien Komlenovic, Alexandre Faure, Bruno Yun

TL;DR

VIRAL tackles reward shaping in reinforcement learning with a vision-grounded, multi-modal LLM pipeline that automatically designs and refines reward functions from simple prompts and annotated images. The approach combines a critic-coder LLM collaboration, step-back prompting, and open-source LVLMs to generate initial reward functions, which are then used to train policies with DQN or PPO. A refinement loop feeds training statistics, together with Video-LVLM or human feedback, back into the pipeline to iteratively improve the rewards. Validated across five Gymnasium environments, the empirical results show faster learning and better alignment with user intent, with notable gains in CartPole and Highway, while LunarLander benefits when feedback is used. The work highlights practical benefits for accessible, adaptable reward design and suggests future directions in policy adaptation for broader task generalization.

Abstract

The alignment between humans and machines is a critical challenge in artificial intelligence today. Reinforcement learning, which aims to maximize a reward function, is particularly vulnerable to the risks associated with poorly designed reward functions. Recent advancements have shown that Large Language Models (LLMs) used for reward generation can exceed human performance in this context. We introduce VIRAL, a pipeline for generating and refining reward functions through the use of multi-modal LLMs. VIRAL autonomously creates and interactively improves reward functions based on a given environment and a goal prompt or annotated image. The refinement process can incorporate human feedback or be guided by a description generated by a video LLM, which explains the agent's policy in video form. We evaluated VIRAL in five Gymnasium environments, demonstrating that it accelerates the learning of new behaviors while ensuring improved alignment with user intent. The source code and demo video are available at: https://github.com/VIRAL-UCBL1/VIRAL and https://youtu.be/Hqo82CxVT38.
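
As a rough illustration of the generate-then-refine idea described in the abstract, the following minimal Python sketch shows one way such a loop could be organized. It is not the authors' implementation: the function `refine_loop`, the `Candidate` class, and the three callables (`generate`, `train_and_evaluate`, `get_feedback`) are hypothetical names introduced here. The LLM call, the policy training step, and the feedback source (human or video LLM) are all abstracted away as plain callables, and the way feedback is appended to the prompt is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical types: an observation is a sequence of floats, and a reward
# function maps (observation, terminated) to a scalar reward.
Observation = Sequence[float]
RewardFn = Callable[[Observation, bool], float]


@dataclass
class Candidate:
    """A generated reward function and the success rate it achieved."""
    reward_fn: RewardFn
    success_rate: float = 0.0


def refine_loop(
    generate: Callable[[str], RewardFn],
    train_and_evaluate: Callable[[RewardFn], float],
    get_feedback: Callable[[Candidate], str],
    goal_prompt: str,
    iterations: int = 3,
) -> Candidate:
    """Generate a reward function, then refine it, keeping the best candidate.

    generate:            stands in for the LLM that writes a reward function.
    train_and_evaluate:  stands in for policy training plus evaluation;
                         returns a success rate.
    get_feedback:        stands in for human or video-LLM feedback.
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")

    prompt = goal_prompt
    best: Optional[Candidate] = None
    for _ in range(iterations):
        candidate = Candidate(generate(prompt))
        candidate.success_rate = train_and_evaluate(candidate.reward_fn)
        if best is None or candidate.success_rate > best.success_rate:
            best = candidate
        # Assumed prompt format: the goal plus statistics and feedback
        # from the latest attempt.
        feedback = get_feedback(candidate)
        prompt = (
            f"{goal_prompt}\n"
            f"Previous success rate: {candidate.success_rate:.2f}\n"
            f"Feedback: {feedback}"
        )
    return best
```

In this sketch the best candidate is selected by success rate alone. A real pipeline could just as well rank candidates by other training statistics, and that choice is left open here.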

Paper Structure

This paper contains 12 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: The VIRAL pipeline. Given an input (a textual environment description, an optional success function, and a goal prompt), the system generates a set of reward functions and iteratively refines them.
  • Figure 2: Semantic alignment over the 10 videos of each annotator for different Gymnasium environments and modalities.