Language-Model-Assisted Bi-Level Programming for Reward Learning from Internet Videos

Harsh Mahesheka, Zhixian Xie, Zhaoran Wang, Wanxin Jin

TL;DR

This work tackles the data-collection bottleneck in learning from demonstrations by learning rewards directly from internet videos without task-specific preprocessing. It introduces a gradient-free, bi-level framework in which a vision-language model (VLM) provides high-level visual feedback and a large language model (LLM) translates that feedback into executable reward updates, enabling reinforcement learning agents to imitate complex biological motions. The approach is validated on multiple robots using YouTube demonstrations, where it outperforms the Eureka baseline and produces results similar to a variant in which a human takes the place of the VLM. The results suggest that reward design from in-the-wild videos can scale, with the potential to extend visual inverse reinforcement learning (IRL) to a wide range of skills and agents.

Abstract

Learning from demonstrations, particularly from biological experts such as humans and animals, often encounters significant data acquisition challenges. While recent approaches leverage internet videos for learning, they require complex, task-specific pipelines to extract and retarget motion data for the agent. In this work, we introduce a language-model-assisted bi-level programming framework that enables a reinforcement learning agent to learn its reward directly from internet videos, bypassing dedicated data preparation. The framework includes two levels: an upper level where a vision-language model (VLM) provides feedback by comparing the learner's behavior with expert videos, and a lower level where a large language model (LLM) translates this feedback into reward updates. The VLM and LLM collaborate within this bi-level framework, using a "chain rule" approach to derive a valid search direction for reward learning. We validate the method for reward learning from YouTube videos, and the results show that the proposed method enables efficient reward design from expert videos of biological agents for complex behavior synthesis.
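The two-level loop described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names (learn_reward, train_policy, render_rollout, vlm_feedback, llm_update_reward), the Iteration record, and the toy stand-ins are all hypothetical, and in practice the model calls and the reinforcement learning run would be real components supplied by the user.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Iteration:
    """One outer-loop step: the reward code used and the feedback it received."""
    reward_code: str
    feedback: str


def learn_reward(
    expert_video: Any,
    initial_reward_code: str,
    train_policy: Callable[[str], Any],       # RL training under a given reward (assumed)
    render_rollout: Callable[[Any], Any],     # records the learner's behavior (assumed)
    vlm_feedback: Callable[[Any, Any], str],  # upper level: compares the two videos (assumed)
    llm_update_reward: Callable[[str, str], str],  # lower level: edits reward code (assumed)
    num_iterations: int = 3,
) -> Tuple[str, List[Iteration]]:
    """Alternate between VLM feedback and LLM reward updates; no gradients are used."""
    reward_code = initial_reward_code
    history: List[Iteration] = []
    for _ in range(num_iterations):
        policy = train_policy(reward_code)
        rollout_video = render_rollout(policy)
        # Upper level: textual feedback from comparing expert and learner videos.
        feedback = vlm_feedback(expert_video, rollout_video)
        history.append(Iteration(reward_code, feedback))
        # Lower level: translate the feedback into a revised reward function.
        reward_code = llm_update_reward(reward_code, feedback)
    return reward_code, history


# Toy stand-ins so the sketch runs without any model or simulator.
def toy_train_policy(reward_code: str) -> dict:
    return {"trained_on": reward_code}


def toy_render_rollout(policy: dict) -> str:
    return "rollout of policy trained on: " + policy["trained_on"]


def toy_vlm_feedback(expert_video: str, rollout_video: str) -> str:
    return "lift the knees higher"


def toy_llm_update_reward(reward_code: str, feedback: str) -> str:
    return reward_code + "\n# revised for: " + feedback


final_code, history = learn_reward(
    expert_video="expert_clip.mp4",  # placeholder file name
    initial_reward_code="def reward(state): return 0.0",
    train_policy=toy_train_policy,
    render_rollout=toy_render_rollout,
    vlm_feedback=toy_vlm_feedback,
    llm_update_reward=toy_llm_update_reward,
    num_iterations=2,
)
```

Passing the four components in as callables keeps the loop independent of any particular model API or simulator, which is a design choice of this sketch rather than something stated in the paper.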

Paper Structure

This paper contains 29 sections, 11 equations, 8 figures.

Figures (8)

  • Figure 1: Overview of the language-model-assisted bi-level framework for reward learning from videos. The upper-level VLM generates visual feedback by comparing the expert video with a video recording of the current robot behavior. The lower-level LLM uses the feedback to update the reward code.
  • Figure 2: One example of the upper-level visual feedback. The robot behavior recording and the biological (expert) motion video are provided to the VLM, which then generates textual instructions for improving the robot's behavior.
  • Figure 3: One example of a reward code update by the lower-level LLM, which takes as input the language instructions from the VLM, with the environment code as context.
  • Figure 4: Screenshots from the YouTube videos used for reward learning.
  • Figure 5: (Left) Our approach consistently outperforms Eureka and produces results similar to Human as VLM, showcasing the ability of the VLM to guide reward search in learning from biological traits. (Right) Human preferences further support the claim that the method can mimic complex traits from biological videos on par with a human expert.
  • ...and 3 more figures