Reward Evolution with Graph-of-Thoughts: A Bi-Level Language Model Framework for Reinforcement Learning

Changwei Yao, Xinzi Liu, Chen Li, Marios Savvides

TL;DR

RE-GoT tackles reward design in reinforcement learning by marrying Graph-of-Thoughts (GoT) with multimodal feedback in a bi-level framework that minimizes human supervision. The upper level uses Visual Language Models (VLMs) to evaluate rollout videos and provide visual feedback, while the lower level employs Large Language Models (LLMs) to construct a text-attributed GoT graph and refine the reward function through guided, gradient-free optimization. Evaluations on RoboGen and ManiSkill2 show substantial improvements over prior LLM-based baselines and approach oracle rewards on several tasks, demonstrating robust generalization across platforms. By integrating structured graph-based reasoning with automated visual feedback, RE-GoT offers a scalable approach to autonomous reward evolution that can enhance policy learning in complex robotic manipulation. The work contributes the first integration of GoT into automatic reward generation, a closed-loop bi-level design with VLM feedback, and cross-platform validation across diverse manipulation tasks.

Abstract

Designing effective reward functions remains a major challenge in reinforcement learning (RL), often requiring considerable human expertise and iterative refinement. Recent advances leverage Large Language Models (LLMs) for automated reward design, but these approaches are limited by hallucinations, reliance on human feedback, and challenges with handling complex, multi-step tasks. In this work, we introduce Reward Evolution with Graph-of-Thoughts (RE-GoT), a novel bi-level framework that enhances LLMs with structured graph-based reasoning and integrates Visual Language Models (VLMs) for automated rollout evaluation. RE-GoT first decomposes tasks into text-attributed graphs, enabling comprehensive analysis and reward function generation, and then iteratively refines rewards using visual feedback from VLMs without human intervention. Extensive experiments on 10 RoboGen and 4 ManiSkill2 tasks demonstrate that RE-GoT consistently outperforms existing LLM-based baselines. On RoboGen, our method improves average task success rates by 32.25%, with notable gains on complex multi-step tasks. On ManiSkill2, RE-GoT achieves an average success rate of 93.73% across four diverse manipulation tasks, significantly surpassing prior LLM-based approaches and even exceeding expert-designed rewards. Our results indicate that combining LLMs and VLMs with graph-of-thoughts reasoning provides a scalable and effective solution for autonomous reward evolution in RL.
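The closed loop described in the abstract (decompose the task into a graph, generate a reward, train, let a VLM judge the rollout, refine) can be pictured with a minimal sketch. Every name here (`evolve_reward`, `Feedback`, the four callables, `max_iters`) is a hypothetical placeholder rather than the authors' API, and the LLM, VLM, and RL trainer are passed in as plain functions so the control flow stays visible.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Feedback:
    """Structured verdict from the upper-level (VLM) evaluator. Hypothetical shape."""
    success: bool  # did the evaluator judge the rollout as solving the task?
    notes: str     # free-text critique handed back to the lower level


def evolve_reward(
    task: str,
    build_graph: Callable[[str], str],              # LLM: task -> text-attributed graph
    write_reward: Callable[[str, str], str],        # LLM: (graph, feedback) -> reward code
    train_and_record: Callable[[str], str],         # RL: reward code -> rollout video path
    evaluate_video: Callable[[str, str], Feedback], # VLM: (task, video) -> feedback
    max_iters: int = 3,                             # assumed iteration budget
) -> str:
    """Return the last reward function produced by the refinement loop."""
    graph = build_graph(task)  # decomposition happens once, up front
    notes = ""                 # no visual feedback exists before the first rollout
    reward_code = ""
    for _ in range(max_iters):
        # Lower level: generate or refine the reward from the graph and feedback.
        reward_code = write_reward(graph, notes)
        # Train a policy with this reward and record a rollout.
        video = train_and_record(reward_code)
        # Upper level: automated evaluation, no human in the loop.
        feedback = evaluate_video(task, video)
        if feedback.success:
            break
        notes = feedback.notes
    return reward_code
```

In this sketch the loop stops as soon as the evaluator reports success; the actual stopping rule and iteration budget used by RE-GoT are not stated in this section.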

Paper Structure

This paper contains 25 sections, 4 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Conceptual illustration of GoT. (a) Three general GoT examples, where each node is a thought. (b) GoT applied to a manipulation task named Store Item in Storage: nodes represent sub-goal stages, and edges represent robot behaviors for transitioning between stages, both with detailed textual descriptions.
  • Figure 2: Overview of the RE-GoT framework. The upper level evaluates rollout videos using VLMs to provide visual feedback, while the lower level refines reward functions using LLMs with a graph-based reasoning approach. (a) LLMs are prompted with the environment abstraction, which connects them to the robotics system. (b) LLMs decompose the task into a text-attributed graph. (c) Given the graph structure and visual feedback, LLMs refine the reward function. (d) VLMs analyze the rollout videos to provide structured feedback on the trained RL agent.
  • Figure 3: Evaluation environments. Ten tasks from RoboGen on the left: Open both table doors, Retrieve item from safe, Flush toilet, Close window, Tilt display screen, Close dispenser lid, Turn on Lamp, Load dish into dishwasher, Store item into storage, and Rotate safe knob. Four tasks from ManiSkill2 on the right: PickCube, OpenCabinetDrawer, OpenCabinetDoor, and PushChair.
  • Figure 4: Example of the text-attributed graph for Press the Start Button, where $S_i$ indicates the index of the sub-goal.
  • Figure 5: Success Rate & Average Episode Length vs Exploration Steps on four ManiSkill2 tasks. The solid lines represent the mean, while the shaded areas indicate the standard error of the mean. Oracle means the expert-written reward function provided by the environment.
  • ...and 1 more figures
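
The text-attributed graph described in the Figure 1 and Figure 4 captions (nodes as sub-goal stages, edges as robot behaviors, both carrying textual descriptions) could be represented roughly as follows. This is an illustrative sketch only: the `TaskGraph` class, its method names, and the stage and behavior texts in the example are assumptions made for illustration, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class TaskGraph:
    """Directed graph whose nodes and edges both carry free-text attributes."""
    # stage id -> textual description of the sub-goal stage
    nodes: Dict[str, str] = field(default_factory=dict)
    # (source stage, target stage) -> textual description of the robot behavior
    edges: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def add_stage(self, stage_id: str, text: str) -> None:
        self.nodes[stage_id] = text

    def add_behavior(self, src: str, dst: str, text: str) -> None:
        # Reject edges that refer to stages that were never declared.
        for stage_id in (src, dst):
            if stage_id not in self.nodes:
                raise KeyError(f"unknown stage: {stage_id}")
        self.edges[(src, dst)] = text

    def to_prompt(self) -> str:
        """Serialize the graph as plain text, e.g. for inclusion in an LLM prompt."""
        lines = [f"[{sid}] {text}" for sid, text in self.nodes.items()]
        lines += [f"{s} -> {d}: {text}" for (s, d), text in self.edges.items()]
        return "\n".join(lines)


# Placeholder content loosely modeled on a storage task; not the paper's graph.
graph = TaskGraph()
graph.add_stage("S1", "Gripper is holding the item")
graph.add_stage("S2", "Item is inside the storage unit")
graph.add_behavior("S1", "S2", "Move the item into the storage unit and release it")
print(graph.to_prompt())
```

Keeping stages and behaviors as separate text fields gives a reward-writing step one description per sub-goal and one per transition to work from.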