Update-Free On-Policy Steering via Verifiers

Maria Attarian, Ian Vyse, Claas Voelcker, Jasper Gerigk, Evgenii Opryshko, Anas Almasri, Sumeet Singh, Yilun Du, Igor Gilitschenski

TL;DR

UF-OPS is an Update-Free On-Policy Steering method that lets a robot predict the success likelihood of its actions and adapt its strategy at execution time. It improves the performance of a black-box diffusion policy without changing the base policy's parameters, which keeps it lightweight and flexible.

Abstract

In recent years, Behavior Cloning (BC) has become one of the most prevalent methods for enabling robots to mimic human demonstrations. However, despite their successes, BC policies are often brittle and struggle with precise manipulation. To overcome these issues, we propose UF-OPS, an Update-Free On-Policy Steering method that enables the robot to predict the success likelihood of its actions and adapt its strategy at execution time. We accomplish this by training verifier functions on policy rollout data obtained during an initial evaluation of the policy. These verifiers are subsequently used to steer the base policy toward actions with a higher likelihood of success. Our method improves the performance of a black-box diffusion policy without changing the base parameters, making it lightweight and flexible. We present results from both simulation and real-world data and achieve an average 49% improvement in success rate over the base policy across 5 real tasks.
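
To make the verifier-training step concrete, here is a minimal sketch of how a success verifier could be fit to evaluation rollouts. It rests on assumptions that the abstract does not confirm: every transition inherits the success or failure label of its rollout, the verifier is a plain logistic regression over concatenated state and action features, and the names `label_transitions` and `train_verifier` are hypothetical. The verifier used in the paper may differ in architecture, labels, and loss.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def label_transitions(rollouts):
    """Flatten rollouts into (features, label) pairs.

    Assumption: each (state, action) pair inherits its rollout's outcome.
    """
    data = []
    for transitions, success in rollouts:
        for state, action in transitions:
            data.append((list(state) + list(action), 1.0 if success else 0.0))
    return data


def train_verifier(data, epochs=50, lr=0.1, seed=0):
    """Fit a logistic-regression verifier with plain SGD (illustrative only)."""
    rng = random.Random(seed)
    data = list(data)  # copy so the caller's list is not shuffled in place
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g

    def verifier(state, action):
        x = list(state) + list(action)
        return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

    return verifier


# Synthetic rollouts, made up for illustration: heading up (+1) succeeds,
# heading down (-1) fails. No real robot data is involved.
rng = random.Random(1)
rollouts = []
for i in range(40):
    direction = 1.0 if i % 2 == 0 else -1.0
    transitions = [((rng.random(),), (direction + rng.gauss(0.0, 0.2),))
                   for _ in range(5)]
    rollouts.append((transitions, direction > 0))

verifier = train_verifier(label_transitions(rollouts))
score_up = verifier((0.5,), (1.0,))
score_down = verifier((0.5,), (-1.0,))
```

In this toy setting the verifier should score the upward action well above the downward one. Only rollout data and a small auxiliary model are involved, so the base policy's parameters are never touched.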

Paper Structure

This paper contains 26 sections, 9 equations, 6 figures, 2 tables, 2 algorithms.

Figures (6)

  • Figure 1: Our method relies on a policy's own evaluation data to improve its performance. Training small verifiers and subsequently using them for inference-time steering improves policy performance without costly data collection or resource-intensive fine-tuning.
  • Figure 2: Method overview. Given a base policy trained on a dataset of expert demonstrations, policy evaluation provides successful and failed rollouts. These are used to train a verifier function that scores a transition $(s, a)$ in terms of its success likelihood. Finally, the verifier function is combined with a steering strategy to improve policy performance.
  • Figure 3: Real tasks on the Aloha bimanual system [zhao2023learning]. From left to right: a) pick up the block and place it on the cardboard, b) pick up the ball and place it in the bowl, c) pick up the hammer with the right hand, hand it over to the left hand, and drop it in the box, d) pick up the pen with the right hand and the cap with the left, and place the cap on the pen, and e) pick up the green cup and stack it on the purple one.
  • Figure 4: 2D Toy Example. The expert demonstrations follow S-curves that go through the two doors. The unguided baseline frequently fails at the narrow door. Our guided method therefore redirects traffic toward the wide door.
  • Figure 5: 2D Toy Example. The classifier correctly identifies the wide door (purple region) as the safer option.
  • ...and 1 more figures
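
The captions for Figures 2, 4, and 5 describe a verifier combined with a steering strategy. The sketch below illustrates one simple such strategy, best-of-N selection: sample several candidate actions from the black-box policy and execute the one the verifier scores highest. This is an assumption made for illustration and may not be the steering strategy used in the paper. The two-door setup, `toy_policy`, and `toy_verifier` are hypothetical stand-ins, not the authors' environment or models.

```python
import math
import random


def steer_best_of_n(sample_action, verifier, state, n_candidates=16):
    """Sample candidates from a black-box policy and keep the best-scored one."""
    candidates = [sample_action(state) for _ in range(n_candidates)]
    scores = [verifier(state, a) for a in candidates]
    best = max(range(n_candidates), key=scores.__getitem__)
    return candidates[best], scores


# Hypothetical toy setup: an action is the height at which the agent crosses
# the wall, and the unguided policy is bimodal over the two doors.
rng = random.Random(0)
WIDE_DOOR, NARROW_DOOR = 1.0, -1.0


def toy_policy(state):
    # Stand-in for a diffusion policy sampler; it is only ever sampled from.
    center = rng.choice([WIDE_DOOR, NARROW_DOOR])
    return center + rng.gauss(0.0, 0.1)


def toy_verifier(state, action):
    # Stand-in success score in (0, 1) that is higher near the wide door.
    return 1.0 / (1.0 + math.exp(-4.0 * action))


action, scores = steer_best_of_n(toy_policy, toy_verifier, state=0.0,
                                 n_candidates=16)
```

Because selection only requires sampling from the policy and scoring the samples, the policy can stay a black box. The cost is `n_candidates` policy samples and verifier evaluations per decision step.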