CLIP-Motion: Learning Reward Functions for Robotic Actions Using Consecutive Observations

Xuzhe Dang, Stefan Edelkamp

TL;DR

The paper tackles reward design for robotic manipulation by shifting focus from state-based to action-based rewards defined through STRIPS-inspired abstract motions. It uses a CLIP-based fine-tuning approach to map consecutive observations to abstract motions, including a convolutional channel expansion to handle two-frame inputs. DDPG-based policy learning shows that internal-state rewards yield the fastest and most stable learning across six Metaworld tasks, while an image-based CLIP-Motion variant provides competitive results with some variance due to prediction errors. The approach promises improved sample efficiency and robustness in RL for robotics, and points to future automation of motion decomposition and leveraging demonstrations to further reduce data requirements.
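The summary above only states that a convolutional channel expansion lets the image encoder accept two frames; it does not say how the expanded weights are initialized. One common approach, shown here purely as an assumption, is to tile the pretrained per-channel weights across both frames and divide by the number of frames, so that a repeated frame produces the same response as the original single-frame layer. The sketch uses a toy 1x1 convolution in plain Python, and the names `expand_input_channels` and `apply_1x1_conv` are hypothetical.

```python
def expand_input_channels(weights, frames=2):
    """Tile per-input-channel weights across `frames` stacked frames.

    `weights` is a list of output rows, each holding one weight per input
    channel (a 1x1 convolution). Dividing by `frames` keeps the response
    unchanged when every stacked frame is identical.
    ASSUMPTION: this initialization is illustrative, not the paper's stated scheme.
    """
    return [[w / frames for w in row] * frames for row in weights]


def apply_1x1_conv(weights, pixel):
    """Apply a 1x1 convolution to a single pixel (a list of channel values)."""
    return [sum(w * x for w, x in zip(row, pixel)) for row in weights]


# Toy "pretrained" layer: 2 output channels over 3 input channels (RGB).
rgb_weights = [[0.5, -0.25, 1.0],
               [2.0, 0.0, -1.0]]

# Expanded layer accepts 6 input channels: two RGB frames stacked channel-wise.
two_frame_weights = expand_input_channels(rgb_weights, frames=2)

frame = [0.5, 0.25, 1.0]
single = apply_1x1_conv(rgb_weights, frame)
stacked = apply_1x1_conv(two_frame_weights, frame + frame)
```

With this initialization the expanded layer starts from the pretrained behaviour on static input, and fine-tuning is then free to learn weights that respond to the difference between the two frames.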

Abstract

This paper presents a novel method for learning reward functions for robotic motions by harnessing the power of a CLIP-based model. Traditional reward function design often hinges on manual feature engineering, which can struggle to generalize across an array of tasks. Our approach circumvents this challenge by capitalizing on CLIP's capability to process both state features and image inputs effectively. Given a pair of consecutive observations, our model excels in identifying the motion executed between them. We showcase results spanning various robotic activities, such as directing a gripper to a designated target and adjusting the position of a cube. Through experimental evaluations, we underline the proficiency of our method in precisely deducing motion and its promise to enhance reinforcement learning training in the realm of robotics.

Paper Structure (16 sections, 8 equations, 3 figures)

Figures (3)

  • Figure 1: The image observations are encoded into observation features by the CLIP Image Encoder, and the text descriptions of abstract motions are encoded into text features by the CLIP Text Encoder. Cosine similarity is computed for each pair of observation and text features, and the action is matched to the abstract motion with the highest similarity.
  • Figure 2: A "pick place" task can be decomposed into an abstract motion sequence: reach puck, grasp puck, move puck to goal, and puck is near the goal. Actions determined to match each abstract motion are assigned corresponding rewards.
  • Figure 3: Success rate in different Metaworld environments.
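
The matching step in Figure 1 and the reward assignment in Figure 2 can be sketched together in a few lines. This is a minimal illustration, not the authors' implementation: the feature vectors stand in for CLIP encoder outputs, and the reward values in `MOTION_REWARDS` are hypothetical placeholders, since the captions do not give the actual numbers.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_motion(observation_feature, text_features):
    """Return the motion label with the highest similarity, plus all scores."""
    scores = {
        label: cosine_similarity(observation_feature, feature)
        for label, feature in text_features.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# ASSUMPTION: toy 4-d vectors standing in for CLIP text features of the
# abstract motions named in Figure 2.
TEXT_FEATURES = {
    "reach puck": [1.0, 0.0, 0.0, 0.0],
    "grasp puck": [0.0, 1.0, 0.0, 0.0],
    "move puck to goal": [0.0, 0.0, 1.0, 0.0],
    "puck is near the goal": [0.0, 0.0, 0.0, 1.0],
}

# ASSUMPTION: hypothetical reward per abstract motion, for illustration only.
MOTION_REWARDS = {
    "reach puck": 1.0,
    "grasp puck": 2.0,
    "move puck to goal": 3.0,
    "puck is near the goal": 4.0,
}


def reward_for(observation_feature):
    """Match the observation feature to a motion and look up its reward."""
    label, _ = match_motion(observation_feature, TEXT_FEATURES)
    return label, MOTION_REWARDS[label]


# A toy observation feature that lies closest to the "grasp puck" direction.
label, reward = reward_for([0.1, 0.9, 0.2, 0.0])
```

Because the reward depends on an argmax over similarities, a misclassified motion yields the wrong reward for that step, which is consistent with the variance the TL;DR attributes to prediction errors in the image-based variant.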