Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts

Yanting Yang, Minghao Chen, Qibo Qiu, Jiahao Wu, Wenxiao Wang, Binbin Lin, Ziyu Guan, Xiaofei He

TL;DR

This paper tackles the challenge of learning a generalizable language-conditioned reward function for robots when robot data is scarce. It proposes Adapt2Reward, a framework that adapts a video-language model with learnable failure prompts and cross-domain contrastive learning to align human and robot video-language representations. Failure videos are clustered into distinct failure modes, and each mode is represented by its own prompt. The model is trained with a failure-prompt pool, domain-specific prompts, and a modified video-language contrastive objective, and the learned reward is used for planning via Visual Model Predictive Control (VMPC). Experiments in MetaWorld and Concept2Robot environments show improved generalization to new environments and tasks, robustness to viewpoint changes, and an advantage for failure prompts over a binary cross-entropy (BCE) loss when training on failure data. Overall, Adapt2Reward learns a robust language-conditioned reward from limited robot demonstrations by combining rich human data with structured failure information.
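To make the planning step concrete, here is a minimal sketch of how a learned reward could drive model predictive control by random shooting. It is not the paper's VMPC implementation: `plan_random_shooting`, `toy_predict`, and `toy_reward` are hypothetical names, the dynamics are a 2D point mass rather than a video prediction model, and the reward is a distance to a goal rather than a video-language score.

```python
import numpy as np

def plan_random_shooting(state, predict, reward_fn, horizon=5,
                         n_candidates=256, action_dim=2, seed=0):
    """Sample action sequences, score each predicted outcome, return the best."""
    rng = np.random.default_rng(seed)
    # Candidate action sequences, shape (n_candidates, horizon, action_dim).
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    # Score every candidate by the reward of its predicted trajectory.
    scores = np.array([reward_fn(predict(state, actions)) for actions in candidates])
    best = int(scores.argmax())
    return candidates[best], float(scores[best])

# Hypothetical stand-in for a learned prediction model: a point that moves
# by 0.1 * action at every step. Returns states of shape (horizon + 1, 2).
def toy_predict(state, actions):
    return np.vstack([state, state + np.cumsum(0.1 * actions, axis=0)])

# Hypothetical stand-in for the learned reward: negative distance between
# the final predicted state and a fixed goal.
goal = np.array([0.3, -0.2])

def toy_reward(trajectory):
    return -float(np.linalg.norm(trajectory[-1] - goal))

start = np.zeros(2)
best_actions, best_score = plan_random_shooting(start, toy_predict, toy_reward)
```

In a receding-horizon controller, typically only the first action of `best_actions` would be executed before replanning from the new observation. A language-conditioned reward would take the place of `toy_reward` and score predicted frames against the instruction.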

Abstract

For a general-purpose robot to operate in reality, executing a broad range of instructions across various environments is imperative. Central to the reinforcement learning and planning for such robotic agents is a generalizable reward function. Recent advances in vision-language models, such as CLIP, have shown remarkable performance in the domain of deep learning, paving the way for open-domain visual recognition. However, collecting data on robots executing various language instructions across multiple environments remains a challenge. This paper aims to transfer video-language models with robust generalization into a generalizable language-conditioned reward function, utilizing only robot video data from a minimal number of tasks in a single environment. Unlike common robotic datasets used for training reward functions, human video-language datasets rarely contain trivial failure videos. To enhance the model's ability to distinguish between successful and failed robot executions, we cluster failure video features so that the model can identify patterns within them. For each cluster, we integrate a newly trained failure prompt into the text encoder to represent the corresponding failure mode. Our language-conditioned reward function shows outstanding generalization to new environments and new instructions for robot planning and reinforcement learning.
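The clustering step described in the abstract can be illustrated with a short sketch. The abstract only says that failure video features are clustered and that each cluster gets its own failure prompt. The use of plain k-means, the toy 2D features, the prompt dimension, and the names `kmeans` and `failure_prompt_pool` are all assumptions made for illustration.

```python
import numpy as np

def kmeans(features, k, n_iters=50, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct samples.
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    labels = np.zeros(len(features), dtype=int)
    for _ in range(n_iters):
        # Squared Euclidean distance from every feature to every centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Toy stand-in for failure-video features: three noisy blobs in 2D.
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0], [-5.0, 0.0], [0.0, 5.0]])
failure_features = np.concatenate(
    [c + 0.1 * rng.standard_normal((20, 2)) for c in centers]
)

K = 3  # assumed number of failure modes
centroids, prompt_ids = kmeans(failure_features, K)

# One prompt vector per failure mode. In a real model these would be
# learnable parameters fed to the text encoder; here they are random.
prompt_dim = 8
failure_prompt_pool = 0.02 * rng.standard_normal((K, prompt_dim))
prompts_for_videos = failure_prompt_pool[prompt_ids]  # one prompt per video
```

Each failure video is mapped to the prompt of its cluster, so videos that fail in a similar way share a text-side representation of that failure mode.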

Paper Structure

This paper contains 16 sections, 10 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Human video-language datasets typically lack failure videos. As a result, models trained on them are effective at categorizing tasks but less effective at distinguishing between successful and unsuccessful task executions.
  • Figure 2: Adapt2Reward Architecture. We propose Adapt2Reward which incorporates learnable failure prompts into the model's architecture. Our approach starts with clustering failure videos to discern specific patterns. Each identified cluster is then associated with a unique failure prompt. Additionally, we employ cross-domain contrastive learning and domain-specific prompt learning to align text and video representations between human and robot domains.
  • Figure 3: (a) MetaWorld environments (left) consist of the original training environment and three test environments with color, viewpoint, and object arrangement modifications. Concept2Robot environments (right) include the training environment and the testing environments that change the viewpoint or include distractor objects. (b) Comparison of Random Policy, VMPC with Concept2Robot, DVD, LIV, and Adapt2Reward in MetaWorld environments. The depicted bars represent the mean success rate across 4 target tasks computed over 3 seeds of 100 trials.
  • Figure 4: Ablation study. (a) Different training methods for failure data. (b) Varying K. (c) Training with different sources of failure data. (d) The distribution differences in rewards obtained by different reward methods for an unseen task, with scattered points representing normalized reward values for different trajectories.
  • Figure 5: Task Generalization in C2R-Envs. We report the success rates of reinforcement learning on 68 tasks with manually crafted rewards, Concept2Robot, and Adapt2Reward. The majority of policies obtained with Adapt2Reward match or exceed those trained with manually crafted rewards. Compared with Concept2Robot, Adapt2Reward performs better on most tasks.
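As background for the contrastive alignment mentioned in the Figure 2 caption, the following sketch shows a standard symmetric video-text contrastive (InfoNCE) loss. It is the common baseline form, not the modified objective the paper trains with. The temperature value, the function names, and the use of cosine similarity as a reward score are assumptions for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(video_emb, text_emb, temperature=0.07):
    """Row i of video_emb and row i of text_emb form the positive pair."""
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature  # pairwise cosine similarities, scaled
    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # video -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)

def similarity_reward(video_emb, text_emb):
    """Hypothetical reward: cosine similarity of one video and one instruction."""
    return float(l2_normalize(video_emb) @ l2_normalize(text_emb))

# Perfectly aligned pairs versus deliberately mismatched pairs.
emb = np.eye(4)
aligned_loss = symmetric_info_nce(emb, emb)
mismatched_loss = symmetric_info_nce(emb, np.roll(emb, 1, axis=0))
```

The loss is small when each video embedding is closest to its own text embedding and large when the pairs are shuffled, which is the property a similarity-based reward relies on.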