CLIP feature-based randomized control using images and text for multiple tasks and robots

Kazuki Shibata, Hideki Deguchi, Shun Taguchi

TL;DR

This work addresses the high cost of learning control policies for new tasks and robots by introducing a CLIP feature-based randomized control framework that operates without policy learning. By measuring the similarity between image-motion cues and text-based task descriptions using CLIP, and alternating stochastic and gradient-informed movements, the approach generalizes across multiple tasks and robots. Fine-tuning CLIP with multitask data further improves performance, as demonstrated in both multitask robot-arm simulations and real-world experiments with a two-wheeled robot and a robot arm. The results show notable gains over reinforcement learning baselines and highlight the importance of task-specific CLIP adaptation for robust text-driven, vision-language control with limited data.

Abstract

This study presents a control framework that leverages vision-language models (VLMs) for multiple tasks and robots. Existing VLM-based control methods achieve high performance across various tasks and robots within their training environment. However, they incur high costs when control policies must be learned for tasks and robots outside that environment. For industrial and household robots, learning in each novel environment where a robot is introduced is challenging. To address this issue, we propose a control framework that does not require learning control policies. Our framework combines the vision-language model CLIP with randomized control. CLIP computes the similarity between images and texts by embedding them in a shared feature space. This study employs CLIP to compute the similarity between camera images and text representing the target state. In our method, the robot is controlled by a randomized controller that simultaneously explores and follows the gradient of increasing similarity. Moreover, we fine-tune CLIP to improve the performance of the proposed method. We confirm the effectiveness of our approach through a multitask simulation and real-robot experiments using a two-wheeled robot and a robot arm.
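
The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: `toy_similarity` is a hypothetical stand-in for the CLIP image-text score (a real system would embed the camera image and the instruction text and compare the embeddings, as `cosine_similarity` does for plain vectors), and the state, step size, and update rule are assumptions chosen only to show how a stochastic move can be combined with a similarity-guided move.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_similarity(state, target=(1.0, 2.0)):
    """Hypothetical stand-in for a CLIP image-text similarity score.

    Assumption: the score grows as the 2-D state approaches a target
    position. No image or text is involved in this toy version.
    """
    return -math.dist(state, target)


def randomized_control(similarity, state, steps=300, step_size=0.1, seed=0):
    """Alternate a random move with a move guided by the similarity change."""
    rng = random.Random(seed)
    state = list(state)
    for _ in range(steps):
        # Stochastic move: step in a uniformly random direction.
        angle = rng.uniform(0.0, 2.0 * math.pi)
        direction = (math.cos(angle), math.sin(angle))
        before = similarity(state)
        state = [s + step_size * d for s, d in zip(state, direction)]
        gain = similarity(state) - before
        # Similarity-guided move: continue if the score rose, otherwise undo.
        sign = 1.0 if gain > 0 else -1.0
        state = [s + sign * step_size * d for s, d in zip(state, direction)]
    return state


if __name__ == "__main__":
    final = randomized_control(toy_similarity, [0.0, 0.0])
    print("final state:", final, "similarity:", toy_similarity(final))
```

The sign of the observed similarity change acts as a crude one-sample estimate of the gradient direction, so no policy is trained. In a real setup the only component to swap would be the `similarity` callable.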

Paper Structure (17 sections, 11 equations, 5 figures, 7 tables, 1 algorithm)

This paper contains 17 sections, 11 equations, 5 figures, 7 tables, 1 algorithm.

Figures (5)

  • Figure 1: Chair rearrangement task using CLIP feature-based randomized control. The text instruction is “place a green chair under the table.”
  • Figure 2: Overview of our control framework
  • Figure 3: Simulation environment (the green dot indicates the target position of the handle)
  • Figure 4: Experimental configuration
  • Figure 5: An example of control results when applying each method