ClevrSkills: Compositional Language and Visual Reasoning in Robotics

Sanjay Haresh, Daniel Dijkman, Apratim Bhattacharyya, Roland Memisevic

TL;DR

ClevrSkills is an environment suite built on top of the ManiSkill2 simulator, together with an accompanying dataset. We benchmark multiple VLM baselines on it and show that, even after being pre-trained on large numbers of tasks, these models fail at compositional reasoning in robotics tasks.

Abstract

Robotics tasks are highly compositional by nature. For example, to perform a high-level task like cleaning the table, a robot must employ low-level capabilities: moving its effectors to the objects on the table, picking them up, and then moving them off the table one by one, while re-evaluating the resulting dynamic scene in the process. Given that large vision language models (VLMs) have shown progress on many tasks that require high-level, human-like reasoning, we ask the question: if the models are taught the requisite low-level capabilities, can they compose them in novel ways to achieve interesting high-level tasks like cleaning the table without being explicitly taught to do so? To this end, we present ClevrSkills, a benchmark suite for compositional reasoning in robotics. ClevrSkills is an environment suite developed on top of the ManiSkill2 simulator and an accompanying dataset. The dataset contains trajectories generated on a range of robotics tasks with language and visual annotations as well as multi-modal prompts as task specification. The suite includes a curriculum of tasks with three levels of compositional understanding, starting with simple tasks requiring basic motor skills. We benchmark multiple VLM baselines on ClevrSkills and show that even after being pre-trained on large numbers of tasks, these models fail on compositional reasoning in robotics tasks.

Paper Structure

This paper contains 40 sections, 8 figures, 5 tables.

Figures (8)

  • Figure 1: The ClevrSkills environment suite includes support for multi-modal prompts as task specification, multi-camera RGB observations, dense hierarchical action labels, action demonstrations in end-effector space and support for RL with dense rewards for all the tasks.
  • Figure 2: Example task compositions in ClevrSkills. Higher level tasks in ClevrSkills are built on skills acquired from lower level tasks (L0 $\rightarrow$ L1 $\rightarrow$ L2).
  • Figure 3: Left: The median length of an episode across task levels, showing a significant increase in episode length as we go from lower to higher levels of compositionality. Right: The mean number of solvers used by the oracle to complete a task across task levels. Each solver handles a specific sub-task, so the increase shows that higher levels contain increasingly compositional tasks.
  • Figure 4: In contrast to state-of-the-art models such as RoboFlamingo (cf. Fig. 1 in Li et al., 2023), the StreamRoboLM model can auto-regressively process videos as input, which helps it succeed on the long-horizon tasks of ClevrSkills.
  • Figure 5: Per task success rate on L0 tasks.
  • ...and 3 more figures