RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation

Songhao Han, Boxiang Qiu, Yue Liao, Siyuan Huang, Chen Gao, Shuicheng Yan, Si Liu

TL;DR

RoboCerebra tackles long-horizon robotic manipulation by shifting from reactive System 1 control to deliberative System 2 planning. It introduces a large-scale, memory-rich dataset generated via a cascaded GPT-based task-creation pipeline with human demonstrations, alongside a hierarchical planning framework that combines a high-level VLM planner with a low-level VLA controller and a shared memory. The benchmark enables multi-dimensional evaluation of planning, reflection, and memory, and reveals that state-of-the-art VLMs exhibit measurable gains when integrated into hierarchical planning, though there remains a gap to ground-truth planning, highlighting the need for stronger temporal grounding and memory-aware reasoning. Overall, RoboCerebra provides a scalable, reproducible platform for advancing temporally abstracted robotic planning with potential impact on instruction-conditioned, generalizable robotics in dynamic environments.

Abstract

Recent advances in vision-language models (VLMs) have enabled instruction-conditioned robotic systems with improved generalization. However, most existing work focuses on reactive System 1 policies, underutilizing VLMs' strengths in semantic reasoning and long-horizon planning. These System 2 capabilities, characterized by deliberative, goal-directed thinking, remain underexplored due to the limited temporal scale and structural complexity of current benchmarks. To address this gap, we introduce RoboCerebra, a benchmark for evaluating high-level reasoning in long-horizon robotic manipulation. RoboCerebra includes: (1) a large-scale simulation dataset with extended task horizons and diverse subtask sequences in household environments; (2) a hierarchical framework combining a high-level VLM planner with a low-level vision-language-action (VLA) controller; and (3) an evaluation protocol targeting planning, reflection, and memory through structured System 1-System 2 interaction. The dataset is constructed via a top-down pipeline, where GPT generates task instructions and decomposes them into subtask sequences. Human operators execute the subtasks in simulation, yielding high-quality trajectories with dynamic object variations. Compared to prior benchmarks, RoboCerebra features significantly longer action sequences and denser annotations. We further benchmark state-of-the-art VLMs as System 2 modules and analyze their performance across key cognitive dimensions, advancing the development of more capable and generalizable robotic planners.

Paper Structure

This paper contains 27 sections, 6 equations, 18 figures, 8 tables, 1 algorithm.

Figures (18)

  • Figure 1: We shift the focus of robotic imitation learning from fast, reactive System 1 behavior to slow, deliberative System 2 reasoning. To support this, we introduce RoboCerebra, a benchmark centered on long-horizon tasks composed of extended subtask sequences. (a) A top-down data generation pipeline uses an LLM to produce high-level task instructions and decompose them into subtasks. Human operators execute these in simulation to collect trajectories, with multi-stage verification ensuring quality and semantic consistency. (b) A dataset example showing a long, fine-grained subtask sequence under dynamically changing visual conditions. (c) RoboCerebra features significantly longer trajectories, approximately 6× those in existing robotic manipulation benchmarks.
  • Figure 2: Task generation pipeline in RoboCerebra. (a) Objects are randomly sampled from Libero's item library and converted into structured representations based on their categories and attributes. (b) The structured data is fed into an LLM to generate high-level task descriptions, which are hierarchically decomposed into low-level substeps. (c) The resulting task plan is parsed into executable simulator code via rule-based transformations. The generated scene is then validated through a closed-loop process involving symbolic checks and vision-language consistency via VLMs.
  • Figure 3: The statistical analysis of our RoboCerebra dataset. (a) Distribution of minimum steps per task, highlighting its long-horizon nature. (b) Frequency of action categories, with dominant primitives (place, pick, pour) and rare fine-grained actions. (c) Number of action categories per task, showing high compositional diversity: over 10% of tasks involve five or more action types.
  • Figure 4: Overview of our HPE framework. Left: VLA model training uses paired images and single-step instructions to optimize a visual token policy. VLM training uses execution videos with success-labeled instructions for temporal grounding. Right: During execution, the VLM processes low-frequency observations to update low-level plans stored in the memory bank, while the VLA consumes high-frequency observations to execute fine-grained actions based on the detailed plan.
  • Figure 5: Example from RoboCerebra for different tasks.
  • ...and 13 more figures
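
The two-rate execution loop described in the Figure 4 caption can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' HPE implementation: `MemoryBank`, `run_episode`, `toy_planner`, `toy_controller`, and the `plan_every` parameter are all hypothetical names, and the planner and controller are stand-in functions rather than a real VLM or VLA.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class MemoryBank:
    """Hypothetical shared memory holding the latest low-level plan."""
    plan: List[str] = field(default_factory=list)

    def current_subgoal(self) -> Optional[str]:
        # The controller always works on the first remaining subgoal.
        return self.plan[0] if self.plan else None


def run_episode(
    planner: Callable[[int, MemoryBank], List[str]],
    controller: Callable[[int, str], str],
    observations: Iterable[int],
    plan_every: int = 4,
) -> List[Tuple[int, str, str]]:
    """Run a planner at low frequency and a controller at every step."""
    memory = MemoryBank()
    trace = []
    for step, obs in enumerate(observations):
        # Low-frequency path: the planner (VLM stand-in) rewrites the plan.
        if step % plan_every == 0:
            memory.plan = planner(obs, memory)
        subgoal = memory.current_subgoal()
        if subgoal is None:  # nothing left to do
            break
        # High-frequency path: the controller (VLA stand-in) acts every step.
        trace.append((step, subgoal, controller(obs, subgoal)))
    return trace


def toy_planner(obs: int, memory: MemoryBank) -> List[str]:
    # Assumption: the first subtask counts as finished once obs reaches 4.
    full_plan = ["pick cup", "place cup"]
    return full_plan[1:] if obs >= 4 else full_plan


def toy_controller(obs: int, subgoal: str) -> str:
    return f"action for '{subgoal}' at obs {obs}"


trace = run_episode(toy_planner, toy_controller, range(8), plan_every=4)
for step, subgoal, action in trace:
    print(step, subgoal, action)
```

In this toy run the planner is consulted only at steps 0 and 4, while the controller produces an action at every step. That separation of update rates is the only property of the framework the sketch is meant to show.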