RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation
Songhao Han, Boxiang Qiu, Yue Liao, Siyuan Huang, Chen Gao, Shuicheng Yan, Si Liu
TL;DR
RoboCerebra tackles long-horizon robotic manipulation by shifting from reactive System 1 control to deliberative System 2 planning. It introduces a large-scale, memory-rich dataset generated via a cascaded GPT-based task-creation pipeline with human demonstrations, alongside a hierarchical planning framework that combines a high-level VLM planner with a low-level VLA controller and a shared memory. The benchmark enables multi-dimensional evaluation of planning, reflection, and memory, and reveals that state-of-the-art VLMs exhibit measurable gains when integrated into hierarchical planning, though there remains a gap to ground-truth planning, highlighting the need for stronger temporal grounding and memory-aware reasoning. Overall, RoboCerebra provides a scalable, reproducible platform for advancing temporally abstracted robotic planning with potential impact on instruction-conditioned, generalizable robotics in dynamic environments.
Abstract
Recent advances in vision-language models (VLMs) have enabled instruction-conditioned robotic systems with improved generalization. However, most existing work focuses on reactive System 1 policies, underutilizing VLMs' strengths in semantic reasoning and long-horizon planning. These System 2 capabilities, characterized by deliberative, goal-directed thinking, remain underexplored due to the limited temporal scale and structural complexity of current benchmarks. To address this gap, we introduce RoboCerebra, a benchmark for evaluating high-level reasoning in long-horizon robotic manipulation. RoboCerebra includes: (1) a large-scale simulation dataset with extended task horizons and diverse subtask sequences in household environments; (2) a hierarchical framework combining a high-level VLM planner with a low-level vision-language-action (VLA) controller; and (3) an evaluation protocol targeting planning, reflection, and memory through structured System 1–System 2 interaction. The dataset is constructed via a top-down pipeline in which GPT generates task instructions and decomposes them into subtask sequences. Human operators then execute the subtasks in simulation, yielding high-quality trajectories with dynamic object variations. Compared to prior benchmarks, RoboCerebra features significantly longer action sequences and denser annotations. We further benchmark state-of-the-art VLMs as System 2 modules and analyze their performance across key cognitive dimensions, advancing the development of more capable and generalizable robotic planners.
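To make the hierarchical framework concrete, the following is a minimal sketch of how a high-level planner, a low-level controller, and a shared memory could interact in one episode. It is an illustration under stated assumptions, not the RoboCerebra implementation: `SharedMemory`, `toy_planner`, `toy_controller`, and `run_episode` are hypothetical names, the planner is a plain function standing in for a VLM, the controller is a stub standing in for a VLA policy, and the instruction format (subtasks separated by semicolons) is invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SharedMemory:
    """Hypothetical memory read and written by both planner and controller."""
    completed: List[str] = field(default_factory=list)
    active: Optional[str] = None


def toy_planner(instruction: str, memory: SharedMemory) -> Optional[str]:
    """Stand-in for a VLM planner (assumption): pick the next unfinished subtask."""
    subtasks = [s.strip() for s in instruction.split(";") if s.strip()]
    next_index = len(memory.completed)
    return subtasks[next_index] if next_index < len(subtasks) else None


def toy_controller(subtask: str, step: int) -> bool:
    """Stand-in for a VLA controller (assumption): each subtask takes 3 steps."""
    return step >= 2  # True once the subtask counts as finished


def run_episode(
    instruction: str,
    planner: Callable[[str, SharedMemory], Optional[str]],
    controller: Callable[[str, int], bool],
    max_steps: int = 100,
) -> SharedMemory:
    """Alternate between planning (System 2) and control (System 1)."""
    memory = SharedMemory()
    steps_used = 0
    while steps_used < max_steps:
        # System 2: consult memory and choose the next subtask.
        memory.active = planner(instruction, memory)
        if memory.active is None:
            break  # every subtask is done
        # System 1: act on the active subtask until it finishes
        # or the step budget runs out.
        step = 0
        while steps_used < max_steps:
            steps_used += 1
            finished = controller(memory.active, step)
            step += 1
            if finished:
                memory.completed.append(memory.active)
                break
    return memory


if __name__ == "__main__":
    result = run_episode(
        "open drawer; pick up mug; place mug in drawer",
        toy_planner,
        toy_controller,
    )
    print(result.completed)
```

In this sketch the memory is the only channel between the two levels, so a planner that ignores it would repeat the first subtask indefinitely. If the step budget runs out mid-subtask, `active` keeps the unfinished subtask, which gives a simple hook for inspecting where a long-horizon episode stalled.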
