RoboInter: A Holistic Intermediate Representation Suite Towards Robotic Manipulation

Hao Li, Ziqin Wang, Zi-han Ding, Shuai Yang, Yilun Chen, Yang Tian, Xiaolin Hu, Tai Wang, Dahua Lin, Feng Zhao, Si Liu, Jiangmiao Pang

TL;DR

RoboInter tackles the generalization bottleneck in vision-language-action robotics by introducing a holistic intermediate representation suite that includes data, benchmarks, and models. The RoboInter-Data dataset provides dense per-frame annotations for a broad set of representations, RoboInter-VQA benchmarks embodied reasoning, and RoboInter-VLA enables plan-then-execute learning with intermediate supervision. Empirical results show that intermediate representations improve planning, grounding, and long-horizon control across diverse real-world scenes and embodiments, evidenced by both open-loop and closed-loop evaluations. The work enables scalable pretraining and cross-embodiment research by openly releasing datasets, benchmarks, and models, fostering broader advances in embodied AI.

Abstract

Advances in large vision-language models (VLMs) have stimulated growing interest in vision-language-action (VLA) systems for robot manipulation. However, existing manipulation datasets remain costly to curate, highly embodiment-specific, and insufficient in coverage and diversity, thereby hindering the generalization of VLA models. Recent approaches attempt to mitigate these limitations via a plan-then-execute paradigm, where high-level plans (e.g., subtasks, trace) are first generated and subsequently translated into low-level actions, but they critically rely on extra intermediate supervision, which is largely absent from existing datasets. To bridge this gap, we introduce the RoboInter Manipulation Suite, a unified resource including data, benchmarks, and models of intermediate representations for manipulation. It comprises RoboInter-Tool, a lightweight GUI that enables semi-automatic annotation of diverse representations, and RoboInter-Data, a large-scale dataset containing over 230k episodes across 571 diverse scenes, which provides dense per-frame annotations over more than 10 categories of intermediate representations, substantially exceeding prior work in scale and annotation quality. Building upon this foundation, RoboInter-VQA introduces 9 spatial and 20 temporal embodied VQA categories to systematically benchmark and enhance the embodied reasoning capabilities of VLMs. Meanwhile, RoboInter-VLA offers an integrated plan-then-execute framework, supporting modular and end-to-end VLA variants that bridge high-level planning with low-level execution via intermediate supervision. In total, RoboInter establishes a practical foundation for advancing robust and generalizable robotic learning via fine-grained and diverse intermediate representations.
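The plan-then-execute paradigm described in the abstract (a high-level plan such as a subtask and trace is generated first, then translated into low-level actions) can be illustrated with a minimal toy sketch. Everything here is a hypothetical stand-in, not the RoboInter-VLA implementation: the `Plan` class, the `plan` and `execute` functions, the linear-interpolation planner, and the 2D displacement actions are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class Plan:
    """Hypothetical intermediate representation: a subtask label plus a 2D trace."""
    subtask: str
    trace: List[Point]  # waypoints, assumed to be normalized image coordinates


def plan(instruction: str, start: Point, goal: Point, steps: int = 4) -> Plan:
    """Toy planner: linearly interpolates a trace from start to goal.

    A real planner would be a VLM; this stand-in only shows the interface.
    """
    trace = [
        (
            start[0] + (goal[0] - start[0]) * i / steps,
            start[1] + (goal[1] - start[1]) * i / steps,
        )
        for i in range(steps + 1)
    ]
    return Plan(subtask=instruction, trace=trace)


def execute(p: Plan) -> List[Point]:
    """Toy executor: converts consecutive waypoints into per-step displacements."""
    return [(b[0] - a[0], b[1] - a[1]) for a, b in zip(p.trace, p.trace[1:])]


# Plan first, then execute: the executor only sees the intermediate plan.
p = plan("pick up the cup", start=(0.0, 0.0), goal=(1.0, 0.5))
actions = execute(p)
```

The point of the split is that the intermediate plan is an explicit, inspectable object, so it can be supervised separately from the low-level actions, which is the kind of supervision the abstract says existing datasets largely lack.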

Paper Structure (46 sections, 1 equation, 40 figures, 24 tables)

Figures (40)

  • Figure 1: The RoboInter manipulation suite includes annotation tools, annotated data, a curated VQA dataset, and their applications in VLMs and VLAs. RoboInter provides RoboInter-Data, a dataset with over 230k episodes (mainly from DROID [khazatsky2024droid] and RH20T [rh20t]) and 10+ types of intermediate representation annotations; RoboInter-VQA, a curated embodied VQA benchmark and dataset covering 29 spatial- and temporal-level categories; and RoboInter-VLA, an integrated plan-then-execute framework for training VLM and VLA models.
  • Figure 2: Overview of RoboInter-Data and RoboInter-VQA. We collect and annotate 230k manipulation episodes to obtain 10 types of intermediate representation annotations through Data Collection and Annotation & Check. We further construct a large-scale, diverse set of VQA spanning spatial and temporal dimensions. Statistics of raw manipulation episodes and the curated VQA are also provided.
  • Figure 3: Framework of RoboInter-VLA. Our model follows a plan-then-execute paradigm with a VLM-based Planner and an Executor. The Planner exhibits enhanced understanding and generation for manipulation, strong general grounding abilities, and robust perception across diverse scenes. The Executor shares the VLM backbone with the Planner. Three variants are supported, and intermediate representations in Flexible Chain-of-Thought (F-CoT) bridge planning and execution.
  • Figure 4: Open-loop evaluation in the In-the-Wild setting. We report OLS at different error thresholds (@0.1 to @0.01) and the mean value.
  • Figure 5: Real-World Experiments. The top charts present results from 15 in-distribution (ID) and 15 out-of-distribution (OOD) trials. The bottom panel illustrates the OOD test setup. The performance drop from ID to OOD reflects each model’s generalization under distribution shift: EC-E2E outperforms IC-E2E and exhibits a smaller ID→OOD degradation (8.3% vs. 19.0%), consistent with the conclusion of the Open-Loop Evaluation. Key steps are marked with numbers, along with an end-execution thumbnail. Experimental results of RoboInter-Modular are in Table \ref{tab:modular_realworld}.
  • ...and 35 more figures