Contact-Anchored Policies: Contact Conditioning Creates Strong Robot Utility Models

Zichen Jeff Cui, Omar Rayyan, Haritheja Etukuru, Bowen Tan, Zavier Andrianarivo, Zicheng Teng, Yihang Zhou, Krish Mehta, Nicholas Wojno, Kevin Yuanbo Wu, Manan H Anjaria, Ziyuan Wu, Manrong Mao, Guangxun Zhang, Binit Shah, Yejin Kim, Soumith Chintala, Lerrel Pinto, Nur Muhammad Mahi Shafiullah

TL;DR

This work introduces Contact-Anchored Policies (CAP), a framework that replaces language-based conditioning with contact-based anchors to ground robot manipulation. CAP factors behavior into modular robot utility models and employs a fast simulation-in-the-loop pipeline (EgoGym) to rapidly iterate and diagnose failure modes, achieving strong zero-shot generalization on Pick, Open, and Close tasks with only 23 hours of demonstrations. The approach demonstrates cross-embodiment transfer, competitive or superior zero-shot performance versus state-of-the-art vision-language-action models, and the capacity to chain CAPs via tool calling for long-horizon tasks, with sim-to-real alignment validated through external evaluations. The work provides open-source software, datasets, and a practical framework for researchers with limited resources to study emergent general manipulation abilities.

Abstract

The prevalent paradigm in robot learning attempts to generalize across environments, embodiments, and tasks with language prompts at runtime. A fundamental tension limits this approach: language is often too abstract to guide the concrete physical understanding required for robust manipulation. In this work, we introduce Contact-Anchored Policies (CAP), which replace language conditioning with points of physical contact in space. Simultaneously, we structure CAP as a library of modular utility models rather than a monolithic generalist policy. This factorization allows us to implement a real-to-sim iteration cycle: we build EgoGym, a lightweight simulation benchmark, to rapidly identify failure modes and refine our models and datasets prior to real-world deployment. We show that by conditioning on contact and iterating via simulation, CAP generalizes to novel environments and embodiments out of the box on three fundamental manipulation skills while using only 23 hours of demonstration data, and outperforms large, state-of-the-art VLAs in zero-shot evaluations by 56%. All model checkpoints, codebase, hardware, simulation, and datasets will be open-sourced. Project page: https://cap-policy.github.io/

Paper Structure (53 sections, 19 figures, 4 tables)

Figures (19)

  • Figure 1: We introduce Contact-Anchored Policies (CAP), a method for conditioning multimodal policies on physical contact information. Such policies generalize zero-shot to novel objects and scenes with orders of magnitude less data, compute, and model parameters than frontier behavior models, while outperforming them on atomic skills trained with CAP.
  • Figure 2: The process of data labeling, training, and inference for Contact-Anchored Policies. (a) During training, we detect the contact point from the data and label the trajectory with hindsight relabeling. (b) During inference, we use a user click or a VLM conditioned on the user's command to derive the contact condition. In both cases, the contact tokens and visual tokens are concatenated and passed to the model, which uses them as input to predict actions.
  • Figure 3: Our data collection tool and matching robot deployment gripper.
  • Figure 4: EgoGym: a lightweight simulation-in-the-loop environment used for quick development and evaluation of Contact-Anchored Policies (CAPs). EgoGym enables fast checkpoint evaluation and failure mode discovery across Pick, Open, and Close tasks using procedurally generated scenes.
  • Figure 5: Evaluation environments for CAP. Each scene and object combination has 10 trials, so Pick checkpoints are evaluated for 250 episodes and Open or Close checkpoints are evaluated for 100 episodes.
  • ...and 14 more figures
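
The labeling and conditioning flow described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Step` record, the `in_contact` flag, the zero-padding `embed_contact` function, and the choice to place contact tokens before visual tokens are all hypothetical. The caption only states that a contact point is detected in the data, used as a hindsight label, and that contact tokens are concatenated with visual tokens.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Token = List[float]


@dataclass
class Step:
    """One timestep of a demonstration (hypothetical record layout)."""
    gripper_pos: Point   # end-effector position at this timestep
    in_contact: bool     # output of some contact detector (assumed given)


def hindsight_contact_label(trajectory: Sequence[Step]) -> Optional[Point]:
    """Return the gripper position at the first detected contact.

    The whole trajectory is then labeled with this point after the fact,
    which is the sense of "hindsight relabeling" assumed here.
    """
    for step in trajectory:
        if step.in_contact:
            return step.gripper_pos
    return None  # no contact was detected in this demonstration


def embed_contact(point: Point, dim: int) -> Token:
    """Toy embedding: zero-pad the 3D point to the token width."""
    if dim < 3:
        raise ValueError("token width must be at least 3")
    return list(point) + [0.0] * (dim - 3)


def build_model_input(contact_tokens: List[Token],
                      visual_tokens: List[Token]) -> List[Token]:
    """Concatenate contact and visual tokens along the sequence axis.

    The ordering (contact first) is an assumption of this sketch.
    """
    return list(contact_tokens) + list(visual_tokens)


# Usage: label a short demonstration, then assemble the policy input.
demo = [
    Step((0.0, 0.0, 0.5), False),
    Step((0.2, 0.1, 0.3), False),
    Step((0.3, 0.1, 0.2), True),   # first contact
    Step((0.3, 0.1, 0.4), True),
]
contact = hindsight_contact_label(demo)
visual = [[0.1] * 8, [0.2] * 8]    # stand-in for encoder output
tokens = build_model_input([embed_contact(contact, dim=8)], visual)
```

At inference time, the same `build_model_input` call would receive a contact point derived from a user click or a VLM rather than from a recorded trajectory, so the policy sees the same input layout in both settings.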