ForceVLA2: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation

Yang Li, Zhaxizhuoma, Hongru Jiang, Junjie Xia, Hongquan Zhang, Jinda Du, Yunsong Zhou, Jia Zeng, Ce Hao, Jieji Ren, Qiaojun Yu, Cewu Lu, Yu Qiao, Jiangmiao Pang

Abstract

Embodied intelligence for contact-rich manipulation has predominantly relied on position control, while explicit awareness and regulation of interaction forces remain under-explored, limiting stability, precision, and robustness in real-world tasks. We propose ForceVLA2, an end-to-end vision-language-action framework that equips robots with hybrid force-position control and explicit force awareness. ForceVLA2 introduces force-based prompts into the VLM expert to construct force-aware task concepts across stages, and employs a Cross-Scale Mixture-of-Experts (MoE) in the action expert to adaptively fuse these concepts with real-time interaction forces for closed-loop hybrid force-position regulation. To support learning and evaluation, we construct ForceVLA2-Dataset, containing 1,000 trajectories over 5 contact-rich tasks, including wiping, pressing, and assembling, with multi-view images, task prompts, proprioceptive state, and force signals. Extensive experiments show that ForceVLA2 substantially improves success rates and reliability in contact-rich manipulation, outperforming pi0 and pi0.5 by 48.0% and 35.0%, respectively, across the 5 tasks, and mitigating common failure modes such as arm overload and unstable contact, thereby actively advancing force-aware interactive physical intelligence in VLAs. The project page is available at https://sites.google.com/view/force-vla2/home.
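To make the term "hybrid force-position control" concrete, the following is a minimal sketch of the classical selection-matrix formulation from the robotics literature, not ForceVLA2's learned policy. The function name, the gains `kp_pos` and `kp_force`, and the sign convention (a positive displacement along an axis increases the applied force along that axis) are all assumptions made for illustration only.

```python
import numpy as np

def hybrid_force_position_command(x, x_des, f_meas, f_des, force_axes,
                                  kp_pos=1.0, kp_force=0.002):
    """Blend position and force control per Cartesian axis.

    Hypothetical helper for illustration: axes flagged in `force_axes`
    track the desired force, the remaining axes track the desired position.
    Returns a displacement command with the same shape as `x`.
    """
    x = np.asarray(x, dtype=float)
    x_des = np.asarray(x_des, dtype=float)
    f_meas = np.asarray(f_meas, dtype=float)
    f_des = np.asarray(f_des, dtype=float)

    # Diagonal selection matrix: 1 on force-controlled axes, 0 elsewhere.
    S = np.diag(np.asarray(force_axes, dtype=float))
    I = np.eye(x.shape[0])

    pos_term = kp_pos * (x_des - x)            # proportional position error
    force_term = kp_force * (f_des - f_meas)   # force error mapped to displacement

    # Position law acts on the unselected axes, force law on the selected ones.
    return (I - S) @ pos_term + S @ force_term

# Example: move in x/y by position, regulate a pressing force along z.
cmd = hybrid_force_position_command(
    x=[0.0, 0.0, 0.0], x_des=[0.1, 0.0, 0.5],
    f_meas=[0.0, 0.0, 2.0], f_des=[0.0, 0.0, 10.0],
    force_axes=[False, False, True],
)
print(cmd)
```

In this fixed-mask version the split between force and position axes is chosen by hand. The abstract describes ForceVLA2 as regulating this balance adaptively from vision, language, and real-time force input, which the sketch does not attempt to model.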

Paper Structure (17 sections, 17 equations, 8 figures, 5 tables)


Figures (8)

  • Figure 1: ForceVLA2 concept. Contact-rich manipulation requires force regulation, beyond visual and state observations (left). ForceVLA2 integrates force information across multiple scales, enabling rich modeling of contact dynamics. It builds force awareness into task planning through incoming force signals, and it outputs hybrid force–position actions with dynamic balance (right).
  • Figure 2: Framework of ForceVLA2. ForceVLA2 takes multi-view images, task and force prompts, and proprioceptive states (EE pose and force) as input. Force is injected at multiple scales: as sub-task prompts fused with images in the vision-language model, and as force tokens combined with EE pose in the multimodal encoder, with a bypass to preserve raw signals. The cross-scale MoE integrates these modalities to produce hybrid force–position actions and track sub-task progress for adaptive, contact-rich manipulation.
  • Figure 3: Overview of ForceVLA2-Dataset. (a) ForceVLA2-Dataset is the first dataset with force prompts for task decomposition and the only one providing force-control supervision. (b) It features 1,000 demonstrations across five contact-rich tasks.
  • Figure 4: The dataset collection system. A Flexiv arm is teleoperated with a manually controlled GELLO device [wu2023gello] to perform dexterous tasks while images, forces, and the robot's pose are recorded.
  • Figure 5: Qualitative results on typical manipulation tasks (compared with ForceVLA and the $\pi$ series). ForceVLA2 completes these tasks with higher success rates and faster execution while avoiding arm overload, demonstrating superior compliance.
  • ...and 3 more figures