FAVLA: A Force-Adaptive Fast-Slow VLA model for Contact-Rich Robotic Manipulation

Yao Li, Peiyuan Tang, Wuyang Zhang, Chengyang Zhu, Yifan Duan, Weikai Shi, Xiaodong Zhang, Zijiang Yang, Jianmin Ji, Yanyong Zhang

TL;DR

FAVLA is a force-adaptive fast-slow VLA that decouples slow perception planning from fast contact-aware control. It significantly outperforms baselines in reactivity and success rate while applying smaller contact forces during manipulation.

Abstract

Force/torque feedback can substantially improve Vision-Language-Action (VLA) models on contact-rich manipulation, but most existing approaches fuse all modalities at a single operating frequency. This design ignores the mismatched sampling rates of real robot sensors and forces downsampling of the high-frequency contact cues needed for reactive correction. Combined with common VLM-action-expert (AE) pipelines that execute action chunks largely open loop between expensive VLM updates, unified-frequency fusion often yields delayed responses to impacts, stick-slip, and force spikes. We propose FAVLA, a force-adaptive fast-slow VLA that decouples slow perception planning from fast contact-aware control. FAVLA runs a slow VLM at a fixed low frequency to encode the input modalities into latent representations and to predict near-future force variation. A fast AE then executes at a variable high frequency, conditioning on the latest force sequence data to generate reactive actions. We further introduce a force adapter that injects high-frequency force features into multiple AE layers and adaptively schedules the AE's execution frequency based on the VLM's predicted force variation. Extensive experiments on contact-rich tasks demonstrate that FAVLA significantly outperforms baselines, achieving superior reactivity and success rates while applying smaller contact forces during manipulation.
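
The adaptive scheduling described in the abstract can be illustrated with a minimal sketch. The abstract does not state how predicted force variation maps to AE execution frequency, so everything here is an assumption for illustration: the function name `ae_runs_per_chunk`, the linear mapping, the thresholds `low` and `high` (in newtons), and the run-count bounds are all hypothetical.

```python
import math


def ae_runs_per_chunk(force_variation, low=0.5, high=5.0, min_runs=1, max_runs=8):
    """Map a predicted force variation to a number of AE runs per action chunk.

    Hypothetical mapping (not taken from the paper): a larger predicted
    variation means more frequent AE re-planning within the chunk.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    if max_runs < min_runs:
        raise ValueError("max_runs must be at least min_runs")
    # Normalise into [0, 1], clamping values outside the [low, high] band.
    t = (force_variation - low) / (high - low)
    t = min(1.0, max(0.0, t))
    # Linear interpolation between the bounds, rounded half up.
    return min_runs + math.floor(t * (max_runs - min_runs) + 0.5)


if __name__ == "__main__":
    for variation in (0.0, 1.0, 2.75, 4.0, 10.0):
        print(variation, ae_runs_per_chunk(variation))
```

Under these assumptions, free-space motion with little predicted force change falls back to a single AE run per chunk, while phases with large predicted variation trigger the maximum number of runs.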

Paper Structure (20 sections, 6 equations, 12 figures, 5 tables)

Figures (12)

  • Figure 1: Comparison of the previous unified-frequency scheme and our force-adaptive fast-slow scheme for VLA models. (a) Previous methods process vision, force, and proprioceptive state at a unified input frequency and generate open-loop action chunks that cannot promptly exploit high-frequency force feedback. (b) In our scheme, the VLM runs at a slow frequency to encode visual and language context, while the AE runs at a force-adaptive high frequency using the latest force data to generate closed-loop action chunks, enabling reactive robot control.
  • Figure 2: Overview of FAVLA. FAVLA integrates a large Slow VLM Backbone for semantic reasoning and a smaller Fast Action Expert for responsive action control. The model processes multimodal inputs at different frequencies and generates action trajectories via conditional Flow Matching. Notably, the inference frequency of the Action Expert is adjusted by the Force-Adaptive Fast-Slow Inference Strategy, which keeps contact forces smaller during manipulation.
  • Figure 3: Our force-adaptive fast-slow inference strategy. The fast AE runs multiple times within each action chunk, conditioned on real-time force. In each cycle, we fix the sampled noise and robot state, then apply a temporal ensemble over the overlapping action chunks. The AE execution frequency is set adaptively from the VLM-predicted force variance.
  • Figure 4: Real-world task scenarios and execution phases. Tasks include (1) USB Insertion, (2) Gear Assembly, (3) Box Flipping, and (4) Board Wiping. These tasks present diverse challenges: tasks 1 and 2 require millimeter-level precision for successful alignment, while tasks 3 and 4 demand real-time force adjustment to handle constantly changing contact dynamics.
  • Figure 5: Experimental setup for robotic manipulation. (a) Overview of the hardware configuration, including the teleoperation device (3D Space Mouse), 6D-force and vision sensors, and the robotic gripper. (b) The set of experimental tasks.
  • ...and 7 more figures
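
The temporal ensemble over overlapping action chunks mentioned in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the function name `temporal_ensemble`, the chunk representation, and the exponential weighting that favours newer chunks (with a hypothetical `decay` parameter) are all choices made for this example.

```python
import math


def temporal_ensemble(chunks, t, decay=0.1):
    """Blend every predicted action that covers timestep t.

    chunks: list of (start_step, actions), where actions is a list of
            action vectors (lists of floats) predicted for consecutive
            steps start_step, start_step + 1, ...
    decay:  hypothetical weighting parameter; chunks that started earlier
            are down-weighted by exp(-decay * age). decay=0 is a plain mean.
    """
    candidates = []
    for start, actions in chunks:
        idx = t - start
        if 0 <= idx < len(actions):
            candidates.append((start, actions[idx]))
    if not candidates:
        raise ValueError(f"no chunk covers timestep {t}")
    # Assumption: newer chunks saw more recent force data, so they weigh more.
    newest = max(start for start, _ in candidates)
    weights = [math.exp(-decay * (newest - start)) for start, _ in candidates]
    total = sum(weights)
    dim = len(candidates[0][1])
    return [
        sum(w * action[d] for w, (_, action) in zip(weights, candidates)) / total
        for d in range(dim)
    ]


if __name__ == "__main__":
    # Two overlapping 3-step chunks of 1-D actions, the second starting one step later.
    chunks = [(0, [[0.0], [1.0], [2.0]]), (1, [[3.0], [4.0], [5.0]])]
    for step in range(4):
        print(step, temporal_ensemble(chunks, step))
```

Timesteps covered by only one chunk pass its action through unchanged. Where chunks overlap, the blend smooths the hand-over between successive AE runs.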