Critic in the Loop: A Tri-System VLA Framework for Robust Long-Horizon Manipulation

Pengfei Yi, Yingjie Ma, Wenjiang Xu, Yanan Hao, Shuai Gan, Wanting Li, Shanlin Zhong

TL;DR

Critic in the Loop is an adaptive hierarchical framework driven by dynamic VLM-Expert scheduling. It minimizes expensive VLM queries while substantially improving system robustness and autonomy in out-of-distribution (OOD) scenarios.

Abstract

Balancing high-level semantic reasoning with low-level reactive control remains a core challenge in visual robotic manipulation. While Vision-Language Models (VLMs) excel at cognitive planning, their inference latency precludes real-time execution. Conversely, fast Vision-Language-Action (VLA) models often lack the semantic depth required for complex, long-horizon tasks. To bridge this gap, we introduce Critic in the Loop, an adaptive hierarchical framework driven by dynamic VLM-Expert scheduling. At its core is a bionic Tri-System architecture comprising a VLM brain for global reasoning, a VLA cerebellum for reactive execution, and a lightweight visual Critic. By continuously monitoring the workspace, the Critic dynamically routes control authority. It sustains rapid closed-loop execution via the VLA for routine subtasks, and adaptively triggers the VLM for replanning upon detecting execution anomalies such as task stagnation or failures. Furthermore, our architecture seamlessly integrates human-inspired rules to intuitively break infinite retry loops. This visually-grounded scheduling minimizes expensive VLM queries, while substantially enhancing system robustness and autonomy in out-of-distribution (OOD) scenarios. Comprehensive experiments on challenging, long-horizon manipulation benchmarks reveal that our approach achieves state-of-the-art performance.
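The routing behaviour described in the abstract can be illustrated with a small scheduling loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `Verdict`, `EpisodeLog`, `run_episode`, the `plan`/`act`/`critic` callables, the `"retry"`/`"change_strategy"` hints, and the `max_retries` threshold are all hypothetical. The fast policy keeps control while the Critic reports progress, the planner is queried only on completion or anomaly, and a simple retry counter stands in for the human-inspired rule that breaks retry loops.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, List, Optional


class Verdict(Enum):
    CONTINUE = auto()  # subtask is progressing; the fast policy keeps control
    DONE = auto()      # subtask finished; ask the planner for the next one
    ANOMALY = auto()   # stagnation or failure; ask the planner to replan


@dataclass
class EpisodeLog:
    vlm_calls: int = 0
    vla_steps: int = 0
    subtasks: List[str] = field(default_factory=list)


def run_episode(
    plan: Callable[[Optional[str]], Optional[str]],  # stand-in for the VLM "brain"
    act: Callable[[str], None],                      # stand-in for the VLA "cerebellum"
    critic: Callable[[str], Verdict],                # stand-in for the visual Critic
    max_steps: int = 100,
    max_retries: int = 2,                            # assumed threshold, not from the paper
) -> EpisodeLog:
    log = EpisodeLog()
    retries = 0

    subtask = plan(None)  # initial plan: the only unconditional planner query
    log.vlm_calls += 1
    if subtask is not None:
        log.subtasks.append(subtask)

    while subtask is not None and log.vla_steps < max_steps:
        act(subtask)  # one fast control step
        log.vla_steps += 1
        verdict = critic(subtask)

        if verdict is Verdict.CONTINUE:
            continue  # routine execution: no planner query

        if verdict is Verdict.DONE:
            retries = 0
            hint = None
        else:
            # Assumed rule: after too many failed attempts, request a different strategy.
            retries += 1
            if retries > max_retries:
                hint, retries = "change_strategy", 0
            else:
                hint = "retry"

        subtask = plan(hint)  # event-driven planner query
        log.vlm_calls += 1
        if subtask is not None:
            log.subtasks.append(subtask)

    return log
```

In this sketch the number of planner queries grows with the number of events (completions and anomalies) rather than with the number of control steps, which is the property the abstract attributes to visually grounded scheduling.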

Paper Structure (29 sections, 2 equations, 6 figures, 2 tables, 1 algorithm)


Figures (6)

  • Figure 1: Overview. (a) Previous static dual-system pipeline. (b) Our framework dynamically routes control between a high-level VLM and a VLA via an independent Critic. The radar chart on the right highlights our higher success rates over the baseline across diverse scenes. The bottom panels showcase real-world capabilities, notably out-of-distribution (OOD) generalization: our system successfully picks and places a cup using the left arm, despite having no left-arm training data for this task.
  • Figure 2: Overview of the proposed method. Our Tri-System VLA architecture decouples cognitive reasoning from continuous control via event-driven scheduling. System 2 (Brain) uses a VLM to generate semantic subtasks, while System 1 (Cerebellum) translates them into continuous actions. System 3 (Critic) asynchronously monitors execution, detects anomalies, and integrates human-inspired heuristic rules. By triggering the Brain for replanning only upon completion, failure, or interruption, this asynchronous design effectively bypasses VLM inference bottlenecks in robot control.
  • Figure 3: Overview of the Tri-System VLA execution timeline. The System Three Critic ($V.$) asynchronously evaluates progress and governs the dynamic scheduling between the System Two Brain ($S2.$) and the System One Cerebellum ($S1.$).
  • Figure 4: Overview of the automated subtask annotation pipeline. Raw end-effector trajectories are processed into candidate waypoints (top right) via geometric filtering and gripper state analysis. Paired with corresponding visual frames, a VLM retrieves precise semantic labels, resulting in the continuous temporal segmentation and subtask annotation shown at the bottom.
  • Figure 5: Qualitative results of real-world evaluations. Our proposed system demonstrates robust capabilities across complex scenarios: (Left) stable, long-horizon manipulation of deformable objects; (Middle) out-of-distribution (OOD) generalization via a human-inspired rule that resolves execution stagnation; and (Right) real-time anomaly detection (triggered by the <aci> token) followed by autonomous recovery.
  • ...and 1 more figure
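
The Figure 4 caption mentions extracting candidate waypoints from raw end-effector trajectories via geometric filtering and gripper state analysis. The caption does not give the exact criteria, so the following is only an illustrative sketch under assumed rules: a timestep becomes a candidate if the gripper state toggles there, or if the path direction turns by at least a threshold angle. The function name `candidate_waypoints`, the `angle_deg` parameter, and both rules are hypothetical. The VLM labelling step is not covered.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def candidate_waypoints(
    positions: Sequence[Point],
    gripper_open: Sequence[bool],
    angle_deg: float = 45.0,  # assumed turn threshold, not from the paper
) -> List[int]:
    """Return sorted indices of candidate waypoints along a trajectory."""
    n = len(positions)
    if n != len(gripper_open):
        raise ValueError("positions and gripper_open must have equal length")
    if n == 0:
        return []

    picked = {0, n - 1}  # always keep the trajectory endpoints

    # Gripper state analysis (assumed rule): open/close transitions mark boundaries.
    for t in range(1, n):
        if gripper_open[t] != gripper_open[t - 1]:
            picked.add(t)

    # Geometric filtering (assumed rule): keep points where the path turns sharply.
    for t in range(1, n - 1):
        a = [positions[t][i] - positions[t - 1][i] for i in range(3)]
        b = [positions[t + 1][i] - positions[t][i] for i in range(3)]
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        if na < 1e-9 or nb < 1e-9:
            continue  # stationary segment: direction is undefined
        cos = sum(x * y for x, y in zip(a, b)) / (na * nb)
        cos = max(-1.0, min(1.0, cos))  # guard against rounding error
        if math.degrees(math.acos(cos)) >= angle_deg:
            picked.add(t)

    return sorted(picked)
```

Consecutive candidate indices then delimit the segments that, per the caption, are paired with their visual frames and passed to a VLM for semantic labelling.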