Agile-VLA: Few-Shot Industrial Pose Rectification via Implicit Affordance Anchoring

Teng Yan, Zhengyang Pei, Chengyu Shi, Yue Yu, Yikun Chen, Zilong Zhu, Zelin Fang, Kaile Guo, Zihang Wang, Peigen Tian, Bingzhuo Zhong

Abstract

Deploying Vision-Language-Action (VLA) models on resource-constrained edge platforms exposes a fundamental conflict between high-latency semantic inference and the high-frequency control required for dynamic manipulation. To address this challenge, this paper presents Agile-VLA, a hierarchical framework designed for industrial pose reorientation tasks on edge devices such as the NVIDIA Jetson Orin Nano. The core innovation is an Implicit Affordance Anchoring mechanism that maps geometric visual cues, specifically centroid and rim keypoint anchors, directly into structured parametric action primitives, substantially reducing reliance on high-latency semantic inference during closed-loop control. By decoupling perception (10 Hz) from control (50 Hz) via an asynchronous dual-stream architecture, the system mitigates the frequency mismatch inherent in edge-based robot learning. Experimental results on a standard 6-DoF manipulator demonstrate that Agile-VLA robustly rectifies the pose of complex, irregular workpieces through extrinsic dexterity, using only 5-shot demonstrations.

Paper Structure (18 sections, 9 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Visualization of Agile-VLA task primitives across diverse industrial workpieces. The top row illustrates Stable Pick and Place operations guided by Stability Anchoring, where the grasp point is optimized toward the geometric centroid to minimize gravitational torque. The bottom row demonstrates Pick, Flip and Place sequences leveraging Pivot Anchoring; here, the anchor point is constrained to the object boundary to maximize the lever arm for extrinsic dexterity. The framework is validated on four irregular objects: (a) Battery, (b) Calculator, (c) Phone, and (d) Chip.
  • Figure 2: System Overview. Compared to conventional large-scale VLA models that suffer from high cloud latency, Agile-VLA achieves real-time, few-shot pose reorientation directly on an edge device (NVIDIA Jetson Orin Nano). By decoupling high-level semantic reasoning from low-level control, the framework enables agile manipulation of complex industrial components using only a minimal number of demonstration samples.
  • Figure 3: Asynchronous Dual-Stream Framework. The semantic perception stream (10 Hz) extracts manipulation anchors, while the proprioceptive control stream (50 Hz) executes parameterized action primitives. A timestamp-based soft synchronization protocol keeps the visual and kinesthetic streams aligned, thereby enabling high-fidelity few-shot adaptation on resource-constrained hardware.
  • Figure 4: Implicit Affordance Anchoring Protocol. The system maps visual cues to specific action primitives according to the object’s topological state. Front-facing objects are anchored at their geometric centroid to trigger stable, guidance-oriented grasping, whereas back-facing objects are anchored at the rim to initiate extrinsic-dexterity-based flipping, thereby effectively leveraging environmental constraints to overcome kinematic limitations.
  • Figure 5: Real-world multi-task reorientation performance across extreme physical conditions. We systematically benchmark Agile-VLA against state-of-the-art foundation models on six extreme industrial parts (involving heavy masses and complex topologies) and two novel out-of-distribution objects. Crucially, the performance on 'Unseen' objects reflects the results after our 5-shot edge adaptation pipeline, demonstrating the framework's rapid deployment capability in contrast to the massive fine-tuning required by large VLA baselines. By leveraging extrinsic pivoting, Agile-VLA successfully bypasses actuator torque overload (e.g., Heavy Battery) and kinematic singularities (e.g., Complex PCB), achieving consistently higher success rates.
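
The anchoring rule described in Figure 4 and the timestamp-based alignment described in Figure 3 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Anchor`, `select_anchor`, and `latest_anchor_index`, the primitive labels, the vertex-mean centroid, the farthest-point rim heuristic, and the 0.15 s staleness bound are all assumptions chosen for illustration. The sketch assumes the object orientation (front-facing or back-facing) has already been classified upstream and that the object outline is available as a list of 2D points.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Anchor:
    kind: str        # "centroid" or "rim"
    point: Point     # image-plane coordinates of the anchor
    primitive: str   # hypothetical label for the action primitive to trigger


def vertex_centroid(contour: Sequence[Point]) -> Point:
    """Mean of the contour vertices, used here as a simple centroid proxy."""
    n = len(contour)
    return (sum(x for x, _ in contour) / n, sum(y for _, y in contour) / n)


def select_anchor(contour: Sequence[Point], front_facing: bool) -> Anchor:
    """Front-facing objects anchor at the centroid; back-facing ones at the rim."""
    c = vertex_centroid(contour)
    if front_facing:
        # Grasping near the centroid keeps the gravitational torque small.
        return Anchor("centroid", c, "pick_place")
    # Assumed heuristic: the contour point farthest from the centroid
    # gives the longest lever arm for a pivoting flip.
    rim = max(contour, key=lambda p: (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2)
    return Anchor("rim", rim, "pick_flip_place")


def latest_anchor_index(
    stamps: Sequence[float], t_ctrl: float, max_age: float = 0.15
) -> Optional[int]:
    """Index of the newest perception stamp at or before t_ctrl.

    `stamps` must be sorted in ascending order (seconds). Returns None when
    no anchor exists yet or the newest one is older than `max_age`.
    """
    i = bisect_right(stamps, t_ctrl) - 1
    if i < 0 or t_ctrl - stamps[i] > max_age:
        return None
    return i


if __name__ == "__main__":
    outline = [(0.0, 0.0), (4.0, 0.0), (8.0, 2.0), (0.0, 2.0)]
    print(select_anchor(outline, front_facing=True))
    print(select_anchor(outline, front_facing=False))
    print(latest_anchor_index([0.0, 0.1, 0.2], t_ctrl=0.14))
```

With perception at 10 Hz and control at 50 Hz, a controller built this way would call `latest_anchor_index` on every 20 ms tick, so roughly five consecutive control ticks reuse the same anchor before a new one arrives. Returning `None` for a stale anchor lets the control loop fall back to a safe behavior instead of acting on outdated geometry.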