Human Centric General Physical Intelligence for Agile Manufacturing Automation
Sandeep Kanta, Mehrdad Tavassoli, Varun Teja Chirkuri, Venkata Akhil Kumar, Santhi Bharath Punati, Praveen Damacharla, Sunny Katyara
TL;DR
The paper investigates how Vision-Language-Action foundation models can underpin General Physical Intelligence for agile, human-centered manufacturing. It surveys state-of-the-art frameworks, organizes them into six thematic pillars, and presents ablations (notably RT2-GPI) across nut-and-bolt and timber-panel tasks to illuminate trade-offs between generalization, accuracy, and speed. Key contributions include architectural modifications to existing baselines, haptic-grounding fusion strategies, and a structured discussion of data, sim-to-real, planning, safety, and benchmarking with industry-ready recommendations. The findings highlight substantial progress toward integrated perception-reasoning-action pipelines, while underscoring persistent challenges in data foundations, real-time safety, long-horizon control, and resilience that must be addressed before practical Industry 5.0 deployment is achievable.
Abstract
Agile human-centric manufacturing increasingly requires resilient robotic solutions capable of safe and productive interaction within the unstructured environments of modern factories. While multi-modal sensor fusion provides comprehensive situational awareness, robots must also contextualize their reasoning to achieve a deep semantic understanding of complex scenes. Foundation models, particularly Vision-Language-Action (VLA) models, have emerged as a promising approach for integrating diverse perceptual modalities and spatio-temporal reasoning to ground physical actions, thereby realizing General Physical Intelligence (GPI) across various robotic embodiments. Although GPI has been discussed conceptually in the literature, its pivotal role and practical deployment in agile manufacturing remain underexplored. To address this gap, this practical review systematically surveys recent advances in VLA models through the lens of GPI, offering a comparative analysis of leading implementations and evaluating their industrial readiness via a structured ablation study. The state of the art is organized into six thematic pillars, including multisensory representation learning, sim2real transfer, planning and control, uncertainty and safety measures, and benchmarking. Finally, the review highlights open challenges and future directions for integrating GPI into industrial ecosystems, in line with the Industry 5.0 vision of intelligent, adaptive, and collaborative manufacturing.
