Human Centric General Physical Intelligence for Agile Manufacturing Automation

Sandeep Kanta, Mehrdad Tavassoli, Varun Teja Chirkuri, Venkata Akhil Kumar, Santhi Bharath Punati, Praveen Damacharla, Sunny Katyara

TL;DR

The paper investigates how Vision-Language-Action foundation models can underpin General Physical Intelligence for agile, human-centered manufacturing. It surveys state-of-the-art frameworks, organizes them into six thematic pillars, and presents ablations (notably RT2-GPI) across nut-and-bolt and timber-panel tasks to illuminate trade-offs between generalization, accuracy, and speed. Key contributions include architectural modifications to existing baselines, haptic-grounding fusion strategies, and a structured discussion of data, sim-to-real, planning, safety, and benchmarking with industry-ready recommendations. The findings highlight substantial progress toward integrated perception-reasoning-action pipelines, while also underscoring persistent challenges in data foundations, real-time safety, long-horizon control, and resilience, which must be addressed to achieve practical, Industry 5.0 deployment.

Abstract

Agile human-centric manufacturing increasingly requires resilient robotic solutions capable of safe and productive interaction within the unstructured environments of modern factories. While multi-modal sensor fusion provides comprehensive situational awareness, robots must also contextualize their reasoning to achieve deep semantic understanding of complex scenes. Foundation models, particularly Vision-Language-Action (VLA) models, have emerged as a promising approach for integrating diverse perceptual modalities and spatio-temporal reasoning abilities to ground physical actions and realize General Physical Intelligence (GPI) across various robotic embodiments. Although GPI has been discussed conceptually in the literature, its pivotal role and practical deployment in agile manufacturing remain underexplored. To address this gap, this practical review systematically surveys recent advances in VLA models through the lens of GPI, offering a comparative analysis of leading implementations and evaluating their industrial readiness via a structured ablation study. The state of the art is organized into six thematic pillars, including multisensory representation learning, sim2real transfer, planning and control, uncertainty and safety measures, and benchmarking. Finally, the review highlights open challenges and future directions for integrating GPI into industrial ecosystems, in line with the Industry 5.0 vision of intelligent, adaptive and collaborative manufacturing.

Paper Structure

This paper contains 23 sections, 2 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Conceptual representation of human-inspired general physical intelligence in the context of industrial automation.
  • Figure 2: Graphical synopsis summarizing transformation of Vision-Language-Action models into General Physical Intelligence systems to advance agile manufacturing through multimodal perception and control for embodied industrial applications. Emphasis is on contact-rich interaction, transfer learning and semantic grounding with focus on development of robotics foundation models and cognitive datasets to enable safe and context-aware industrial robotic systems aligned with Industry 5.0.
  • Figure 3: Architectural representation of general physical intelligence in the context of agile manufacturing.
  • Figure 4: Franka Panda robot performing precision nut and bolt assembly using vision, language, haptic feedback and proprioceptive sensing for dynamic action planning in NVIDIA Isaac Sim.
  • Figure 5: Dexterous timber panel manipulation task using KUKA manipulators mounted on linear tracks to perform coordinated grasping, lifting and reorientation of elongated timber panels. The scenario evaluates multimodal grounding, cooperative manipulation and spatial generalization under randomized panel and robot initial configurations.