Physically Ground Commonsense Knowledge for Articulated Object Manipulation with Analytic Concepts
Jianhua Sun, Jiude Wei, Yuxuan Li, Cewu Lu
TL;DR
This work grounds semantic commonsense from large language models (LLMs) in the physical world for articulated-object manipulation. It introduces analytic concepts: unique identities paired with parameterized, analytic structural and manipulation knowledge. A three-stage pipeline (targeting, grounding structure, grounding manipulation) converts LLM reasoning into physics-informed robot actions, using tools such as Grounded-SAM, Point-Transformer encoders, and a conditional GAN for grasp proposals. Across extensive simulation and real-world experiments, the method significantly outperforms baselines, improving success rates on both seen and unseen object categories and supporting its claims of better generalization and interpretability. The work offers a practical, scalable bridge between semantic reasoning and physical control for more robust articulated-object manipulation in real-world settings.
Abstract
We humans rely on a wide range of commonsense knowledge to interact with a vast number and variety of objects in the physical world. Such commonsense knowledge is likewise crucial for robots to develop generalized object manipulation skills. Recent advances in Large Language Models (LLMs) have showcased impressive capabilities in acquiring commonsense knowledge and conducting commonsense reasoning. However, effectively grounding the semantic-level knowledge produced by LLMs in the physical world, so that it can thoroughly guide robots in generalized articulated object manipulation, remains a challenge that has not been sufficiently addressed. To this end, we introduce analytic concepts, which are procedurally defined upon mathematical symbolism and can be directly computed and simulated by machines. Using analytic concepts as a bridge between the semantic-level knowledge inferred by LLMs and the physical world where real robots operate, we can infer object structure and functionality as physics-informed representations, and then use this physically grounded knowledge to instruct robot control policies for generalized, interpretable, and accurate articulated object manipulation. Extensive experiments in both simulation and real-world environments demonstrate the superiority of our approach.
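
To make the idea of a concept that is "directly computed and simulated by machines" more concrete, the following is a minimal illustrative sketch, not the authors' formulation. It assumes a hypothetical `RevoluteDoorConcept` whose parameters (hinge position, hinge-to-handle width) stand in for grounded structural knowledge, and whose methods stand in for manipulation knowledge. All names and the 2D simplification are assumptions made for illustration only.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RevoluteDoorConcept:
    """Hypothetical analytic concept: a door panel rotating about a vertical hinge.

    The parameters are the 'structure'; the methods are closed-form
    'manipulation knowledge' derived from that structure (top-down 2D view).
    """
    hinge_x: float  # hinge position in the horizontal plane, metres
    hinge_y: float
    width: float    # distance from the hinge axis to the handle, metres

    def handle_position(self, angle_rad: float) -> Tuple[float, float]:
        """Handle location after the door has opened by angle_rad."""
        return (self.hinge_x + self.width * math.cos(angle_rad),
                self.hinge_y + self.width * math.sin(angle_rad))

    def pull_direction(self, angle_rad: float) -> Tuple[float, float]:
        """Unit tangent to the handle's arc: the direction to move to open further."""
        return (-math.sin(angle_rad), math.cos(angle_rad))

    def waypoints(self, target_angle_rad: float, steps: int = 5) -> List[Tuple[float, float]]:
        """Evenly spaced handle positions from closed (0 rad) to the target angle."""
        return [self.handle_position(target_angle_rad * i / steps)
                for i in range(steps + 1)]


if __name__ == "__main__":
    door = RevoluteDoorConcept(hinge_x=0.0, hinge_y=0.0, width=0.5)
    for x, y in door.waypoints(math.pi / 2, steps=4):
        print(f"{x:.3f}, {y:.3f}")
```

In this sketch, a perception stage would be responsible for estimating the three parameters from observations, after which the end-effector path follows analytically from the concept rather than from a learned black-box policy.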
