Learning Neuro-symbolic Programs for Language Guided Robot Manipulation

Namasivayam Kalithasan, Himanshu Singh, Vishal Bindal, Arnav Tuli, Vishwajeet Agrawal, Rahul Jain, Parag Singla, Rohan Paul

TL;DR

The paper tackles language-guided robot manipulation by translating natural language instructions into executable manipulation programs grounded in the robot’s state. It introduces a neuro-symbolic framework whose domain-specific language is executed over a latent, object-centric scene representation, so the model can be trained end-to-end with supervision from only the initial and final scenes. The architecture combines a Language Reasoner, a Visual Extractor, a Visual Reasoner, and an Action Simulator. The Language Reasoner is trained with REINFORCE, while perception and action prediction are trained with supervised losses; the model also supports scene reconstruction for interpretability. Experiments in a PyBullet environment with a 7-DOF manipulator show strong generalization to novel scenes and longer instructions, with the model outperforming neural baselines and CLIP-based approaches, and the predicted programs are demonstrated on a simulated robot. These findings suggest that symbolic reasoning over neural representations can improve robustness and interpretability in language-guided manipulation.
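To make the REINFORCE step concrete, here is a minimal sketch of the score-function update for a policy that chooses among a few candidate programs. It is an illustration under stated assumptions, not the paper's implementation: the three candidates, their rewards, the moving-average baseline, and the names `reinforce_grad` and `train` are all hypothetical. The reward is assumed to score how well executing a sampled program reproduces the final scene.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_grad(probs, action, advantage):
    """Gradient of advantage * log pi(action) with respect to the logits."""
    return [advantage * ((1.0 if j == action else 0.0) - p)
            for j, p in enumerate(probs)]


def train(rewards, steps=200, lr=0.5, seed=0):
    """Sample a candidate, observe its reward, and nudge the logits."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    baseline = 0.0  # moving-average baseline (an assumption) to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        grad = reinforce_grad(probs, action, reward - baseline)
        logits = [l + lr * g for l, g in zip(logits, grad)]
        baseline = 0.9 * baseline + 0.1 * reward
    return softmax(logits)


# Made-up rewards for three hypothetical candidate programs.
candidate_rewards = [0.1, 0.9, 0.2]
final_probs = train(candidate_rewards)
```

Because the program is sampled rather than computed differentiably, the update needs only a scalar reward per sample. That matches the setting described above, where no intermediate program supervision is available.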

Abstract

Given a natural language instruction and an input scene, our goal is to train a model to output a manipulation program that can be executed by the robot. Prior approaches to this task have one of the following limitations: they (i) rely on hand-coded symbols for concepts, limiting generalization beyond those seen during training [1], (ii) infer action sequences from instructions but require dense sub-goal supervision [2], or (iii) lack the semantics required for the deeper object-centric reasoning inherent in interpreting complex instructions [3]. In contrast, our approach can handle linguistic as well as perceptual variations, is end-to-end trainable, and requires no intermediate supervision. The proposed model uses symbolic reasoning constructs that operate on a latent neural object-centric representation, allowing for deeper reasoning over the input scene. Central to our approach is a modular structure consisting of a hierarchical instruction parser and an action simulator to learn disentangled action representations. Our experiments in a simulated environment with a 7-DOF manipulator, consisting of instructions with varying numbers of steps and scenes with different numbers of objects, demonstrate that our model is robust to such variations and significantly outperforms baselines, particularly in the generalization settings. The code, dataset and experiment videos are available at https://nsrmp.github.io

Paper Structure (17 sections, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Model architecture. The Visual Extractor forms dense object representations from the scene image using a pre-trained object detector and feature extractor. The Language Reasoner auto-regressively induces a symbolic program from the instruction that represents rich symbolic reasoning over spatial and action constructs inherent in the instruction. The Visual Reasoner determines which objects are affected by actions in the plan using symbolic and spatial reasoning. The Action Simulator predicts the final location of the moved object. The model is trained end-to-end with a loss on the bounding boxes, backpropagated to the action and visual modules. REINFORCE (Williams, 1987) is used to train the language reasoner, from which symbolic programs are sampled.
  • Figure 2: Quasi-symbolic program execution. We postulate a latent space of hierarchical, symbolic programs that perform explicit reasoning over action, spatial and visual concepts. a) The language reasoner infers a program belonging to this space, capturing the semantics of the language instruction. The program is executed over the latent object representations extracted from the initial scene to get a grounded program. b) The grounded program is used by the action simulator to compute the final location of the moved object. This is fed into a low-level motion planner for computing the low-level trajectory.
  • Figure 3: Object-centric baseline NMN+. For a fair comparison, our model and NMN+ share the same language encoder (with LSTM-based splitter) and the visual extractor. Attention blocks compute language-guided attention over object embeddings to get the subject and predicate for the manipulation action. This is fed into the action simulator along with the action embedding from the action decoder to get the predicted final location of the object.
  • Figure 4: Performance in generalization settings
  • Figure 5: Execution of the robot manipulator on (a) compound instructions, (b) a scene with 15 objects, (c) a double-step instruction with relational attributes, and (d) a 5-step instruction. Panel (d) also shows the reconstruction of the predicted scene before each step of the simulation.
  • ...and 1 more figures
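
The quasi-symbolic execution described in the Figure 2 caption can be sketched as soft attention over objects. This is a toy illustration under loud assumptions, not the paper's DSL or networks: the concept scores in `SCENE` are hand-written stand-ins for learned visual outputs, the operator names (`filter_concept`, `select`, `move_on_top`) are hypothetical, and a fixed offset replaces the learned Action Simulator.

```python
# Hypothetical toy scene: per-object concept scores in [0, 1] stand in for the
# outputs of a learned visual module; "pos" is a 2-D bounding-box centre.
SCENE = [
    {"concepts": {"red": 0.9, "blue": 0.1, "cube": 0.9}, "pos": (0.2, 0.5)},
    {"concepts": {"red": 0.1, "blue": 0.9, "cube": 0.9}, "pos": (0.7, 0.5)},
    {"concepts": {"red": 0.9, "blue": 0.1, "cube": 0.1}, "pos": (0.5, 0.2)},
]


def filter_concept(attention, concept, scene):
    """Soft filter: scale each object's attention by its score for `concept`."""
    return [a * obj["concepts"].get(concept, 0.0)
            for a, obj in zip(attention, scene)]


def normalize(attention):
    """Rescale attention weights so they sum to one."""
    total = sum(attention)
    if total == 0:
        raise ValueError("no object matches the description")
    return [a / total for a in attention]


def select(concepts, scene):
    """Ground a description such as ["red", "cube"] as attention over objects."""
    attention = [1.0] * len(scene)
    for concept in concepts:
        attention = filter_concept(attention, concept, scene)
    return normalize(attention)


def expected_position(attention, scene):
    """Attention-weighted average of the object positions."""
    x = sum(a * obj["pos"][0] for a, obj in zip(attention, scene))
    y = sum(a * obj["pos"][1] for a, obj in zip(attention, scene))
    return (x, y)


def move_on_top(subject_att, target_att, scene, offset=(0.0, 0.1)):
    """Stand-in for a learned action simulator: place the most-attended
    subject at the target's expected position plus a fixed offset."""
    tx, ty = expected_position(target_att, scene)
    moved = max(range(len(scene)), key=lambda i: subject_att[i])
    return moved, (tx + offset[0], ty + offset[1])


# "Put the red cube on top of the blue cube", written as a nested program.
subject = select(["red", "cube"], SCENE)
target = select(["blue", "cube"], SCENE)
moved_index, predicted_pos = move_on_top(subject, target, SCENE)
```

Keeping every operator soft means that each intermediate result is a distribution over objects rather than a hard choice. This is what would let a loss on the predicted final location propagate back into the concept scores.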