Knowledge-Guided Manipulation Using Multi-Task Reinforcement Learning

Aditya Narendra, Mukhammadrizo Maribjonov, Dmitry Makarov, Dmitry Yudin, Aleksandr Panov

Abstract

This paper introduces Knowledge Graph based Massively Multi-task Model-based Policy Optimization (KG-M3PO), a framework for multi-task robotic manipulation in partially observable settings that unifies Perception, Knowledge, and Policy. The method augments egocentric vision with an online 3D scene graph that grounds open-vocabulary detections into a metric, relational representation. A dynamic-relation mechanism updates spatial, containment, and affordance edges at every step, and a graph neural encoder is trained end-to-end through the RL objective so that relational features are shaped directly by control performance. Multiple observation modalities (visual, proprioceptive, linguistic, and graph-based) are encoded into a shared latent space, upon which the RL agent operates to drive the control loop. The policy conditions on lightweight graph queries alongside visual and proprioceptive inputs, yielding a compact, semantically informed state for decision making. Experiments on a suite of manipulation tasks with occlusions, distractors, and layout shifts demonstrate consistent gains over strong baselines: the knowledge-conditioned agent achieves higher success rates, improved sample efficiency, and stronger generalization to novel objects and unseen scene configurations. These results support the premise that structured, continuously maintained world knowledge is a powerful inductive bias for scalable, generalizable manipulation: when the knowledge module participates in the RL computation graph, relational representations align with control, enabling robust long-horizon behavior under partial observability.
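The dynamic-relation mechanism described above can be illustrated with a small sketch that recomputes relation edges from object geometry at each step. This is not the authors' implementation: the `Box3D` type, the relation names `"inside"` and `"on"`, the axis-aligned box tests, and the 2 cm contact tolerance are all assumptions chosen for illustration, and affordance edges are omitted.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

Vec3 = Tuple[float, float, float]
Edge = Tuple[str, str, str]  # (subject, relation, object)


@dataclass(frozen=True)
class Box3D:
    """Axis-aligned 3D bounding box in metres (hypothetical object representation)."""
    name: str
    lo: Vec3  # (x_min, y_min, z_min)
    hi: Vec3  # (x_max, y_max, z_max)


def contains(outer: Box3D, inner: Box3D) -> bool:
    """True if `inner` lies entirely within `outer` on all three axes."""
    return all(outer.lo[k] <= inner.lo[k] and inner.hi[k] <= outer.hi[k]
               for k in range(3))


def on_top_of(upper: Box3D, lower: Box3D, tol: float = 0.02) -> bool:
    """True if the bottom of `upper` touches the top of `lower` (within `tol`)
    and their footprints overlap in the x-y plane."""
    touching = abs(upper.lo[2] - lower.hi[2]) <= tol
    overlap_xy = all(upper.lo[k] < lower.hi[k] and lower.lo[k] < upper.hi[k]
                     for k in (0, 1))
    return touching and overlap_xy


def update_edges(objects: List[Box3D]) -> Set[Edge]:
    """Recompute all relation edges from the current boxes.
    Intended to be called once per control step (assumed schedule)."""
    edges: Set[Edge] = set()
    for a in objects:
        for b in objects:
            if a is b:
                continue
            if contains(b, a):
                edges.add((a.name, "inside", b.name))
            elif on_top_of(a, b):
                edges.add((a.name, "on", b.name))
    return edges


table = Box3D("table", (0.0, 0.0, 0.0), (1.0, 1.0, 0.5))
cabinet = Box3D("cabinet", (2.0, 0.0, 0.0), (3.0, 1.0, 1.0))

# Step 0: the cup rests on the table.
cup = Box3D("cup", (0.4, 0.4, 0.5), (0.5, 0.5, 0.6))
edges_t0 = update_edges([table, cabinet, cup])

# Step 1: the cup has been placed inside the cabinet.
cup = Box3D("cup", (2.4, 0.4, 0.2), (2.5, 0.5, 0.3))
edges_t1 = update_edges([table, cabinet, cup])
```

In this toy scene the edge set changes from a single `("cup", "on", "table")` relation to a single `("cup", "inside", "cabinet")` relation once the cup moves. Rebuilding the edge set from scratch each step keeps the sketch simple; an incremental update would be a natural alternative for larger scenes.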

Paper Structure (28 sections, 8 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Overview of the end-to-end training pipeline for KG-M3PO (M3PO augmented with an online KG encoder trained end-to-end). Multiple observation modalities (language goal, current image, 3D scene graph, and proprioception) are encoded into a common observation space. A reinforcement learning algorithm (in our case, M3PO) then drives the control loop inside the simulation environment. The knowledge encoder (graph) is trained directly through the RL loss.
  • Figure 2: Scene Graph generation pipeline.
  • Figure 3: Example BBQ output with Franka, table, and cabinet.
  • Figure 4: Benchmark snapshots. Franka scenes for three representative tasks used in our study. Identical task variants and environments are also implemented for UR5 (not shown).
  • Figure 5: Partially observable tasks. We show UR5 examples for two PO scenarios: (a) picking an object initially hidden by a wall; (b–c) a two-stage pick–place where the agent must first remove an occluder and then retrieve the target. Identical task variants are implemented for both Franka and UR5.
  • ...and 3 more figures
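
The data flow in the Figure 1 caption, where each modality is encoded into a common observation space, might be sketched as follows. This is a forward-pass illustration only, under several assumptions: the random matrices stand in for learned projections, the graph encoder is a single round of mean-neighbour message passing with mean pooling, the fusion is plain concatenation, and all dimensions and helper names (`linear`, `project`, `encode_graph`, `fuse`) are hypothetical. In a trainable version these would be differentiable modules in an autodiff framework so that gradients of the RL loss reach the graph encoder.

```python
import random
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def linear(dim_in: int, dim_out: int, rng: random.Random) -> Matrix:
    """Random weight matrix standing in for a learned projection (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim_in)] for _ in range(dim_out)]


def project(w: Matrix, x: Vector) -> Vector:
    """Matrix-vector product w @ x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def encode_graph(node_feats: Dict[str, Vector],
                 edges: List[Tuple[str, str]],
                 w: Matrix) -> Vector:
    """One round of mean-neighbour message passing, then mean pooling over nodes."""
    incoming: Dict[str, List[str]] = {n: [] for n in node_feats}
    for src, dst in edges:
        incoming[dst].append(src)
    updated = []
    for name, feat in node_feats.items():
        # Nodes without incoming edges fall back to their own feature.
        msgs = [node_feats[m] for m in incoming[name]] or [feat]
        mean_msg = [sum(col) / len(msgs) for col in zip(*msgs)]
        combined = [f + m for f, m in zip(feat, mean_msg)]
        updated.append(project(w, combined))
    return [sum(col) / len(updated) for col in zip(*updated)]


def fuse(embeddings: List[Vector]) -> Vector:
    """Concatenate per-modality embeddings into one observation vector."""
    return [v for emb in embeddings for v in emb]


rng = random.Random(0)
latent = 4  # per-modality embedding size (illustrative)

# Toy inputs: two scene-graph nodes with 3-d features and one directed edge.
node_feats = {"cup": [1.0, 0.0, 0.3], "table": [0.0, 1.0, 0.5]}
edges = [("cup", "table")]

graph_z = encode_graph(node_feats, edges, linear(3, latent, rng))
image_z = project(linear(6, latent, rng), [0.2] * 6)      # stand-in image features
proprio_z = project(linear(7, latent, rng), [0.0] * 7)    # stand-in joint state
goal_z = project(linear(5, latent, rng), [1.0, 0.0, 0.0, 0.0, 0.0])  # stand-in language goal

obs = fuse([goal_z, image_z, graph_z, proprio_z])
```

Here `obs` has `4 * latent` entries, one block per modality, and would be the input a policy consumes. Concatenation is the simplest fusion choice; the source does not specify which operator the method uses.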