Causal Policy Gradient for Whole-Body Mobile Manipulation

Jiaheng Hu; Peter Stone; Roberto Martín-Martín

TL;DR

The paper tackles mobile manipulation (MoMa) by addressing the large, coupled action space and multi-objective rewards. It introduces Causal MoMa, a two-step framework that automatically discovers causal links between action dimensions and reward terms via conditional mutual information and then trains policies with a causal policy gradient, reducing gradient variance. The approach enables simultaneous use of base, arm, and head actions and demonstrates strong improvements over baselines in Minigrid, realistic simulators, and real-world transfers, including zero-shot sim-to-real. Limitations include handling long-horizon dependencies and end-to-end vision, with future work proposed on hierarchical causal discovery and RGB integration. Overall, Causal MoMa offers a principled, data-driven way to factor action spaces in MoMa tasks, improving training stability and real-world applicability.
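The discovery step summarized above can be illustrated on toy data. The following is a minimal sketch, not the paper's estimator: it assumes small discrete states, actions, and rewards, and swaps in a simple count-based (plug-in) estimate of conditional mutual information. The function names, the threshold value of 0.05, and the matrix orientation (rows are action dimensions, columns are reward terms) are all assumptions made for illustration.

```python
import numpy as np
from collections import Counter


def conditional_mutual_information(x, y, z):
    """Plug-in estimate of I(X; Y | Z) in nats from discrete samples.

    x, y: lists of hashable values; z: list of hashable tuples.
    """
    n = len(x)
    c_xyz = Counter(zip(x, y, z))
    c_xz = Counter(zip(x, z))
    c_yz = Counter(zip(y, z))
    c_z = Counter(z)
    cmi = 0.0
    for (xi, yi, zi), n_xyz in c_xyz.items():
        # p(x,y,z) * log[ p(x,y,z) p(z) / (p(x,z) p(y,z)) ]
        ratio = n_xyz * c_z[zi] / (c_xz[(xi, zi)] * c_yz[(yi, zi)])
        cmi += (n_xyz / n) * np.log(ratio)
    return cmi


def causal_matrix(actions, rewards, states, threshold=0.05):
    """B[i, j] = 1 if action dim i appears to influence reward term j.

    For each action dimension i, the conditioning set is the state plus
    all other action dimensions. The threshold is a hypothetical value.
    """
    n_actions, n_rewards = actions.shape[1], rewards.shape[1]
    B = np.zeros((n_actions, n_rewards), dtype=int)
    for i in range(n_actions):
        others = np.delete(actions, i, axis=1)
        cond = [tuple(row) for row in np.column_stack([states, others]).tolist()]
        a_i = actions[:, i].tolist()
        for j in range(n_rewards):
            cmi = conditional_mutual_information(a_i, rewards[:, j].tolist(), cond)
            B[i, j] = int(cmi > threshold)
    return B


# Toy data collected with random actions: 1 binary state, 2 binary action
# dimensions, 2 reward terms (all invented for this example).
rng = np.random.default_rng(0)
N = 5000
states = rng.integers(0, 2, size=(N, 1))
actions = rng.integers(0, 2, size=(N, 2))
rewards = np.column_stack([
    actions[:, 0] ^ states[:, 0],  # reward term 0 depends on action 0 and the state
    actions[:, 1],                 # reward term 1 depends on action 1 only
])
B = causal_matrix(actions, rewards, states)
print(B)
```

On this toy problem the recovered matrix links each action dimension only to the reward term it actually affects. Continuous robot actions would need a different estimator, which this sketch does not attempt.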

Abstract

Developing the next generation of household robot helpers requires combining locomotion and interaction capabilities, which is generally referred to as mobile manipulation (MoMa). MoMa tasks are difficult due to the large action space of the robot and the common multi-objective nature of the task, e.g., efficiently reaching a goal while avoiding obstacles. Current approaches often segregate tasks into navigation without manipulation and stationary manipulation without locomotion by manually matching parts of the action space to MoMa sub-objectives (e.g. learning base actions for locomotion objectives and learning arm actions for manipulation). This solution prevents simultaneous combinations of locomotion and interaction degrees of freedom and requires human domain knowledge for both partitioning the action space and matching the action parts to the sub-objectives. In this paper, we introduce Causal MoMa, a new reinforcement learning framework to train policies for typical MoMa tasks that makes use of the most favorable subspace of the robot's action space to address each sub-objective. Causal MoMa automatically discovers the causal dependencies between actions and terms of the reward function and exploits these dependencies through causal policy gradient that reduces gradient variance compared to previous state-of-the-art reinforcement learning algorithms, improving convergence and results. We evaluate the performance of Causal MoMa on three types of simulated robots across different MoMa tasks and demonstrate success in transferring the policies trained in simulation directly to a real robot, where our agent is able to follow moving goals and react to dynamic obstacles while simultaneously and synergistically controlling the whole-body: base, arm, and head. More information at https://sites.google.com/view/causal-moma.

Paper Structure (13 sections, 4 theorems, 9 equations, 6 figures, 1 table)

Introduction
Related Work
Causal MoMa
Action-Reward Causal Discovery
Policy Learning
Experimental Evaluation
Evaluation in the Minigrid Simulator
Evaluation in Realistic Robot Simulators
Evaluation on a Real-World Mobile Manipulator
Limitations and Conclusion
Proof of Theorem 3.1 (Causal Sufficiency and Necessity)
Proof Sketch of Theorem 3.2 (Causal Policy Gradient)
Minigrid Experimental Details

Key Result

Theorem 3.1

Let $\{s, \mathbf{a} \backslash a_i\}$ denote the conditioning set $\{s, a_1, a_2, \ldots, a_{i-1}, a_{i+1}, \ldots, a_n\}$. Let $\mathbf{a}^{t:t+k}$ denote an $n \times k$ matrix representing the $k$-step actions from timestep $t$ to timestep $t+k-1$. Let $\mathbf{r}^{\sum{t:t+k}}$ denote a $

Figures (6)

Figure 1: Robot executing a mobile manipulation task: placing a jug on a table. The task is naturally defined by multiple objectives corresponding to a factored reward function with multiple components (red): reaching the placing location, keeping the orientation upright, looking at the goal, and avoiding collisions with the base. Only some subsets of the degrees of freedom of the robot (green) are necessary to fulfill each objective. This corresponds to causal dependencies between some action space dimensions and reward terms (top-right). Causal MoMa infers these underlying causal relationships and exploits them in a causal policy gradient approach that enables learning policies for complex mobile manipulation tasks.
Figure 2: Two-step procedure in Causal MoMa for policy training in MoMa tasks with factored reward functions and no a priori known action-space factorization. Top: Causal MoMa infers the causal dependencies between reward terms and action dimensions through a causal discovery procedure on randomly collected data: it estimates and thresholds the conditional mutual information (CMI) between action dimensions and reward factors to infer the Causal Matrix, $B$. Bottom: Causal MoMa trains a policy that generates whole-body action commands based on onboard sensor signals and task information. To do so, Causal MoMa exploits the discovered Causal Matrix through a causal policy gradient: advantages per reward term are aggregated into advantages for the causally related action dimensions and used to update the policy, greatly reducing policy gradient variance.
Figure 3: Experimental evaluation of Causal MoMa on the Minigrid domain: (Left) the agent controls an embodiment (red triangle) with discrete actions for navigation and manipulation, with the objective of reaching a goal location (green tile). Blue tiles and green tiles require the agent to perform specific virtual manipulation actions. Orange tiles should be avoided by the agent. (Right) training curves for Causal MoMa and baselines, five seeds each, mean and std: Causal MoMa converges to the highest reward thanks to the discovery and exploitation of the causal dependencies between actions and reward terms.
Figure 4: Experimental evaluation of Causal MoMa on the iGibson domain: (Left) the agent is placed in one of eight possible household scenes and controls one of two realistically simulated mobile manipulation embodiments, a Fetch or an HSR robot, with continuous action dimensions and different dexterity (7 vs. 5 degrees of freedom in the arm, non-holonomic vs. holonomic base) for a virtual place-glass task: reaching a desired location with the hand while keeping a fixed hand orientation and avoiding collisions. Obstacles and robot initial locations are randomized per episode. (Middle and Right) training curves for Causal MoMa and baselines for the Fetch (middle) and HSR (right) embodiments, five seeds each, mean and std. In this complex setup, Causal MoMa consistently outperforms the baselines and achieves a higher return thanks to the reduced gradient variance of the causal policy gradient.
Figure 5: Evaluation environment for Causal MoMa in the real world. Left: the robot is placed in a mock apartment never seen during training, and the best policy trained with Causal MoMa is transferred zero-shot. Right: the robot is tasked with reaching different locations with the end-effector (crosses) from varying starting points (circles) while keeping a desired orientation, avoiding collisions, and keeping the goal in sight. We evaluate paths under three obstacle conditions, all common in household environments: no obstacles (red), static obstacles (green), and dynamic obstacles (blue) in the direct path to the goal. Each setup is repeated for two types of goals, static and dynamic. First cross: the robot's initial end-effector goal (also the final goal when the goal is static). Second cross: the robot's final end-effector goal when the goal is dynamic. The policy trained with Causal MoMa achieves higher performance than a planning-based solution (with and without replanning) that has privileged information about the layout of the scene.
...and 1 more figure
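
The advantage aggregation described in the Figure 2 caption can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes a policy that factorizes across action dimensions (so a per-dimension log-probability is available), a binary causal matrix `B` with rows for action dimensions and columns for reward terms, and hypothetical function names and example numbers.

```python
import numpy as np


def per_action_advantages(reward_advantages, B):
    """Aggregate per-reward-term advantages into per-action-dimension advantages.

    reward_advantages: (T, n_rewards) array, one advantage per reward term.
    B: (n_actions, n_rewards) binary causal matrix (orientation assumed here).
    Returns a (T, n_actions) array where action dimension i receives the sum
    of the advantages of the reward terms it is causally linked to.
    """
    return reward_advantages @ B.T


def causal_pg_surrogate(log_probs, reward_advantages, B):
    """Surrogate objective: mean over time of sum_i log pi(a_i | s) * A_i.

    log_probs: (T, n_actions) per-dimension log-probabilities, assuming the
    policy factorizes over action dimensions.
    """
    adv = per_action_advantages(reward_advantages, B)
    return float(np.mean(np.sum(log_probs * adv, axis=1)))


# Hypothetical example: 3 action dimensions, 3 reward terms, 2 timesteps.
B = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 0, 1]])
reward_adv = np.array([[1.0, -2.0, 0.5],
                       [0.0, 3.0, -1.0]])

adv = per_action_advantages(reward_adv, B)
print(adv)

# For comparison, an all-ones matrix assigns every action dimension the sum of
# all reward-term advantages, which is the unfactored policy-gradient case.
dense = per_action_advantages(reward_adv, np.ones_like(B))
print(dense)
```

With the sparse matrix, an action dimension is no longer credited or blamed for reward terms it cannot influence, which is the intuition behind the variance reduction the caption mentions.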

Theorems & Definitions (4)

Theorem 3.1
Theorem 3.2
Lemma A.1
Proposition B.2
