Efficient Reinforcement Learning of Task Planners for Robotic Palletization through Iterative Action Masking Learning

Zheng Wu; Yichuan Li; Wei Zhan; Changliu Liu; Yun-Hui Liu; Masayoshi Tomizuka

TL;DR

This work tackles robotic palletization as an online 3D Bin Packing Problem (BPP) with a buffer, where a large combinatorial action space hinders efficient RL training. It introduces a supervised action-masking pipeline that learns to predict valid, stable placements using a semantic-segmentation framework (U-Net), plus a DAgger-like iterative refinement that aligns the masking model with the state distribution encountered during RL. Empirical results show that the learned mask (LearnedMask) markedly improves stability-aware action pruning (IoU of 89.2% vs. 76.6% for heuristics) and accelerates RL learning, achieving higher pallet utilization across buffer sizes and enabling real-world deployment on a Franka Panda prototype (72.0% space usage). The work highlights a practical, data-efficient path to robust RL-based task planning in complex, high-dimensional robotics settings, while acknowledging that trajectory planning must still be integrated for fully collision-free execution.
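To make the two quantities in this summary concrete, here is a minimal sketch of how a binary action mask can be applied to policy logits and how the IoU between a predicted mask and a ground-truth mask can be computed. It assumes the action space is flattened into a list and that the mask holds 1 for placements judged stable and 0 otherwise. The function names `masked_softmax` and `mask_iou` are illustrative and do not come from the paper.

```python
import math


def masked_softmax(logits, mask):
    """Softmax over actions; actions with mask == 0 get probability 0.

    Hypothetical helper: `logits` are raw policy scores, `mask` is a
    same-length sequence of 0/1 flags (1 = placement considered valid).
    """
    if not any(mask):
        raise ValueError("mask excludes every action")
    # Subtract the largest valid logit for numerical stability.
    peak = max(l for l, m in zip(logits, mask) if m)
    exps = [math.exp(l - peak) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def mask_iou(predicted, truth):
    """Intersection over union of two binary masks of equal length."""
    intersection = sum(1 for p, t in zip(predicted, truth) if p and t)
    union = sum(1 for p, t in zip(predicted, truth) if p or t)
    # Two empty masks agree perfectly by convention (an assumption here).
    return intersection / union if union else 1.0


if __name__ == "__main__":
    probabilities = masked_softmax([1.0, 2.0, 3.0], [1, 0, 1])
    print(probabilities)  # the middle action receives zero probability
    print(mask_iou([1, 1, 0, 0], [1, 0, 1, 0]))
```

Zeroing masked actions before normalization means the policy never samples a pruned placement, which is how a mask can shrink the effective action space during training.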

Abstract

The development of robotic systems for palletization in logistics scenarios is of paramount importance, addressing critical efficiency and precision demands in supply chain management. This paper investigates the application of Reinforcement Learning (RL) to enhance task planning for such robotic systems. Confronted with the substantial challenge of a vast action space, which is a significant impediment to efficiently applying off-the-shelf RL methods, our study introduces a novel method that uses supervised learning to iteratively prune and manage the action space. By reducing the complexity of the action space, our approach not only accelerates the learning phase but also ensures the effectiveness and reliability of task planning in robotic palletization. The experimental results underscore the efficacy of this method, highlighting its potential for improving the performance of RL applications in complex, high-dimensional environments such as logistics palletization.
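The iterative pruning idea can be sketched as a loop that alternates between collecting labeled placements, refitting the masking model, and retraining the policy under the new mask. This is only a schematic under stated assumptions and not the paper's algorithm: the three callables (`collect_labels`, `fit_mask_model`, `train_policy`) are hypothetical placeholders, and aggregating data across rounds is assumed from the DAgger-like description.

```python
def iterative_action_masking(collect_labels, fit_mask_model, train_policy, rounds=3):
    """Alternate between mask learning and RL training (illustrative only).

    collect_labels(policy) -> list of (state, stable_mask) samples gathered
        from states visited by `policy` (None means an initial policy).
    fit_mask_model(dataset) -> a masking model trained on all samples so far.
    train_policy(mask_model) -> an RL policy trained with that mask applied.
    All three are hypothetical callables supplied by the caller.
    """
    dataset = []
    mask_model = None
    policy = None
    for _ in range(rounds):
        # Aggregate new samples with earlier ones rather than replacing them.
        dataset.extend(collect_labels(policy))
        mask_model = fit_mask_model(list(dataset))
        policy = train_policy(mask_model)
    return policy, mask_model, dataset
```

Collecting labels from the states the current policy visits is what lets the masking model keep up as the policy's state distribution shifts during training.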

Paper Structure (26 sections, 1 equation, 8 figures, 1 table, 1 algorithm)

INTRODUCTION
RELATED WORK
Offline 3D BPP
Online 3D BPP
OUR PROPOSED APPROACH
Problem Formulation
State
Action
Reward
Action Space Masking via Supervised Learning
Data Collection
Learning the Action Masking Model
Embedding the Learned Action Masking Model into RL Training
Iterative Action Masking for RL Training
EXPERIMENTAL VALIDATIONS
...and 11 more sections

Figures (8)

Figure 1: A typical robotic palletization system is composed of various modules, including perception, task planner, trajectory planner, controller, etc. Task planning is the main focus of our work.
Figure 2: An overview of our action masking learning process. The methodology unfolds in three phases: data collection, for gathering relevant training data; learning the action masking model, where a U-Net architecture learns to distinguish stable from unstable placements; and embedding the learned action masking model into RL training, integrating the model to dynamically reduce the action space and enhance RL optimization.
Figure 3: Visualization of our simulated palletization environment in MuJoCo (Todorov et al., 2012). Although 80 boxes are displayed for illustrative purposes, the robot is programmed to perceive and interact with only $N$ boxes within the buffer area. The arrangement of the boxes is randomized and unknown, shuffled anew for each RL episode.
Figure 4: The policy network architecture adopted in our study. We use a CNN to encode the height map and an MLP to encode the dimensions of forthcoming boxes. The resulting embeddings are concatenated and serve as the input to the policy network. The action masking model, if present, helps the policy network ignore infeasible actions during learning.
Figure 5: Learning curve of the three methods when buffer size $N=1$. Results are averaged over 5 random seeds. Our method (LearnedMask) converges faster and achieves better space utilization compared to the baseline methods.
...and 3 more figures
