Physics-Aware Robotic Palletization with Online Masking Inference

Tianqi Zhang, Zheng Wu, Yuxin Chen, Yixiao Wang, Boyuan Liang, Scott Moura, Masayoshi Tomizuka, Mingyu Ding, Wei Zhan

TL;DR

The paper tackles online palletization by incorporating intrinsic box properties—density and rigidity—into a 3D bin packing framework. It introduces an online learning-based action masking model integrated with a reinforcement-learning policy, eliminating reliance on hand-crafted heuristics and offline data. The method uses a 3D-UNet to produce a feasible-placement mask from stereo-like inputs of the pallet and upcoming boxes, with a reward that favors space utilization under stability and an online dataset managed via a deque. Experimental validation in MuJoCo and real-world robotic palletizing demonstrates improved convergence, higher space utilization, and effective transfer to physical deployment. The approach offers a practical, generalizable solution for density/rigidity-aware online stacking in real warehouses.

Abstract

The efficient planning of stacking boxes, especially in the online setting where the sequence of item arrivals is unpredictable, remains a critical challenge in modern warehouse and logistics management. Existing solutions often address box size variations, but overlook intrinsic physical properties of the boxes, such as density and rigidity, which are crucial for real-world applications. We use reinforcement learning (RL) to solve this problem by employing action space masking to direct the RL policy toward valid actions. Unlike previous methods that rely on hand-designed stability heuristics, which are difficult to evaluate in physical scenarios, our framework utilizes online learning to dynamically train the action space mask, eliminating the need for manual heuristic design. Extensive experiments demonstrate that our proposed method outperforms existing state-of-the-art methods. Furthermore, we deploy our learned task planner in a real-world robotic palletizer, validating its practical applicability in operational settings.
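
To make the idea of action space masking concrete, the following is a minimal sketch, not the authors' implementation. It assumes a discrete policy that emits one logit per candidate placement cell and a binary mask in which 1 marks a placement judged stable. The names `masked_softmax` and `sample_action`, the 2x3 grid, and the logit values are all hypothetical and chosen only for illustration.

```python
import math
import random


def masked_softmax(logits, mask):
    """Softmax over logits, restricted to entries whose mask value is 1."""
    if not any(mask):
        raise ValueError("mask rules out every action")
    # Subtract the largest valid logit for numerical stability.
    peak = max(l for l, m in zip(logits, mask) if m)
    exps = [math.exp(l - peak) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, mask, rng=random):
    """Draw an action index; masked-out actions have zero probability."""
    probs = masked_softmax(logits, mask)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical 2x3 pallet grid, flattened row by row.
logits = [0.2, 1.5, -0.3, 0.9, 2.0, 0.1]
mask = [1, 0, 1, 1, 0, 1]  # cells 1 and 4 are assumed to be unstable

probs = masked_softmax(logits, mask)
action = sample_action(logits, mask, random.Random(0))
```

Although cells 1 and 4 carry the largest logits here, the mask assigns them zero probability, so the policy can only ever select among the remaining cells.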

Paper Structure

This paper contains 11 sections, 1 equation, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Overall framework of the proposed method, which can be divided into three parts. The first part is feature extraction. At each timestep $t$, given the state $s_t = \{C_t, d_t\}$, we process the pallet configuration $C_t$ using a 3D CNN and the properties $d_t$ of the boxes in the buffer using an MLP. The two outputs are then concatenated to form the observation feature. The second part is interaction. The latter half of the policy network outputs an action based on the observation feature. We use a 3D-UNet (Çiçek et al., 2016) as the action masking model, with an added final convolutional layer that transforms the multi-channel 3D array into a single-channel 2D map. Using the selected box's properties and the current pallet configuration, this model generates an action space mask that filters out unstable box placement points. The action is then executed in the environment, and the result is obtained. The third part is the training of the action masking model. Unlike the first two parts, which are executed at each timestep, this part runs only when the policy is updated. First, multiple parallel simulators generate feasibility annotations for the data selected during the rollout. These data points and their annotations are then appended to the feasible dataset, which is subsequently used to train the action masking model over several epochs.
  • Figure 2: Visualization of our simulated palletization environment in MuJoCo (Todorov et al., 2012). Although 40 boxes are displayed for illustrative purposes, the robot is programmed to perceive and interact with only N boxes within the buffer area. The arrangement of the boxes is randomized and unknown, shuffled anew for each RL episode.
  • Figure 3: IoU score on validation set. After rolling out for a certain number of timesteps, we update the action masking model and calculate the IoU score on the validation set. Results are averaged over 5 random seeds.
  • Figure 4: Frequency curves for the three types of episode end reasons. During training, we record how often each reason ends an episode and plot the result as a curve. Results are averaged over 5 random seeds.
  • Figure 5: Space utilization. Our method (OL Mask) demonstrates faster convergence and achieves better space utilization compared to the other methods. Results are averaged over 5 random seeds.
  • ...and 2 more figures
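
The captions for Figures 1 and 3 describe appending simulator-annotated samples to a feasible dataset and scoring the masking model by IoU on a validation set. The sketch below illustrates both steps under stated assumptions and is not the authors' code: the class name `FeasibleDataset`, the function `mask_iou`, the buffer capacity, and the toy samples are hypothetical. Whether the paper's deque is bounded is not stated in this summary, so `maxlen` is an assumption.

```python
from collections import deque


class FeasibleDataset:
    """Buffer of (inputs, feasibility_mask) pairs (hypothetical name)."""

    def __init__(self, capacity):
        # Assumption: a bounded deque, which drops its oldest item when full.
        self.samples = deque(maxlen=capacity)

    def extend(self, new_samples):
        """Append freshly annotated rollout samples."""
        self.samples.extend(new_samples)

    def __len__(self):
        return len(self.samples)


def mask_iou(pred, target):
    """Intersection over union of two binary 2D masks (nested 0/1 lists)."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


dataset = FeasibleDataset(capacity=3)
dataset.extend([
    ("rollout-0", [[1, 0], [0, 0]]),
    ("rollout-1", [[1, 1], [0, 0]]),
    ("rollout-2", [[0, 0], [1, 1]]),
    ("rollout-3", [[1, 0], [1, 0]]),
])
oldest_kept = dataset.samples[0][0]

# One overlapping cell out of three cells covered by either mask.
score = mask_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]])
```

With a capacity of 3, adding a fourth sample evicts `rollout-0`, so training always draws on the most recent annotations; a larger capacity trades freshness for a more varied dataset.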