GOPT: Generalizable Online 3D Bin Packing via Transformer-based Deep Reinforcement Learning

Heng Xiong, Changrong Guo, Jian Peng, Kai Ding, Wenjie Chen, Xuchong Qiu, Long Bai, Jianfeng Xu

TL;DR

GOPT tackles online 3D bin packing across bins of varying size. It frames packing as a Markov decision process (MDP) and pairs two components: a Placement Generator, which yields a fixed-size set of placement candidates, and a Packing Transformer, which learns the spatial relations between the item and the candidate spaces. The Placement Generator limits the action space to 2N options (N candidate spaces paired with the item's orientations) regardless of bin size, and the Packing Transformer uses cross-attention to fuse item and space features; together these let a single policy run on bins of different dimensions. Trained with PPO, GOPT reports higher space utilization and more packed items than the baselines, and it generalizes to unseen bin dimensions and items, both in simulation and on a real robotic system. The authors name irregular shapes and improved real-world reliability as future work.
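
To make the fixed 2N action space concrete, the following is a minimal sketch and not the authors' implementation. It assumes each candidate space is an axis-aligned box given by two opposite corners, that the item has two orientations (length and width swapped), and that unused candidate slots are padded and masked out. The names `build_action_mask` and `select_action` are hypothetical.

```python
from typing import List, Tuple

# Assumed representation: a candidate space as two opposite corners
# (x1, y1, z1, x2, y2, z2).
Space = Tuple[int, int, int, int, int, int]


def build_action_mask(spaces: List[Space], item: Tuple[int, int, int],
                      n_candidates: int) -> List[bool]:
    """Return a flat mask of length 2 * n_candidates.

    Index 2 * i + o is True when the item, in orientation o, fits in space i.
    Slots beyond len(spaces) are padding and stay False.
    """
    length, width, height = item
    # Assumption: two orientations; the second swaps length and width.
    footprints = [(length, width), (width, length)]
    mask = [False] * (2 * n_candidates)
    for i, (x1, y1, z1, x2, y2, z2) in enumerate(spaces[:n_candidates]):
        for o, (dx, dy) in enumerate(footprints):
            fits = dx <= x2 - x1 and dy <= y2 - y1 and height <= z2 - z1
            mask[2 * i + o] = fits
    return mask


def select_action(logits: List[float], mask: List[bool]) -> int:
    """Greedy choice among feasible actions; returns -1 if none is feasible."""
    feasible = [k for k, ok in enumerate(mask) if ok]
    if not feasible:
        return -1
    return max(feasible, key=lambda k: logits[k])


# Two real candidates, padded to N = 4, so the policy always sees 8 actions.
spaces = [(0, 0, 0, 10, 10, 10), (0, 0, 3, 4, 6, 10)]
mask = build_action_mask(spaces, item=(5, 3, 2), n_candidates=4)
logits = [0.1, 0.2, 0.9, 0.5, 0.0, 0.0, 0.0, 0.0]
action = select_action(logits, mask)
space_index, orientation = divmod(action, 2)
```

The mask length depends only on N, not on the bin dimensions, so the policy's output layer can stay the same size across bins. In the example the highest logit (index 2) is infeasible and is skipped in favor of the best feasible action.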

Abstract

Robotic object packing has broad practical applications in the logistics and automation industry, and is often formulated by researchers as the online 3D Bin Packing Problem (3D-BPP). However, existing DRL-based methods primarily focus on enhancing performance in limited packing environments while neglecting the ability to generalize across multiple environments characterized by different bin dimensions. To this end, we propose GOPT, a generalizable online 3D Bin Packing approach via Transformer-based deep reinforcement learning (DRL). First, we design a Placement Generator module to yield finite subspaces as placement candidates and the representation of the bin. Second, we propose a Packing Transformer, which fuses the features of the items and bin, to identify the spatial correlation between the item to be packed and the available subspaces within the bin. Coupling these two components enables GOPT to perform inference on bins of varying dimensions. We conduct extensive experiments and demonstrate that GOPT not only achieves superior performance against the baselines, but also exhibits excellent generalization capabilities. Furthermore, the deployment with a robot showcases the practical applicability of our method in the real world. The source code will be publicly available at https://github.com/Xiong5Heng/GOPT.
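
The feature fusion attributed to the Packing Transformer rests on cross-attention, which can be illustrated with a minimal single-head sketch. This is not the paper's architecture: it omits learned projections, multiple heads, residual connections, and normalization, and the token layout (one token per item orientation, one per candidate subspace) and the two-directional use are assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: List[Vector], keys: List[Vector],
                    values: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention: each query attends over all keys."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


# Toy features (assumed): 2 item tokens and 3 subspace tokens, dimension 2.
item_feats = [[1.0, 0.0], [0.0, 1.0]]
space_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Item tokens query the subspaces, and subspace tokens query the item.
item_to_space = cross_attention(item_feats, space_feats, space_feats)
space_to_item = cross_attention(space_feats, item_feats, item_feats)
```

Because attention operates on a set of tokens rather than a fixed grid, the same weights can in principle process candidates drawn from bins of different sizes, which is consistent with the generalization claim above.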

Paper Structure

This paper contains 19 sections, 2 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Robot picking and packing pipeline. Left: A robot randomly picks an item from a cluttered collection of boxes and packs it compactly; three RGB-D cameras are mounted. Right: Two overhead cameras each observe the status of one of the two bins, and one up-looking camera estimates the dimensions of the picked item.
  • Figure 2: Overview of our method. (a) The inputs to GOPT are the item to be packed and the current heightmap of the bin, in which each cell's value is the stack height at that cell. The Placement Generator produces a set of empty maximal spaces (EMSs), along with a pairwise action mask between each EMS and each candidate orientation of the item. The EMSs and the item are then encoded separately, and their features are fused by the Packing Transformer, whose outputs are fed into the actor and critic networks to generate the logits of all actions and to estimate the state-value function. (b) Details of the proposed Packing Transformer, which comprises three stacked blocks, each containing two self-attention and two cross-attention layers.
  • Figure 3: Illustration of the EMS generation procedure. (a) In an example scene with two placed items, the heightmap indicates the current height of the stacked items in each grid cell. (b) Five corner points (black dots) are detected in this heightmap. (c) From these points, the corresponding largest inscribed rectangles (blue) within the bin, namely the EMSs, are generated. The first EMS, for example, is defined by the two red vertices of its blue rectangle.
  • Figure 4: Visualization results of different methods for an item sequence in a $10\times10\times10$ bin. The number beside each bin indicates the value of Uti.
  • Figure 5: Comparison of the training performance for the ablation studies. The results are obtained with 128 different random seeds.
  • ...and 1 more figure
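
The heightmap and corner-based candidate spaces described in the Figure 2 and Figure 3 captions can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's EMS procedure: corner points are supplied by hand rather than detected, and `expand_from_corner` (a hypothetical helper) grows a rectangle greedily along x and then y, which does not guarantee the largest inscribed rectangle. Physical stability is not checked.

```python
from typing import List, Tuple

Heightmap = List[List[int]]  # heightmap[x][y] = current stack height in that cell


def empty_heightmap(length: int, width: int) -> Heightmap:
    return [[0] * width for _ in range(length)]


def place(hm: Heightmap, x: int, y: int, l: int, w: int, h: int) -> int:
    """Drop an l x w x h box with its corner at (x, y); return its resting height."""
    cells = [(i, j) for i in range(x, x + l) for j in range(y, y + w)]
    z = max(hm[i][j] for i, j in cells)  # the box rests on the tallest cell below it
    for i, j in cells:
        hm[i][j] = z + h
    return z


def expand_from_corner(hm: Heightmap, x0: int, y0: int,
                       bin_height: int) -> Tuple[int, int, int, int, int, int]:
    """Greedy candidate space (x0, y0, z, x1, y1, bin_height) from a corner point.

    The base height z is the height at the corner. The rectangle grows along x,
    then along y, while every covered cell is at or below z.
    """
    z = hm[x0][y0]
    x1 = x0
    while x1 < len(hm) and hm[x1][y0] <= z:
        x1 += 1
    y1 = y0
    while y1 < len(hm[0]) and all(hm[i][y1] <= z for i in range(x0, x1)):
        y1 += 1
    return (x0, y0, z, x1, y1, bin_height)


hm = empty_heightmap(10, 10)
z_a = place(hm, 0, 0, 4, 6, 3)   # first item rests on the floor
z_b = place(hm, 4, 0, 3, 3, 2)   # second item rests on the floor beside it

floor_space = expand_from_corner(hm, 0, 6, bin_height=10)
top_space = expand_from_corner(hm, 0, 0, bin_height=10)

z_c = place(hm, 0, 0, 5, 3, 1)   # third item spans both boxes, rests on the taller
```

Here `floor_space` is the free region behind the first item at floor level, and `top_space` is the region above the first item's top face. Each space is identified by two opposite corners, matching the two-vertex description in the Figure 3 caption.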