Preference-Conditioned Reinforcement Learning for Space-Time Efficient Online 3D Bin Packing

Nikita Sarawgi, Omey M. Manyar, Fan Wang, Thinh H. Nguyen, Daniel Seita, Satyandra K. Gupta

TL;DR

STEP (Space-Time Efficient Packing) is a preference-conditioned, Transformer-based reinforcement learning policy for online 3D bin packing. It generalizes across candidate set sizes, integrates with standard placement modules, and achieves a 44% reduction in operational time without compromising packing density.

Abstract

Robotic bin packing is widely deployed in warehouse automation, with current systems achieving robust performance through heuristic and learning-based strategies. These systems must balance compact placement with rapid execution, where selecting alternative items or reorienting them can improve space utilization but introduce additional time. We propose a selection-based formulation that explicitly reasons over this trade-off: at each step, the robot evaluates multiple candidate actions, weighing expected packing benefit against estimated operational time. This enables time-aware strategies that selectively accept increased operational time when it yields meaningful spatial improvements. Our method, STEP (Space-Time Efficient Packing), uses a preference-conditioned, Transformer-based reinforcement learning policy, and allows generalization across candidate set sizes and integration with standard placement modules. It achieves a 44% reduction in operational time without compromising packing density. Additional material is available at https://step-packing.github.io.
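
To make the selection-based trade-off concrete, the following is a minimal sketch of choosing among item–face candidates under a preference vector. It is not the learned STEP policy: it assumes a hand-written linear scalarization, and the names `Candidate`, `space_gain`, `time_cost`, and `select_candidate` are hypothetical, as are the normalized scores. The only detail taken from the paper is the preference ordering, where the first element weights space and the second weights time.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    """One item-face option in the buffer (all fields are illustrative)."""
    item_id: int
    face: str
    space_gain: float  # assumed estimate of packing benefit, normalized to [0, 1]
    time_cost: float   # assumed estimate of operational time, normalized to [0, 1]


def select_candidate(
    candidates: Sequence[Candidate],
    preference: Tuple[float, float],
) -> Candidate:
    """Pick the candidate with the best preference-weighted score.

    preference = (w_space, w_time). Space gain is rewarded and
    operational time is penalized. This linear scalarization is an
    assumption made for illustration only.
    """
    if not candidates:
        raise ValueError("candidate set must not be empty")
    w_space, w_time = preference

    def score(c: Candidate) -> float:
        return w_space * c.space_gain - w_time * c.time_cost

    return max(candidates, key=score)


# Hypothetical buffer: one dense-but-slow option and one quick-but-looser option.
buffer = [
    Candidate(item_id=0, face="front", space_gain=0.9, time_cost=0.8),
    Candidate(item_id=1, face="top", space_gain=0.6, time_cost=0.1),
]

space_first = select_candidate(buffer, (0.95, 0.05))
time_first = select_candidate(buffer, (0.02, 0.98))
```

With a space-heavy preference the slower, denser option wins; with a time-heavy preference the quicker option wins. Because the function scores each candidate independently, it works for any buffer size, which loosely mirrors the generalization across candidate set sizes described above.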

Paper Structure (25 sections, 8 equations, 6 figures, 3 tables)


Figures (6)

  • Figure 1: Visualization of the semi-online 3D bin packing problem (3D-BPP) with a buffer. The robot grasps the box from the front-face over the top-face, reorients the box to a transportable configuration, and places it in the bin. The selected face impacts the placement position in the bin.
  • Figure 2: Overall architecture of STEP, a preference-conditioned Transformer-based policy network. The inputs consist of the bin state, item–face buffer with geometric and temporal attributes, and the current preference vector. These are embedded and processed by stacked attention-based encoders called Transformer-Select. The actor outputs logits over item–face candidates conditioned on the preference vector, while the preference-conditioned critic predicts vector-valued returns for space utilization and operational time, aligned with the multi-objective reward structure.
  • Figure 3: The Transformer-Select module used in our proposed method STEP.
  • Figure 4: Pareto front of evaluated policies in the space–time trade-off space (left). Each point represents a distinct policy configuration, with the frontier highlighting non-dominated solutions balancing space utilization and operational time. Packing visualizations for STEP-1 (i) and STEP-3 (ii) at three preference vectors A = $[0.02, 0.98]$, B = $[0.53, 0.47]$, and C = $[0.95, 0.05]$ are shown (right), where the first element weights space and the second weights time. The corresponding $\omega$ values are marked with red circles in the Pareto front.
  • Figure 5: Average Space Utilization ($\%$) and Average Time Taken under increasing variability in item box sizes.
  • ...and 1 more figure
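
The non-dominated frontier mentioned in the Figure 4 caption can be sketched as a simple filter over (space utilization, operational time) pairs. This is an illustrative sketch, not the authors' evaluation code: the function name `pareto_front` and the sample points are assumptions, and the numbers are made up rather than taken from the paper. A point is kept if no other point has at least as much space utilization and at most as much time, with a strict improvement in one of the two.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (space_utilization, operational_time)


def pareto_front(points: Sequence[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by space utilization.

    Space utilization is maximized and operational time is minimized.
    """
    front = []
    for i, (s_i, t_i) in enumerate(points):
        dominated = any(
            s_j >= s_i and t_j <= t_i and (s_j > s_i or t_j < t_i)
            for j, (s_j, t_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((s_i, t_i))
    return sorted(front)


# Made-up policy evaluations: (space utilization, time in seconds).
evaluations = [
    (0.70, 10.0),
    (0.75, 12.0),
    (0.72, 15.0),  # dominated by (0.75, 12.0)
    (0.80, 20.0),
    (0.70, 11.0),  # dominated by (0.70, 10.0)
]

front = pareto_front(evaluations)
```

The pairwise check is quadratic in the number of policies, which is acceptable for the handful of configurations typically plotted in such a figure.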