Curriculum RL meets Monte Carlo Planning: Optimization of a Real World Container Management Problem

Abhijeet Pendyala, Tobias Glasmachers

TL;DR

The paper tackles safe, high-throughput container management in waste-sorting facilities where a single processing unit creates scheduling bottlenecks and collision risks. It introduces a hybrid approach that combines a curriculum-learning PPO (PPO-CL) with an offline Monte Carlo–trained collision model integrated at inference (PPO-CL-CM). Key contributions include a three-phase reward curriculum to handle delayed, sparse rewards and dual-peak emptying targets, plus an offline pairwise collision predictor used to override risky no-ops with minimal runtime overhead. Empirical results across container-to-PU ratios from 7:1 to 12:1 show that PPO-CL-CM reduces collisions and safety-limit violations while preserving or increasing total throughput, providing actionable guidance for real-world facility design and scaling. The work also outlines pathways to extend to multiple PUs and dynamic capacity scenarios, highlighting the practical impact of combining domain-aware safety checks with reinforcement learning in industrial settings.

Abstract

In this work, we augment reinforcement learning (RL) with an inference-time collision model to ensure safe and efficient container management in a waste-sorting facility with limited processing capacity. Each container has two optimal emptying volumes that trade off higher throughput against overflow risk. Conventional RL approaches struggle under delayed rewards, sparse critical events, and high-dimensional uncertainty, and fail to consistently balance higher-volume empties with the risk of safety-limit violations. To address these challenges, we propose a hybrid method comprising: (1) a curriculum-learning pipeline that incrementally trains a PPO agent to handle delayed rewards and class imbalance, and (2) an offline pairwise collision model used at inference time to proactively avert collisions with minimal online cost. Experimental results show that our targeted inference-time collision checks significantly improve collision avoidance, reduce safety-limit violations, maintain high throughput, and scale effectively across varying container-to-PU ratios. These findings offer actionable guidelines for designing safe and efficient container-management systems in real-world facilities.
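
To make the inference-time check concrete, the following is a minimal sketch of how a pairwise collision model could override a risky no-op. It is not the authors' implementation. The action encoding (`NO_OP = 0`, action `i + 1` empties container `i`), the function names, the rule of emptying the fuller container of the riskiest pair, and the toy probability function are all assumptions made for illustration; the paper's model is trained offline from Monte Carlo rollouts.

```python
from itertools import combinations
from typing import Callable, Sequence

# Hypothetical action encoding: 0 = do nothing, i + 1 = empty container i.
NO_OP = 0


def safe_action(
    policy_action: int,
    fill_levels: Sequence[float],
    collision_prob: Callable[[float, float], float],
    threshold: float = 0.5,
) -> int:
    """Return the policy's action, overriding a no-op if a pair looks risky.

    collision_prob stands in for the offline pairwise collision model;
    here it maps two fill levels to an estimated collision probability.
    """
    # Only no-ops are checked; any emptying action passes through unchanged.
    if policy_action != NO_OP:
        return policy_action

    worst_pair, worst_p = None, threshold
    for i, j in combinations(range(len(fill_levels)), 2):
        p = collision_prob(fill_levels[i], fill_levels[j])
        if p > worst_p:
            worst_pair, worst_p = (i, j), p

    if worst_pair is None:
        return NO_OP  # no pair exceeds the threshold, keep the no-op

    # Assumed override rule: empty the fuller container of the riskiest pair.
    i, j = worst_pair
    fuller = i if fill_levels[i] >= fill_levels[j] else j
    return fuller + 1


def toy_collision_prob(a: float, b: float) -> float:
    """Toy stand-in: two containers collide only if both are nearly full."""
    return min(a, b)
```

Because the check runs only when the policy proposes a no-op and only looks up precomputed pairwise estimates, its cost is one pass over the container pairs per step, which is consistent with the "minimal online cost" described in the abstract.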

Paper Structure

This paper contains 26 sections, 1 equation, 4 figures, 2 tables, 2 algorithms.

Figures (4)

  • Figure 1: Layout sketch of a facility with 12 containers and a PU, connected with conveyor belts. The containers are filled from above, with their current fill states indicated by the shaded areas.
  • Figure 2: Performance metrics comparison between PPO-CL and PPO-CL-CM methods across different container configurations (7b1p to 12b1p). The bars show mean values and error bars indicate standard deviation. Left: Press idle time shows the duration the press remains inactive. Right: Total volume processed indicates the amount of material handled during one inference episode of 600 timesteps.
  • Figure 3: Comparison of Coefficient of Variation (CV%) across different collision probability thresholds for all container configurations. Each subplot shows the performance of PPO-CL and PPO-CL-CM methods for a specific configuration. Lower CV% indicates more consistent performance.
  • Figure 4: Comparison of safety limit violation percentages across different bunker configurations. Bars show the percentage of emptying actions that exceeded the safety limit for each bunker configuration using PPO-CL and PPO-CL-CM methods.
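
For reference, the Coefficient of Variation reported in Figure 3 is the standard deviation expressed as a percentage of the mean. A minimal sketch follows; the function name is hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption, since the captions do not say which the paper uses.

```python
import statistics
from typing import Sequence


def cv_percent(values: Sequence[float]) -> float:
    """Coefficient of variation in percent: 100 * std / mean.

    Uses the sample standard deviation (an assumption) and needs at
    least two values with a non-zero mean.
    """
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * statistics.stdev(values) / mean
```

A lower value means the per-episode results are more tightly clustered around their mean, which is how the caption's "more consistent performance" should be read.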