Solving a Real-World Optimization Problem Using Proximal Policy Optimization with Curriculum Learning and Reward Engineering

Abhijeet Pendyala, Asma Atamna, Tobias Glasmachers

TL;DR

This paper tackles the real-world problem of optimizing a high-throughput waste sorting facility under multiple objectives (safety, throughput, and resource usage), where rewards are delayed and the critical actions occur only rarely. It introduces a curriculum-learning framework for PPO that combines five phased stages, reward engineering (Gaussian, Custom, and Precise rewards), and action masking to gradually expose the agent to increasing environmental complexity while maintaining safety. The study demonstrates that PPO-CL outperforms a single-criterion PPO baseline and approaches or surpasses a hand-crafted Optimal Analytic agent in volume accuracy, PU utilization, and safety, achieving near-zero safety violations in many scenarios. The work highlights the practical value of curriculum learning for complex, multi-criteria industrial RL tasks and suggests two future directions: preventing resource contention (PU collisions) and extending the approach to broader real-world control problems.
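
The action masking mentioned above can be illustrated with a minimal sketch. It assumes a discrete action space and a boolean validity mask supplied by the environment; the function name `masked_softmax` and the three-action example are hypothetical and are not taken from the paper. The idea is only that invalid actions receive a logit of negative infinity, so the policy assigns them exactly zero probability.

```python
import math

def masked_softmax(logits, valid):
    """Softmax over logits with invalid actions forced to probability zero."""
    if not any(valid):
        raise ValueError("at least one action must be valid")
    # Invalid actions get -inf, so exp() maps them to exactly 0.0.
    masked = [x if ok else -math.inf for x, ok in zip(logits, valid)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 3-action example: the third action is currently not allowed.
probs = masked_softmax([1.0, 2.0, 3.0], [True, True, False])
```

The remaining probability mass is renormalized over the valid actions, so sampling from `probs` can never select a masked action.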

Abstract

We present a proximal policy optimization (PPO) agent trained through curriculum learning (CL) principles and meticulous reward engineering to optimize a real-world high-throughput waste sorting facility. Our work addresses the challenge of effectively balancing the competing objectives of operational safety, volume optimization, and minimizing resource usage. A vanilla agent trained from scratch on these multiple criteria fails to solve the problem due to its inherent complexities. This problem is particularly difficult due to the environment's extremely delayed rewards with long time horizons and class (or action) imbalance, with important actions being infrequent in the optimal policy. This forces the agent to anticipate long-term action consequences and prioritize rare but rewarding behaviours, creating a non-trivial reinforcement learning task. Our five-stage CL approach tackles these challenges by gradually increasing the complexity of the environmental dynamics during policy transfer while simultaneously refining the reward mechanism. This iterative and adaptable process enables the agent to learn a desired optimal policy. Results demonstrate that our approach significantly improves inference-time safety, achieving near-zero safety violations in addition to enhancing waste sorting plant efficiency.
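
The staged training described in the abstract can be sketched as a loop that carries the policy from one stage to the next while the environment and the reward change. This is only an illustration under stated assumptions: the stage names, the `env_difficulty` values, the assignment of reward variants to stages, and the `toy_train_stage` stand-in are all hypothetical, and a real setup would call a PPO training routine in their place.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    env_difficulty: float  # hypothetical knob for environment complexity
    reward_name: str       # reward variant used in this stage (assumed mapping)

def run_curriculum(stages, train_stage, init_policy):
    """Train stage by stage, passing the learned policy on to the next stage."""
    policy = dict(init_policy)
    history = []
    for stage in stages:
        # Policy transfer: each stage starts from the previous stage's result.
        policy = train_stage(policy, stage)
        history.append(stage.name)
    return policy, history

def toy_train_stage(policy, stage):
    """Stand-in for a PPO training run; it only records what it was given."""
    updated = dict(policy)
    updated["stages_seen"] = updated.get("stages_seen", 0) + 1
    updated["last_reward"] = stage.reward_name
    return updated

# Five placeholder stages; the values are illustrative, not from the paper.
stages = [
    Stage("stage-1", 0.2, "gaussian"),
    Stage("stage-2", 0.4, "gaussian"),
    Stage("stage-3", 0.6, "custom"),
    Stage("stage-4", 0.8, "custom"),
    Stage("stage-5", 1.0, "precise"),
]
final_policy, history = run_curriculum(stages, toy_train_stage, {})
```

Keeping the stage definitions in plain data makes it easy to reorder stages or swap the reward of a single stage without touching the training loop.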

Paper Structure (29 sections, 6 equations, 6 figures, 2 tables, 2 algorithms)

This paper contains 29 sections, 6 equations, 6 figures, 2 tables, 2 algorithms.

Figures (6)

  • Figure 1: Layout sketch of a facility with 11 containers and 2 PUs, connected with conveyor belts. The containers are filled from above, with their current fill states indicated by the shaded areas.
  • Figure 2: Plots showing the various reward functions: Simple Gaussian (left), Custom Reward (centre), and Precise Reward (right).
  • Figure 3: A single rollout (best agent out of 15) of the PPO-CL (left) and PPO-Volume criteria (right) agents on a test environment with 11 containers. Displayed are the volumes, emptying actions, rewards, and time to process by PU-1 and PU-2.
  • Figure 4: Comparison of key performance metrics across different agents, collected over $15$ rollouts of the best policy for both PPO agents. The left figure presents the average total PU utilization across agents. The right figure details the percentage of safety violations.
  • Figure 5: ECDFs of emptying volumes of all 11 containers collected over $15$ rollouts of the best policy for PPO-Volume criteria, PPO-CL, and Optimal analytic agent on a test environment. Average fill rates are indicated in volume units per second. The derivatives of the curves are the PDFs of emptying volumes. Therefore, a steep incline indicates that the corresponding volume is frequent in the corresponding density.
  • ...and 1 more figure
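
The ECDF reading described in the Figure 5 caption can be made concrete with a small sketch. The sample volumes below are made up for illustration and are not data from the paper; the helper `ecdf` is likewise a hypothetical name. The ECDF at `x` is the fraction of samples less than or equal to `x`, so a large jump over a short interval means that many emptying volumes fall in that interval.

```python
from bisect import bisect_right

def ecdf(samples):
    """Return F, where F(x) is the fraction of samples that are <= x."""
    ordered = sorted(samples)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one sample")

    def F(x):
        # bisect_right counts how many sorted samples are <= x.
        return bisect_right(ordered, x) / n

    return F

# Made-up emptying volumes, purely illustrative.
volumes = [10.0, 12.0, 12.0, 15.0, 20.0]
F = ecdf(volumes)

# Rise of the ECDF across the interval (11, 13], where two samples cluster.
jump = F(13.0) - F(11.0)
```

Here `jump` covers two of the five samples, which is the steep incline the caption associates with a frequent volume.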