Automating proton PBS treatment planning for head and neck cancers using policy gradient-based deep reinforcement learning

Qingqing Wang, Chang Chang

TL;DR

Proton PBS planning for head and neck (H&N) cancer is difficult because many planning objectives must be balanced, and existing Q-learning-based DRL methods are restricted to discrete action spaces. The authors propose a PPO-based, continuous-action DRL framework with a Transformer-based actor–critic policy and a dose distribution–based reward, paired with an in-house L-BFGS optimizer that generates the spot MUs. On H&N cases the generated plans reach human-level quality, with equal or superior target coverage and improved OAR sparing compared with human planners, and additional liver experiments indicate that the method generalizes to other treatment sites. The approach offers a scalable, automated alternative to manual planning for complex proton PBS cases.

Abstract

Proton pencil beam scanning (PBS) treatment planning for head and neck (H&N) cancers is a time-consuming and experience-demanding task that involves a large number of planning objectives. Deep reinforcement learning (DRL) has recently been introduced to the planning processes of intensity-modulated radiation therapy and brachytherapy for prostate, lung, and cervical cancers. However, existing approaches are built upon the Q-learning framework and weighted linear combinations of clinical metrics; they scale poorly, lack flexibility, and can adjust only a limited number of planning objectives in discrete action spaces. We propose an automatic treatment planning model using the proximal policy optimization (PPO) algorithm and a dose distribution-based reward function for proton PBS treatment planning of H&N cancers. Specifically, a set of empirical rules is used to create auxiliary planning structures from target volumes and organs-at-risk (OARs), along with their associated planning objectives. These planning objectives are fed into an in-house optimization engine to generate the spot monitor unit (MU) values. A decision-making policy network trained using PPO is developed to iteratively adjust the involved planning objective parameters in a continuous action space and refine the PBS treatment plans using a novel dose distribution-based reward function. Proton H&N treatment plans generated by the model show improved OAR sparing with equal or superior target coverage when compared with human-generated plans. Moreover, additional experiments on liver cancer demonstrate that the proposed method can be successfully generalized to other treatment sites. To the best of our knowledge, this is the first DRL-based automatic treatment planning model capable of achieving human-level performance for H&N cancers.
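
The abstract names PPO as the training algorithm but does not spell out its objective. As a minimal sketch, the standard clipped surrogate objective from the PPO literature could look like the code below. The function name, the list-based batch layout, and the default clip_eps=0.2 are illustrative assumptions, not details taken from the paper.

```python
import math


def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective over a batch (to be maximized).

    new_log_probs / old_log_probs: log-probabilities of the sampled actions
    under the current policy and under the policy that collected the data.
    advantages: advantage estimates for the same samples.
    clip_eps: clipping range (0.2 is a common default, assumed here).
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the current and the data-collecting policy.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic bound: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

Clipping removes the incentive to move the policy far from the one that generated the data in a single update, which is one reason PPO is often used for continuous action spaces such as the planning objective parameters described above.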

Paper Structure (17 sections, 12 equations, 8 figures)

Figures (8)

  • Figure 1: Framework of proposed automatic treatment planning method.
  • Figure 2: Auxiliary planning structures (masked in red) for target volumes.
  • Figure 3: Transformer-based actor-critic agent. The actor predicts a distribution for action sampling and the critic evaluates the actor's performance. "Log_std" is the randomly initialized standard deviation of the predicted distribution; it is learned during training together with the other network parameters.
  • Figure 4: Dose distribution-based reward.
  • Figure 5: Rewards and losses obtained during the training procedure.
  • ...and 3 more figures
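
The Figure 3 caption describes an actor that predicts a distribution from which continuous actions are sampled, with a learned "Log_std" parameter. A minimal sketch of that sampling step follows, assuming a diagonal Gaussian policy and assuming that "Log_std" holds the logarithm of the per-dimension standard deviation. Both assumptions, along with the function names, are illustrative and not confirmed by the text above.

```python
import math
import random


def gaussian_log_prob(action, mean, log_std):
    """Log-density of `action` under a diagonal Gaussian policy."""
    log_prob = 0.0
    for a, mu, ls in zip(action, mean, log_std):
        std = math.exp(ls)
        # Per-dimension Gaussian log-density, summed over action dimensions.
        log_prob += -0.5 * ((a - mu) / std) ** 2 - ls - 0.5 * math.log(2 * math.pi)
    return log_prob


def sample_action(mean, log_std, rng):
    """Sample one continuous action and return it with its log-probability.

    mean: per-dimension means (hypothetically, the actor's output).
    log_std: per-dimension log standard deviations (learned parameters).
    rng: a random.Random instance, passed in for reproducibility.
    """
    action = [rng.gauss(mu, math.exp(ls)) for mu, ls in zip(mean, log_std)]
    return action, gaussian_log_prob(action, mean, log_std)


# Usage: two hypothetical objective-parameter adjustments.
rng = random.Random(0)
action, log_prob = sample_action([0.1, -0.3], [-1.0, -1.0], rng)
```

Parameterizing the spread through its logarithm keeps the standard deviation positive without constraining the optimizer, and the returned log-probability is the quantity a policy-gradient update such as PPO needs for each sampled action.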