Large-scale automatic carbon ion treatment planning for head and neck cancers via parallel multi-agent reinforcement learning

Jueye Zhang, Chao Yang, Youfang Lai, Kai-Wen Li, Wenting Yan, Yunzhou Xia, Haimei Zhang, Jingjing Zhou, Gen Yang, Chen Lin, Tian Li, Yibao Zhang

TL;DR

The paper tackles head-and-neck IMCT planning, where balancing CTV coverage against sparing many nearby OARs is difficult and is slowed further by RBE modeling. It introduces a parallel multi-agent reinforcement learning framework built on a CTDE backbone (QMIX) with DRQN encoding that tunes 45 treatment-planning parameters in parallel, using DVH histories as state, a linear action-to-value mapping, and an absolute reward design. On a 20-case dataset (10 training, 10 testing), RL plans were comparable to or better than expert manual plans, with statistically significant improvements ($p < 0.05$) for five OARs and favorable DVHs for most organs. The approach accelerates plan generation and demonstrates scalable automatic planning for IMCT, though broader, multi-institution validation is needed to confirm generalizability.

Abstract

Head-and-neck cancer (HNC) planning is difficult because multiple critical organs-at-risk (OARs) are close to complex targets. Intensity-modulated carbon-ion therapy (IMCT) offers superior dose conformity and OAR sparing but remains slow due to relative biological effectiveness (RBE) modeling, leading to laborious, experience-based, and often suboptimal tuning of many treatment-planning parameters (TPPs). Recent deep learning (DL) methods are limited by data bias and plan feasibility, while reinforcement learning (RL) struggles to efficiently explore the exponentially large TPP search space. We propose a scalable multi-agent RL (MARL) framework for parallel tuning of 45 TPPs in IMCT. It uses a centralized-training decentralized-execution (CTDE) QMIX backbone with Double DQN, Dueling DQN, and recurrent encoding (DRQN) for stable learning in a high-dimensional, non-stationary environment. To enhance efficiency, we (1) use compact historical DVH vectors as state inputs, (2) apply a linear action-to-value transform mapping small discrete actions to uniform parameter adjustments, and (3) design an absolute, clinically informed piecewise reward aligned with plan scores. A synchronous multi-process worker system interfaces with the PHOENIX TPS for parallel optimization and accelerated data collection. On a head-and-neck dataset (10 training, 10 testing), the method tuned 45 parameters simultaneously and produced plans comparable to or better than expert manual ones (relative plan score: RL $85.93\pm7.85\%$ vs Manual $85.02\pm6.92\%$), with significant ($p < 0.05$) improvements for five OARs. The framework efficiently explores high-dimensional TPP spaces and generates clinically competitive IMCT plans through direct TPS interaction, notably improving OAR sparing.
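
The linear action-to-value transform mentioned in the abstract can be pictured with a minimal sketch. Everything concrete here is an assumption for illustration and is not taken from the paper: the function names, the choice of 5 actions per agent, the step size of 0.1, and the [0, 1] parameter bounds. The idea shown is only that a small discrete action index is mapped linearly to an evenly spaced signed adjustment, which is then added to the current parameter value.

```python
from typing import List, Sequence


def action_to_delta(action: int, n_actions: int = 5, step: float = 0.1) -> float:
    """Map a discrete action index to an evenly spaced signed adjustment.

    Hypothetical settings: with 5 actions and step 0.1, indices 0..4
    map to -0.2, -0.1, 0.0, +0.1, +0.2 (the middle action means "no change").
    """
    if not 0 <= action < n_actions:
        raise ValueError("action index out of range")
    center = (n_actions - 1) / 2
    return (action - center) * step


def apply_actions(
    params: Sequence[float],
    actions: Sequence[int],
    lower: float = 0.0,
    upper: float = 1.0,
    n_actions: int = 5,
    step: float = 0.1,
) -> List[float]:
    """Apply one action per parameter (one agent per parameter) in parallel.

    The bounds are assumed; results are clipped to [lower, upper].
    """
    updated = []
    for p, a in zip(params, actions):
        new_value = p + action_to_delta(a, n_actions, step)
        updated.append(min(max(new_value, lower), upper))
    return updated


if __name__ == "__main__":
    # Three hypothetical parameters, each adjusted by its own agent.
    print(apply_actions([0.5, 0.95, 0.05], [2, 4, 0]))
```

Because each agent only picks from a handful of actions, the joint update over 45 parameters stays tractable per agent even though the joint action space grows exponentially with the number of parameters.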

Paper Structure

This paper contains 12 sections, 4 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: The overview of our parallel MARL algorithm. (a) Multi-process data workers interact with TPS in parallel to fill the data bank with planning experience. (b) The multi-agent system retrieves data from the data bank to optimize its policy under the supervision of a central director, who facilitates global communication among different agents. (c) The trained policy subsequently guides data collection through an action-to-value transformation. Data collection and agent training are carried out sequentially and mutually enhance each other, ultimately resulting in high-quality plan data in the data bank and a well-trained treatment planner.
  • Figure 2: The overview of our multi-agent system. Every agent takes all the historical DVHs as input and estimates a $V$-value and an $A$-value, which are combined into its $Q$-value. The estimated $Q$-values of all agents are then fed into a central director $f_{\theta }$ to obtain the final estimate of the $Q_{tot}$-value.
  • Figure 3: Training results. (a) Performance of all the patients (mean $\pm$ std; shaded area indicates one standard deviation). (b) Episode return of all the data workers (mean $\pm$ std; shaded area indicates one standard deviation). (c) Temporal Difference (TD) loss. (d) The Q-value of all the agents.
  • Figure 4: Statistical comparison between "RL" and "Manual". (a) Box plot of the relative plan score (with p-values indicating whether "RL" outperforms "Manual"). (b) Box plot of all the plan metrics (with p-values indicating whether "RL" outperforms "Manual"; highlighted in red if statistically significant at the 0.05 threshold). (c) Radar chart of relative score of all the plan metrics.
  • Figure 5: Average DVHs of all the OARs and the CTV (mean $\pm$ std; shaded area indicates one standard deviation).
  • ...and 1 more figure
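
The value flow described in the Figure 2 caption (per-agent $V$ and $A$ combined into $Q$, then mixed into $Q_{tot}$) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's network: the dueling aggregation uses the common mean-subtracted form from Dueling DQN, and the central director $f_{\theta}$ is stood in for by a single linear layer with fixed weights, whereas QMIX generates its mixing weights from the global state with hypernetworks. The function names and numbers are hypothetical. The property preserved is the QMIX monotonicity constraint: mixing weights are made non-negative, so raising any agent's $Q$-value never lowers $Q_{tot}$.

```python
from typing import List, Sequence


def dueling_q(value: float, advantages: Sequence[float]) -> List[float]:
    """Combine a state value V and per-action advantages A into Q-values.

    Uses the mean-subtracted form Q(a) = V + A(a) - mean(A), assumed here
    as the standard Dueling DQN aggregation.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def monotonic_mix(
    agent_qs: Sequence[float], weights: Sequence[float], bias: float = 0.0
) -> float:
    """Single-layer stand-in for the central director f_theta.

    Taking abs() of each weight keeps the mix monotonic in every agent's Q.
    """
    return sum(abs(w) * q for w, q in zip(weights, agent_qs)) + bias


if __name__ == "__main__":
    # Two hypothetical agents, three actions each.
    q_agent_1 = dueling_q(1.0, [0.0, 1.0, 2.0])
    q_agent_2 = dueling_q(0.5, [0.3, -0.3, 0.0])
    # Each agent acts greedily on its own Q-values (decentralized execution).
    chosen = [max(q_agent_1), max(q_agent_2)]
    print(monotonic_mix(chosen, weights=[-0.5, 2.0], bias=0.25))
```

Monotonic mixing is what lets each agent select its action greedily from its own $Q$-values at execution time while training still optimizes a single joint $Q_{tot}$.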