Policy-Based Radiative Transfer: Solving the $2$-Level Atom Non-LTE Problem using Soft Actor-Critic Reinforcement Learning

Brandon Panos, Ivan Milic

TL;DR

This work reframes the classical 2-level atom non-LTE radiative transfer problem as a control task in which a reinforcement learning agent learns a depth-dependent source function $S(τ_c)$ that self-consistently satisfies the equation of statistical equilibrium (SE). Using Soft Actor-Critic (SAC) to optimize a parameterized sigmoid-based representation of $S(τ_c)$, the agent interacts with a radiative transfer solver to maximize a reward based on the SE residual, avoiding both labeled data and backpropagation through the solver. In a 1D plane-parallel setup with complete frequency redistribution (CRD), the SAC policy reaches SE in fewer inner-loop iterations than the traditional accelerated lambda iteration (ALI) method, whereas a greedy feedforward network fails, possibly because of the moving-target nature of the problem. The study demonstrates a $Λ^*$-free RL pathway to SE with potential applicability to more complex atmospheres and geometries, offering a flexible framework that could accelerate radiative transfer computations if it generalizes beyond the training regime.

Abstract

We present a novel reinforcement learning (RL) approach for solving the classical 2-level atom non-LTE radiative transfer problem by framing it as a control task in which an RL agent learns a depth-dependent source function $S(τ)$ that self-consistently satisfies the equation of statistical equilibrium (SE). The agent's policy is optimized entirely via reward-based interactions with a radiative transfer (RT) engine, without explicit knowledge of the ground truth. This method bypasses the need to construct the approximate lambda operators ($Λ^*$) common in accelerated iterative schemes. Additionally, it requires no extensive precomputed labeled datasets to extract a supervisory signal, and it avoids backpropagating gradients through the complex RT solver itself. Finally, we show through experiment that a simple feedforward neural network trained greedily cannot solve for SE, possibly due to the moving-target nature of the problem. Our $Λ^*$-free method offers potential advantages for complex scenarios (e.g., atmospheres with enhanced velocity fields, multi-dimensional geometries, or complex microphysics) where $Λ^*$ construction or solver differentiability is challenging. Additionally, the agent can be incentivized to find more efficient policies by lowering the discount factor, which shifts priority toward immediate rewards. If demonstrated to generalize past its training data, this RL framework could serve as an alternative or accelerated formalism to achieve SE. To the best of our knowledge, this study represents the first application of reinforcement learning in solar physics that directly solves for a fundamental physical constraint.
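
For orientation, the self-consistency condition the agent is asked to satisfy can be written in the standard textbook form for a 2-level atom with complete frequency redistribution. The symbols $\epsilon$ (photon destruction probability), $\bar{J}$ (profile-averaged mean intensity), $R$, and $G$ are notation assumed here and do not appear in the text above, and the specific residual shown is only one plausible choice, since this summary does not state the exact reward the authors use.

$$
S(\tau) = (1-\epsilon)\,\bar{J}(\tau) + \epsilon\,B(\tau), \qquad \bar{J} = \Lambda[S]
$$

$$
R(\tau) = S_{\text{agent}}(\tau) - \Big[(1-\epsilon)\,\Lambda[S_{\text{agent}}](\tau) + \epsilon\,B(\tau)\Big]
$$

The bracketed term plays the role of the "implied" source function returned by the RT engine, and SE corresponds to $R(\tau) \to 0$ at all depths. With a per-step reward $r_t$ built from $-\lVert R \rVert$ and a discount factor $\gamma \in (0,1]$, the agent maximizes the discounted return $G = \sum_{t} \gamma^{t} r_t$, so a smaller $\gamma$ places relatively more weight on reducing the residual in the earliest steps.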

Paper Structure

This paper contains 8 sections, 7 equations, 6 figures.

Figures (6)

  • Figure 1: Diagram of the training loop. Proceeding clockwise: the source function is initialized to the Planck function $B$, which is then sent to the agent as a state. The agent's policy decides on an action that generates four parameters $\mathbf{p}$, which are used to construct a smooth, well-behaved source function across the entire depth scale. The agent's predicted solution is passed to the physics engine, which generates a conditioned "implied" source function that would satisfy SE. The residual between the agent's source function and the implied source function is used as a reward signal to update the agent's policy. The inner loop iterates until either a maximum step count is reached or the agent attains the target, which defines a single episode. Once the inner loop terminates, the source function is again initialized to the Planck function, and the agent can try to refine its policy.
  • Figure 2: Reachable solution space of the parameterized source function. The density heatmap (log scale) shows the frequency of $\log_{10}(S)$ values versus $\log_{10}(\tau_c)$ based on $50{,}000$ random parameter samples. Overlays show the initial guess (black), target ALI solution (blue dashed), and example random profiles (light gray).
  • Figure 3: Training performance of the SAC agent. The figure shows the mean normalized reward, critic loss, actor loss, and entropy coefficient $\alpha$, as a function of training steps. The steadily increasing reward demonstrates successful learning, while the critic loss decreases, indicating convergence of the value function estimate. The decreasing entropy coefficient signifies a shift from exploration towards policy exploitation.
  • Figure 4: Upper panel: The ALI method converges to SE (dashed blue line) after multiple iterations. Middle panel: The SAC policy drives the simple non-LTE simulation into SE with fewer iterations. Decreasing the discount factor promotes policies that converge faster. Lower panel: Evolution of the line-of-sight ($\mu=1$) observed intensity as a function of the agent's policy.
  • Figure 5: Evolution of parameters during SAC training: The six phase planes show how the source function parameters change over time, with darker blue dots indicating later agent actions. The target solution of SE in the parameter space is indicated by a black star, while the policy's optimal solution is indicated by a black square. The plot shows how the parameters migrate during training towards the target setting.
  • ...and 1 more figures
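
The interaction loop described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the four-parameter sigmoid form, the toy operator standing in for the physics engine, the relative-residual reward, and every function name are hypothetical, and a random policy is used as a placeholder where an SAC agent would sit.

```python
import numpy as np

def sigmoid_log_source(log_tau, p):
    """Hypothetical 4-parameter sigmoid for log10(S) as a function of log10(tau)."""
    surface, deep, center, width = p  # asymptotes, transition depth, transition width
    z = (log_tau - center) / width
    return surface + (deep - surface) / (1.0 + np.exp(-z))

def toy_lambda_matrix(log_tau, smooth=0.5):
    """Toy stand-in for the RT engine: a smoothing kernel with surface photon loss.
    This is NOT a real formal solver; it only mimics J = Lambda[S] qualitatively."""
    tau = 10.0 ** log_tau
    kernel = np.exp(-np.abs(log_tau[:, None] - log_tau[None, :]) / smooth)
    kernel /= kernel.sum(axis=1, keepdims=True)   # each row sums to 1
    escape = 0.5 * np.exp(-tau)                   # crude escape fraction near the surface
    return (1.0 - escape)[:, None] * kernel

def implied_source(S, lam, eps, B):
    """Source function implied by SE for a 2-level atom: (1 - eps) * J + eps * B."""
    return (1.0 - eps) * (lam @ S) + eps * B

def reward_from_residual(S, S_implied):
    """Assumed reward: negative mean relative residual (zero when SE is satisfied)."""
    return -float(np.mean(np.abs(S - S_implied) / S_implied))

def run_episode(policy, log_tau, lam, eps=1e-2, B=1.0,
                gamma=0.99, max_steps=20, tol=1e-3):
    """One episode: start from S = B, act until the residual is small or steps run out."""
    state = np.full_like(log_tau, B)              # source function initialized to Planck
    discounted_return, discount = 0.0, 1.0
    for step in range(max_steps):
        p = policy(state)                         # action: four profile parameters
        S = 10.0 ** sigmoid_log_source(log_tau, p)
        r = reward_from_residual(S, implied_source(S, lam, eps, B))
        discounted_return += discount * r
        discount *= gamma
        state = S                                 # next state is the agent's profile
        if -r < tol:                              # target reached, episode ends
            break
    return discounted_return, step + 1, S

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    log_tau = np.linspace(-4.0, 4.0, 41)
    lam = toy_lambda_matrix(log_tau)

    def random_policy(state):
        # Placeholder for a trained SAC actor: samples parameters uniformly.
        return np.array([rng.uniform(-2.0, 0.0), rng.uniform(-0.5, 0.0),
                         rng.uniform(-1.0, 3.0), rng.uniform(0.3, 1.5)])

    ret, n_steps, _ = run_episode(random_policy, log_tau, lam)
    print(f"discounted return = {ret:.3f} after {n_steps} steps")
```

In an actual training run the random policy would be replaced by an actor network updated from stored transitions, and the toy operator by a formal solver for the transfer equation. The loop structure (reset to $B$, act, query the engine, score the residual, terminate on success or on a step limit) is the part that mirrors the caption.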