Logic-informed reinforcement learning for cross-domain optimization of large-scale cyber-physical systems

Guangxi Wan, Peng Zeng, Xiaoting Dong, Chunhe Song, Shijie Cui, Dong Li, Qingwei Dong, Yiyang Liu, Hongfei Bai

Abstract

Cyber-physical systems (CPS) require the joint optimization of discrete cyber actions and continuous physical parameters under stringent safety logic constraints. Existing hierarchical approaches often compromise global optimality, whereas reinforcement learning (RL) in hybrid action spaces typically relies on brittle reward penalties, masking, or shielding and struggles to guarantee constraint satisfaction. We present logic-informed reinforcement learning (LIRL), which equips standard policy-gradient algorithms with a projection that maps a low-dimensional latent action onto the admissible hybrid manifold defined on-the-fly by first-order logic. This guarantees the feasibility of every exploratory step without penalty tuning. We evaluate LIRL in several scenarios, including industrial manufacturing, electric vehicle charging stations, and traffic signal control; in all of them it outperforms existing hierarchical optimization approaches. Taking a robotic reducer assembly system in industrial manufacturing as an example, LIRL achieves maximum reductions of 36.47% to 44.33% in the combined makespan-energy objective compared to conventional industrial hierarchical scheduling methods. It also consistently maintains zero constraint violations and significantly surpasses state-of-the-art hybrid-action reinforcement learning baselines. Because its constraints are formulated declaratively in logic, the framework can be transferred to other domains such as smart transportation and smart grid, paving the way for safe, real-time optimization in large-scale CPS.
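
To make the idea of projecting a latent action onto a state-dependent admissible set concrete, here is a minimal Python sketch. It is not the authors' projection operator: the `DiscreteOption` type, the equal-width binning of the discrete coordinate, the two-dimensional latent action, and the example predicates are all assumptions chosen for illustration. It only shows how a continuous latent vector can always be decoded into a (discrete choice, continuous parameter) pair that satisfies the current predicates, with no decision made when nothing is admissible.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class DiscreteOption:
    """Hypothetical discrete choice, e.g. (task stage, robot workcell)."""
    task: int
    robot: int
    lo: float  # lower bound of the continuous parameter for this option
    hi: float  # upper bound of the continuous parameter for this option


# A constraint is a predicate over (state, option); names are illustrative.
Constraint = Callable[[Dict, DiscreteOption], bool]


def valid_options(options: Sequence[DiscreteOption], state: Dict,
                  constraints: Sequence[Constraint]) -> List[DiscreteOption]:
    """Keep only options that satisfy every predicate in the current state."""
    return [o for o in options if all(c(state, o) for c in constraints)]


def project(z: Tuple[float, float], options: Sequence[DiscreteOption],
            state: Dict, constraints: Sequence[Constraint]
            ) -> Optional[Tuple[DiscreteOption, float]]:
    """Decode a latent action z in [-1, 1]^2 into a feasible hybrid action.

    Returns None when no option is admissible, i.e. no decision is made.
    """
    valid = valid_options(options, state, constraints)
    if not valid:
        return None
    # Clip so exploration noise cannot push the latent action out of the box.
    z_d = min(1.0, max(-1.0, z[0]))
    z_c = min(1.0, max(-1.0, z[1]))
    # Partition [-1, 1] into equal bins, one per admissible discrete option.
    idx = min(int((z_d + 1.0) / 2.0 * len(valid)), len(valid) - 1)
    chosen = valid[idx]
    # Rescale the continuous coordinate into the chosen option's interval.
    u = chosen.lo + (z_c + 1.0) / 2.0 * (chosen.hi - chosen.lo)
    return chosen, u
```

Because the decoder only ever indexes into the filtered list and rescales into the chosen option's bounds, any latent vector, including a noisy exploratory one, yields an action that passes the predicates. That is the property the abstract attributes to projection-based feasibility, shown here under the stated assumptions.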

Paper Structure

This paper contains 23 sections, 6 equations, 7 figures.

Figures (7)

  • Figure 1: Overview of CPS cross-domain optimization and the LIRL framework. A. The concepts of hierarchical optimization and cross-domain optimization. Hierarchical optimization struggles to reach the global optimum because there is no strict optimal substructure between levels, whereas cross-domain optimization can attain the globally optimal solution. B. LIRL introduces two key innovations: latent action projection and dynamic valid-action-space partitioning. It first maps hybrid discrete-continuous actions into a unified latent continuous space, then projects them via the function $\Pi_s$ into the valid explicit action space, which is adaptively partitioned based on system state and constraints. C. The LIRL training process keeps constraints satisfied throughout training while steadily improving performance. D. The process of latent action projection and valid-action-space partitioning. The valid action space represents the constrained solution space under varying CPS system states. Decision-making activates only when the valid action space is non-empty and halts automatically otherwise.
  • Figure 2: Overview of $R^2AMS$. A. $R^2AMS$ consists of modular robotic workcells, each considered a subsystem. To evaluate the generalization capability of LIRL, we designed two dynamic scenarios: task duration uncertainty and robot breakdown. B. Five steps for reducer assembly: place bottom shell, small gear assembling, large gear assembling, top bottom shell, and move reducer to buffer. C. The structure of the hybrid decision-making space. $T_i$ is the $i$-th reducer; $p_{ij}$ is the $j$-th stage of the $i$-th reducer assembly task; $q_k$ is the $k$-th robot workcell; $u_k^{ij}$ represents the trajectory configuration parameters of robot workcell $q_k$ when completing $p_{ij}$. D. The energy signature of a robot workcell processing the five reducer assembly operations. The horizontal axis corresponds to the action processing time, where shorter times correspond to faster robot execution.
  • Figure 3: Results. A. Comparison of cross-domain optimization (cross-opt) solved by LIRL with the Energy-opt and Time-opt methods. Panels I-IV show the reward distribution for varying weights across four scales, with a higher reward denoting superior optimization performance. B. Performance comparison of different optimization methods, measured by a weighted sum of normalized makespan and energy consumption. C.I. Comparison of the training curves of LIRL and the baselines. C.II. The standard deviation of the post-convergence reward distribution quantifies its concentration, with smaller values indicating more stable convergence. C.III. Comparison of convergence performance among different algorithms. The number shown for each algorithm indicates the training episodes required for convergence, defined as the point after which the value remains stable at 95% of the final value. A lower number corresponds to faster convergence.
  • Figure 4: Ablation results. A. Comparison of learning curves at four scales. B. The post-convergence reward distribution. C. Comparison of the average rewards of LIRL and Mask at different scales.
  • Figure 5: Results of the robustness testing experiment. "Training" performance refers to models trained and evaluated at specific perturbation levels, whereas "Generalization" performance is assessed on a single model trained under disturbance-free conditions. "Cov" indicates that the solution obtained through "Generalization" covers the range of the "Training" solution. A. Performance of LIRL-trained strategies under different noise levels. B. Performance of LIRL-trained strategies under different robot failure rates.
  • ...and 2 more figures
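
The caption of Figure 3B describes performance as a weighted sum of normalized makespan and energy consumption, and the abstract reports reductions in that combined objective. The Python sketch below shows one way such a metric could be computed. The min-max normalization, the single weight `w` with complement `1 - w`, the reference bounds, and all function names are assumptions for illustration; the source does not state the exact normalization or weighting.

```python
from typing import Tuple


def normalize(value: float, lo: float, hi: float) -> float:
    """Min-max normalize value into [0, 1] given assumed reference bounds."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return (value - lo) / (hi - lo)


def combined_objective(makespan: float, energy: float, w: float,
                       makespan_bounds: Tuple[float, float],
                       energy_bounds: Tuple[float, float]) -> float:
    """Weighted sum of normalized makespan and energy; lower is better."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    m = normalize(makespan, *makespan_bounds)
    e = normalize(energy, *energy_bounds)
    return w * m + (1.0 - w) * e


def relative_reduction(baseline: float, candidate: float) -> float:
    """Percentage reduction of the candidate objective versus a baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (baseline - candidate) / baseline
```

With a metric of this shape, a figure such as the abstract's percentage reduction would be `relative_reduction(baseline_objective, lirl_objective)`, where both objectives are evaluated with the same weight and the same bounds.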