Green Resilience of Cyber-Physical Systems: Doctoral Dissertation
Diaeddin Rimawi
TL;DR
This work addresses the challenge of maintaining performance in online collaborative AI systems (OL-CAIS) under exogenous disruptions while minimizing energy impact. It introduces the GResilience framework, which combines one-agent optimization, two-agent game theory, and reinforcement learning to balance resilience and greenness, and validates these policies on the CORAL collaborative robot through real-world and simulated experiments. A resilience model based on the Autonomous Classification Ratio (ACR) tracks performance evolution across steady, disruptive, and final states to guide automatic recovery, and a measurement framework quantifies recovery speed, steadiness, green efficiency, and autonomy. Containerization is shown to halve CO2 emissions and improve resilience, while RL-agent policies offer the strongest performance at a higher computational and environmental cost. The work also analyzes catastrophic forgetting and proposes runtime policies to sustain steady performance over repeated disruptions, culminating in a practical toolkit (CAIS-DMA) for green, resilient OL-CAIS deployment.
Abstract
Cyber-physical systems (CPS) combine computational and physical components. An Online Collaborative AI System (OL-CAIS) is a type of CPS that learns online in collaboration with humans to achieve a common goal, which makes it vulnerable to disruptive events that degrade performance. Decision-makers must therefore restore performance while limiting energy impact, creating a trade-off between resilience and greenness. This research addresses how to balance these two properties in OL-CAIS. It aims to model resilience for automatic state detection, develop agent-based policies that optimize the greenness-resilience trade-off, and understand catastrophic forgetting in order to maintain performance consistency. We model OL-CAIS behavior through three operational states: steady, disruptive, and final. To support recovery during disruptions, we introduce the GResilience framework, which provides recovery strategies through multi-objective optimization (one-agent), game-theoretic decision-making (two-agent), and reinforcement learning (RL-agent). We also design a measurement framework to quantify resilience and greenness. The empirical evaluation uses real and simulated experiments with a collaborative robot that learns object classification from human demonstrations. Results show that the resilience model captures performance transitions during disruptions, and that GResilience policies improve green recovery by shortening recovery time, stabilizing performance, and reducing human dependency. RL-agent policies achieve the strongest results, albeit with a marginal increase in CO2 emissions. We also observe catastrophic forgetting after repeated disruptions, although our policies help maintain steadiness. A comparison with containerized execution shows that containerization cuts CO2 emissions by half. Overall, this research provides models, metrics, and policies that ensure the green recovery of OL-CAIS.
