Resilient by Design -- Active Inference for Distributed Continuum Intelligence

Praveen Kumar Donta, Alfreds Lapkovskis, Enzo Mingozzi, Schahram Dustdar

TL;DR

This work tackles resilience in the Distributed Computing Continuum (DCC) by introducing PAIR-Agent, a probabilistic active-inference framework that builds a time-evolving causal fault graph from device logs, performs uncertainty-aware fault inference via Markov blankets and the free energy principle, and achieves self-healing through active inference. The method jointly handles heterogeneity, partial telemetry, and dynamic workloads by localizing inference to fault neighborhoods and minimizing expected free energy to guide corrective actions. Theoretical results establish locality and scalability of inference, the existence of an optimal variational posterior, safety of action choices, and robustness to missing data, supporting autonomous resilience across cloud, fog, edge, and IoT layers. Overall, PAIR-Agent offers a principled, self-contained approach to continuous perception–inference–action loops that sustain service continuity and stability in complex DCC environments.
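
To make "localizing inference to fault neighborhoods" concrete, the sketch below computes the Markov blanket of one node in a small directed fault graph: its parents, its children, and the other parents of its children. This is only an illustration under stated assumptions. The graph, the event names, and the `markov_blanket` helper are hypothetical and are not taken from the paper, which may build and query its causal fault graph differently.

```python
from collections import defaultdict


def markov_blanket(edges, node):
    """Return the Markov blanket of `node` in a DAG.

    `edges` is an iterable of (parent, child) pairs. The blanket is the
    union of the node's parents, its children, and its children's other
    parents (co-parents).
    """
    parents = defaultdict(set)
    children = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)
        children[src].add(dst)

    blanket = set(parents[node]) | set(children[node])
    for child in children[node]:
        blanket |= parents[child]  # add co-parents of each child
    blanket.discard(node)  # the node itself is not part of its blanket
    return blanket


# Hypothetical causal fault graph: each edge reads "cause -> effect".
edges = [
    ("power_drop", "sensor_timeout"),
    ("sensor_timeout", "edge_queue_overflow"),
    ("network_jitter", "edge_queue_overflow"),
    ("edge_queue_overflow", "cloud_sla_violation"),
]

print(sorted(markov_blanket(edges, "sensor_timeout")))
# → ['edge_queue_overflow', 'network_jitter', 'power_drop']
```

Conditioned on these three neighbors, `sensor_timeout` is independent of the rest of the graph (here, `cloud_sla_violation`), which is the property that lets inference stay local rather than span the whole continuum.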

Abstract

Failures are the norm in highly complex and heterogeneous devices spanning the distributed computing continuum (DCC), from resource-constrained IoT and edge nodes to high-performance computing systems. Ensuring reliability and global consistency across these layers remains a major challenge, especially for AI-driven workloads requiring real-time, adaptive coordination. This work-in-progress paper introduces a Probabilistic Active Inference Resilience Agent (PAIR-Agent) to achieve resilience in DCC systems. PAIR-Agent performs three core operations: (i) constructing a causal fault graph from device logs, (ii) identifying faults while managing certainties and uncertainties using Markov blankets and the free energy principle, and (iii) autonomously healing issues through active inference. Through continuous monitoring and adaptive reconfiguration, the agent maintains service continuity and stability under diverse failure conditions. Theoretical validations confirm the reliability and effectiveness of the proposed framework.
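
Operation (iii), healing through active inference, can be illustrated with a textbook discrete formulation in which each candidate action is scored by its expected free energy, taken here as risk (divergence of predicted observations from preferred ones) plus ambiguity (expected observation entropy). The sketch below is an assumption-laden toy and not the paper's model. The two states (healthy, faulty), the two observations (ok, alarm), the action names, and all probabilities are invented for illustration.

```python
import math


def expected_free_energy(q_states, likelihood, preferred_obs):
    """Score one candidate action as risk + ambiguity.

    q_states[s]      : predicted probability of state s after the action
    likelihood[s][o] : P(observation o | state s)
    preferred_obs[o] : preferred observation distribution (all entries > 0)
    """
    n_states = len(q_states)
    n_obs = len(preferred_obs)
    # Predicted observation distribution under the action.
    q_obs = [
        sum(q_states[s] * likelihood[s][o] for s in range(n_states))
        for o in range(n_obs)
    ]
    # Risk: KL divergence from predicted to preferred observations.
    risk = sum(
        q * math.log(q / p) for q, p in zip(q_obs, preferred_obs) if q > 0
    )
    # Ambiguity: expected entropy of observations given the state.
    ambiguity = sum(
        q_states[s] * -sum(p * math.log(p) for p in likelihood[s] if p > 0)
        for s in range(n_states)
    )
    return risk + ambiguity


# Hypothetical model: states = (healthy, faulty), observations = (ok, alarm).
likelihood = [[0.95, 0.05], [0.10, 0.90]]
preferred_obs = [0.99, 0.01]  # the agent strongly prefers to observe "ok"
predicted_states = {
    "restart_service": [0.9, 0.1],
    "do_nothing": [0.2, 0.8],
}

scores = {
    action: expected_free_energy(q, likelihood, preferred_obs)
    for action, q in predicted_states.items()
}
best_action = min(scores, key=scores.get)
print(best_action)  # → restart_service
```

The action with the lowest score is the one expected to bring observations closest to the preferred ones while leaving the least uncertainty about what will be observed.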

Paper Structure

This paper contains 8 sections, 4 equations, 1 figure.

Figures (1)

  • Figure 1: The proposed PAIR-Agent for resilience in the DCC collects and normalizes logs to form a CFG, infers faults via Markov blanket and FEP analysis, and heals through active inference to restore system stability.