Provable Performance Bounds for Digital Twin-driven Deep Reinforcement Learning in Wireless Networks: A Novel Digital-Twin Bisimulation Metric

Zhenyu Tao, Wei Xu, Xiaohu You

TL;DR

DT-BSM provides a policy-independent metric to quantify the fidelity of a digital twin for DRL in wireless networks by measuring the discrepancy between the DT MDP and the real MDP using a Wasserstein-based distance. The main result shows deployment regret in the real network is bounded by a term proportional to the DT-BSM plus the DT-based policy sub-optimality, and a TV-based bound offers a scalable alternative. To address practicality, the paper introduces an empirical DT-BSM using sampling with convergence guarantees and a quantified sample size requirement. Numerical experiments on admission control demonstrate that deployment performance aligns with the theoretical bounds, validating DT-BSM as a practical, provable tool for DT-driven DRL in wireless networks.

Abstract

Digital twin (DT)-driven deep reinforcement learning (DRL) has emerged as a promising paradigm for wireless network optimization, offering a safe and efficient training environment for policy exploration. However, existing methods cannot theoretically guarantee the real-world performance of DT-trained policies before actual deployment, because there is no universal metric for assessing a DT's ability to support reliable DRL training that is transferable to physical networks. In this paper, we propose the DT bisimulation metric (DT-BSM), a novel metric based on the Wasserstein distance, to quantify the discrepancy between the Markov decision process (MDP) in the DT and the MDP in the corresponding real-world wireless network environment. We prove that for any DT-trained policy, the sub-optimality of its performance (regret) in real-world deployment is bounded by a weighted sum of the DT-BSM and its sub-optimality within the MDP in the DT. A modified DT-BSM based on the total variation distance is then introduced to avoid the prohibitive computational complexity of the Wasserstein distance in large-scale wireless network scenarios. Further, to tackle the challenge of obtaining accurate real-world transition probabilities for the DT-BSM calculation, we propose an empirical DT-BSM method based on statistical sampling. We prove that the empirical DT-BSM always converges to the desired theoretical one, and we quantitatively establish the relationship between the required sample size and the target level of approximation accuracy. Numerical experiments validate this first theoretical finding on provable and calculable performance bounds for DT-driven DRL.
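To make the sampling idea in the abstract concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a toy next-state distribution over three ordered states (the values `dt_next` and `real_next` are invented for illustration) and computes only a one-step Wasserstein-1 distance, not the full DT-BSM. It then replaces the real distribution with an empirical one built from samples, which is the kind of approximation an empirical metric relies on.

```python
import random
from itertools import accumulate


def wasserstein_1d(p, q):
    """Wasserstein-1 distance between two distributions on states 0..n-1,
    assuming the ground distance between states i and j is |i - j|."""
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    # On an ordered line with unit spacing, W1 is the sum of CDF gaps.
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


def empirical_distribution(p, num_samples, rng):
    """Estimate a distribution from samples, as one would from real-network logs."""
    counts = [0] * len(p)
    for state in rng.choices(range(len(p)), weights=p, k=num_samples):
        counts[state] += 1
    return [c / num_samples for c in counts]


# Hypothetical next-state distributions for one (state, action) pair.
dt_next = [0.5, 0.3, 0.2]    # assumed DT transition row
real_next = [0.4, 0.3, 0.3]  # assumed real-network transition row

exact = wasserstein_1d(dt_next, real_next)

rng = random.Random(0)
for n in (100, 10_000):
    est = wasserstein_1d(dt_next, empirical_distribution(real_next, n, rng))
    print(f"samples={n}: estimate={est:.4f}, exact={exact:.4f}")
```

With more samples the estimate should move closer to the exact value, which is the qualitative behavior the convergence result describes; the paper's actual sample-size requirement is not reproduced here.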

Paper Structure

This paper contains 14 sections, 12 theorems, 57 equations, 6 figures.

Key Result

Theorem 1

For any policy $\pi$ learned in the DT MDP, the sub-optimality of its performance when transferred to the real MDP is upper-bounded by a weighted sum of $d_\textnormal{TV}$ and the policy's sub-optimality within the DT MDP, where $d_\textnormal{TV}$ is the modified DT-BSM constructed using the total variation distance.
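As a rough illustration of the total variation distance that underlies $d_\textnormal{TV}$, the sketch below computes the TV distance between DT and real transition rows and takes the largest gap over all state-action pairs. This is an assumption-laden toy, not the paper's definition of the modified DT-BSM: the dictionaries `dt_transitions` and `real_transitions`, their keys, and the worst-case aggregation are all hypothetical.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions
    defined on the same finite state space."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def max_tv_gap(dt_transitions, real_transitions):
    """Largest one-step TV distance over the (state, action) pairs in the DT model.
    Hypothetical aggregation, used here only for illustration."""
    return max(
        total_variation(dt_transitions[key], real_transitions[key])
        for key in dt_transitions
    )


# Hypothetical transition rows keyed by (state, action).
dt_transitions = {
    (0, "admit"): [0.5, 0.3, 0.2],
    (0, "reject"): [0.9, 0.1, 0.0],
}
real_transitions = {
    (0, "admit"): [0.4, 0.3, 0.3],
    (0, "reject"): [0.9, 0.1, 0.0],
}

gap = max_tv_gap(dt_transitions, real_transitions)
print(f"worst-case one-step TV gap: {gap:.3f}")
```

Unlike the Wasserstein distance, this computation needs no optimal transport plan, only an elementwise sum, which is consistent with the scalability motivation given for the TV-based variant.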

Figures (6)

  • Figure 1: A schematic of DT-driven DRL
  • Figure 2: The comparison of BSM and DT-BSM
  • Figure 3: Quadrilateral inequality of DT-BSM.
  • Figure 4: Schematic of the transportation plan.
  • Figure 5: Sub-optimality of transferred policy versus reward difference
  • ...and 1 more figure

Theorems & Definitions (24)

  • Definition 1
  • Theorem 1
  • Theorem 2
  • proof
  • Theorem 3
  • proof
  • Lemma 1: Knaster-Tarski Fixed-Point Theorem (Tarski, 1955)
  • Lemma 2
  • proof
  • Lemma 3
  • ...and 14 more