Topology-Aware Revival for Efficient Sparse Training

Meiling Jin, Fei Wang, Xiaoyun Yuan, Chen Qian, Yuan Cheng

TL;DR

This work analyzes the brittleness of static sparse training (SST) under non-stationary data distributions in reinforcement learning. It introduces Topology-Aware Revival (TAR), a lightweight one-shot post-pruning procedure that allocates a small revival budget across layers using a topology proxy and a connectivity floor, then randomly revives a subset of previously pruned connections and keeps the resulting connectivity fixed for the remainder of training. The paper motivates TAR theoretically with a coverage bound for random revival and reports empirical gains of up to +$37.9\%$ over static baselines and a median gain of +$13.5\%$ over dynamic sparse training on SAC/TD3 tasks, with benefits that also hold when networks are widened. The approach preserves the simplicity and low overhead of SST while mitigating structural bottlenecks that arise from distribution drift, making static sparse training more robust and practical for non-stationary RL scenarios.
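To give a feel for why even a small random revival budget can help, the sketch below computes the chance that uniform random revival hits at least one "useful" pruned connection in a layer. This is an elementary hypergeometric calculation offered as an illustration, not the coverage bound stated in the paper; the function names and the notion of a known count of useful connections (`num_useful`) are assumptions made for the example.

```python
from math import comb

def coverage_probability(num_pruned, num_useful, budget):
    """Exact probability that at least one of `num_useful` pruned connections
    is among `budget` connections revived uniformly without replacement.
    All three quantities are hypothetical inputs for illustration."""
    missed = comb(num_pruned - num_useful, budget)  # ways to revive only useless ones
    total = comb(num_pruned, budget)                # all ways to pick the revived set
    return 1.0 - missed / total

def coverage_lower_bound(num_pruned, num_useful, budget):
    """Simpler lower bound 1 - (1 - u/D)^b, which follows because each
    factor (D - u - i) / (D - i) of the exact miss probability is at most 1 - u/D."""
    return 1.0 - (1.0 - num_useful / num_pruned) ** budget

exact = coverage_probability(num_pruned=1000, num_useful=20, budget=50)
bound = coverage_lower_bound(num_pruned=1000, num_useful=20, budget=50)
print(f"exact={exact:.3f}, lower bound={bound:.3f}")
```

Under these assumptions the miss probability shrinks geometrically in the budget, which is the intuition behind spending a small reserve on random revival.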

Abstract

Static sparse training is a promising route to efficient learning by committing to a fixed mask pattern, yet the constrained structure reduces robustness. Early pruning decisions can lock the network into a brittle structure that is difficult to escape, especially in deep reinforcement learning (RL) where the evolving policy continually shifts the training distribution. We propose Topology-Aware Revival (TAR), a lightweight one-shot post-pruning procedure that improves static sparsity without dynamic rewiring. After static pruning, TAR performs a single revival step by allocating a small reserve budget across layers according to topology needs, reactivating a few previously pruned connections uniformly at random within each layer, and then keeping the resulting connectivity fixed for the remainder of training. Across multiple continuous-control tasks with SAC and TD3, TAR improves final return over static sparse baselines by up to +37.9% and also outperforms dynamic sparse training baselines with a median gain of +13.5%.
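The one-shot revival step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the topology proxy used here (per-layer sparsity), the connectivity floor (at least `floor` revived connections per layer), and the names `allocate_budget` and `revive` are all assumptions, since the summary does not specify them.

```python
import numpy as np

def allocate_budget(masks, total_budget, floor=1):
    """Split a revival budget across layers.
    Assumed proxy: a layer's sparsity (fraction of pruned entries).
    Assumed floor: every layer receives at least `floor` revivals,
    so the floors alone may exceed a very small total budget."""
    pruned = np.array([int((m == 0).sum()) for m in masks])
    proxy = np.array([1.0 - m.mean() for m in masks])
    budget = np.minimum(floor, pruned)              # connectivity floor
    remaining = max(total_budget - int(budget.sum()), 0)
    if remaining > 0 and proxy.sum() > 0:
        extra = np.floor(remaining * proxy / proxy.sum()).astype(int)
        budget = np.minimum(budget + extra, pruned)  # cannot revive more than was pruned
    return budget

def revive(masks, total_budget, floor=1, seed=0):
    """Reactivate pruned connections uniformly at random within each layer,
    returning new masks that are then meant to stay fixed during training."""
    rng = np.random.default_rng(seed)
    budgets = allocate_budget(masks, total_budget, floor)
    new_masks = []
    for mask, b in zip(masks, budgets):
        new = mask.copy()
        pruned_idx = np.flatnonzero(new == 0)
        if b > 0:
            chosen = rng.choice(pruned_idx, size=int(b), replace=False)
            new.flat[chosen] = 1
        new_masks.append(new)
    return new_masks

# Toy binary masks standing in for the output of a static pruning criterion.
rng = np.random.default_rng(0)
masks = [(rng.random((8, 16)) > 0.8).astype(int),
         (rng.random((16, 4)) > 0.5).astype(int)]
revived = revive(masks, total_budget=10)
print([int(n.sum() - m.sum()) for m, n in zip(masks, revived)])
```

In this sketch the revived masks are supersets of the pruned masks, and no further rewiring happens afterwards, which mirrors the "revive once, then fix" structure described in the abstract.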

Paper Structure

This paper contains 41 sections, 12 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of structural bottlenecks in static sparse training. (A) Limited Viewpoint: Any single pruning metric (e.g., $M_1$ or $M_2$) captures a particular viewpoint of importance, extracting a partial sparse structure ($V_1$ or $V_2$). Consequently, the resulting sparse structure represents an incomplete projection of parameter importance rather than a "wrong" one. (B) Distribution shift in RL: The policy-induced data distribution can drift from the initial stage ($d_0^{\pi_\theta}$) to the final stage ($d_T^{\pi_\theta}$). The initial fixed sparse structure $\mathcal{K}_0$, optimized solely for the early distribution, may fail to cover the features required by the shifted distribution, leading to mismatch risk.
  • Figure 2: Revival mitigates distribution-shift-induced structural bottlenecks. (A) Early training ($t=0$): static pruning selects a fixed sparse structure $\mathcal{K}_0$ under the early visitation distribution $d_0^{\pi_\theta}$, leaving pruned connections $\mathcal{D}$ (dashed). (B) Later training ($t=T$): as the visitation distribution drifts to $d_T^{\pi_\theta}$, the fixed sparse structure $\mathcal{K}_0$ may miss connections that become useful, creating a structural bottleneck. TAR performs a topology-aware one-shot revival that allocates a small reserve budget across layers based on connectivity needs and randomly revives a few pruned connections (green). The mask is then kept fixed, improving coverage and restoring gradient routes with minimal overhead.
  • Figure 3: Humanoid-v4 (W=256): learning curves for TAR vs. Original (TD3/SAC) across static criteria. Each colored curve corresponds to a static criterion (Magnitude/ERK/SynFlow). Dashed lines are dense references; dotted lines are DST (SET/RigL) baselines.
  • Figure 4: RQ2 (TD3, Humanoid-v4): scaling to width 1024. Blue dashed lines are dense references (W=256/1024). At W=1024 and $\alpha=0.8$, markers compare Original (circle) vs. TAR (square) for each static criterion; DST (SET/RigL) is shown for context (diamond).
  • Figure 5: RQ4 (Humanoid-v4, TD3, Magnitude): sensitivity to recovery ratio $rr$ across sparsity levels $\alpha$. Each subplot fixes $\alpha$ and varies $rr\in\{0.000,0.005,0.010,0.020\}$ (categorical, uniformly spaced). Bars and overlaid lines report the mean final return over seeds.
  • ...and 2 more figures

Theorems & Definitions (1)

  • Remark 3.1: structure mismatch risk