Dynamic Reward Scaling for Multivariate Time Series Anomaly Detection: A VAE-Enhanced Reinforcement Learning Approach

Bahareh Golchin, Banafsheh Rekabdar

TL;DR

DRSMT tackles multivariate time series anomaly detection under limited labeled data by fusing a Variational Autoencoder, an LSTM‑driven DQN, dynamic reward scaling, and an active learning loop. The VAE learns compact latent representations and provides a reconstruction‑based intrinsic reward $R_2$ that complements the classification reward $R_1$, with a dynamic coefficient $\lambda(t)$ balancing exploration and exploitation via $R_{total}=R_1+\lambda(t)R_2$. The LSTM‑DQN makes sequential binary decisions on normal vs. anomalous windows, guided by the unified reward and enriched by uncertainty‑driven labeling. Experiments on SMD and WADI demonstrate state‑of‑the‑art F1 and AU‑PR, confirming the practicality of combining generative modeling, reinforcement learning, and selective supervision for scalable industrial anomaly detection. The method shows strong precision on high‑dimensional sensor data and demonstrates how adaptive rewards and minimal labeling can reduce supervision while maintaining accuracy.
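
The reward combination $R_{total}=R_1+\lambda(t)R_2$ can be illustrated with a minimal sketch. The function names and the linear decay schedule for $\lambda(t)$ are assumptions made for illustration only; the summary above says the coefficient is dynamic but does not state its update rule.

```python
def lambda_schedule(step: int, total_steps: int,
                    lam_start: float = 1.0, lam_end: float = 0.1) -> float:
    """Hypothetical linear decay of lambda(t) from lam_start to lam_end.

    Assumption: the source does not specify how lambda(t) is updated,
    so this schedule is a stand-in, not the authors' rule.
    """
    if total_steps <= 1:
        return lam_end
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return lam_start + frac * (lam_end - lam_start)


def total_reward(r1: float, r2: float, lam: float) -> float:
    """Combine rewards as R_total = R_1 + lambda(t) * R_2."""
    return r1 + lam * r2


if __name__ == "__main__":
    # Early in training the reconstruction term weighs more; later it weighs less.
    for step in (0, 5, 10):
        lam = lambda_schedule(step, total_steps=11)
        print(step, round(lam, 3), round(total_reward(1.0, -0.5, lam), 3))
```

Under this assumed schedule, a larger early $\lambda(t)$ lets the reconstruction signal $R_2$ drive exploration, while the classification reward $R_1$ dominates as training proceeds.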

Abstract

Detecting anomalies in multivariate time series is essential for monitoring complex industrial systems, where high dimensionality, limited labeled data, and subtle dependencies between sensors pose significant challenges. This paper presents a deep reinforcement learning framework that combines a Variational Autoencoder (VAE), an LSTM-based Deep Q-Network (DQN), dynamic reward shaping, and an active learning module to address these issues jointly. The main contribution is the implementation of Dynamic Reward Scaling for Multivariate Time Series Anomaly Detection (DRSMT), which demonstrates how each component enhances the detection process. The VAE captures compact latent representations and reduces noise. The DQN enables adaptive, sequential anomaly classification, and the dynamic reward shaping balances exploration and exploitation during training by adjusting the relative importance of the reconstruction and classification signals. In addition, active learning identifies the most uncertain samples for labeling, reducing the need for extensive manual supervision. Experiments on two multivariate benchmarks, the Server Machine Dataset (SMD) and the Water Distribution Testbed (WADI), show that the proposed method outperforms existing baselines in F1-score and AU-PR. These results highlight the effectiveness of combining generative modeling, reinforcement learning, and selective supervision for accurate and scalable anomaly detection in real-world multivariate systems.
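
The active learning step described above, querying the most uncertain samples, can be sketched as follows. The uncertainty measure is an assumption: here a window counts as uncertain when the agent's two Q-values (normal vs. anomalous) are close together. The abstract does not say which criterion DRSMT uses, and `select_most_uncertain` is a hypothetical helper name.

```python
from typing import List, Sequence, Tuple


def select_most_uncertain(q_values: Sequence[Tuple[float, float]],
                          budget: int) -> List[int]:
    """Return indices of the `budget` windows with the smallest Q-value margin.

    Assumption: uncertainty = |Q(normal) - Q(anomalous)|; a small margin
    means the agent has no strong preference between the two actions.
    """
    margins = [
        (abs(q_normal - q_anomalous), idx)
        for idx, (q_normal, q_anomalous) in enumerate(q_values)
    ]
    margins.sort()  # smallest margin (most uncertain) first
    return [idx for _, idx in margins[:budget]]


if __name__ == "__main__":
    # Four windows with (Q_normal, Q_anomalous) pairs; values are made up.
    q = [(2.0, 0.1), (0.5, 0.6), (1.0, 1.0), (0.0, 3.0)]
    print(select_most_uncertain(q, budget=2))
```

Only the selected windows would be sent for human labeling, which is how a fixed labeling budget keeps manual supervision small.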

Paper Structure

This paper contains 21 sections, 10 equations, 3 figures, 2 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Workflow of our proposed method (DRSMT). A multivariate $N_{\mathrm{steps}}\times M$ sliding window (with $M$ sensor channels) is fed in parallel to: 1) a VAE that produces a reconstruction error penalty $R_{2}$, and 2) an LSTM‐based DQN that outputs classification rewards $R_{1}$. The Dynamic Reward module combines $R_{1}$ and $R_{2}$ with an adaptive coefficient $\lambda(t)$, which is automatically updated during training to trade off exploration (novelty via reconstruction error) and exploitation (correct classification). An Active Learning loop queries the most uncertain windows for human labeling, closing the loop with minimal labeled data.
  • Figure 2: Relationship between the dynamic coefficient and reward evolution during training.
  • Figure 3: Example visualizations of SMD anomaly detection results from the proposed method.
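
The $N_{\mathrm{steps}}\times M$ sliding window mentioned in the Figure 1 caption can be sketched in a few lines. The function name, the `stride` parameter, and the list-of-rows data layout are assumptions for illustration; the caption specifies only the window shape.

```python
from typing import List, Sequence


def sliding_windows(series: Sequence[Sequence[float]],
                    n_steps: int, stride: int = 1) -> List[List[Sequence[float]]]:
    """Cut a (T x M) multivariate series into windows of shape (n_steps x M).

    `series` is a sequence of T rows, each holding M sensor readings.
    Returns an empty list when the series is shorter than one window.
    """
    if n_steps <= 0 or stride <= 0:
        raise ValueError("n_steps and stride must be positive")
    last_start = len(series) - n_steps
    return [
        list(series[start:start + n_steps])
        for start in range(0, last_start + 1, stride)
    ]


if __name__ == "__main__":
    # Toy series: T = 5 time steps, M = 2 sensor channels.
    data = [[0.0, 1.0], [0.1, 1.1], [0.2, 1.2], [0.3, 1.3], [0.4, 1.4]]
    windows = sliding_windows(data, n_steps=3)
    print(len(windows), len(windows[0]), len(windows[0][0]))
```

Each resulting window would be passed to both the VAE and the LSTM-based DQN, as the caption describes.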