OIL-AD: An Anomaly Detection Framework for Sequential Decision Sequences

Chen Wang, Sarah Erfani, Tansu Alpcan, Christopher Leckie

TL;DR

OIL-AD targets anomaly detection in sequential decision-making without access to rewards or environment dynamics by learning a Q function and a state-value function from normal trajectories using a transformer-based behavioural cloning framework. It introduces two core features, action optimality and sequential association, derived from the learned functions and enforced through an action-loss plus a monotonicity-loss (Spearman-based) objective, yielding robust online detection via a 2D latent space bounded by an Isolation Forest. Empirical results across real and simulated datasets show that OIL-AD outperforms a broad set of baselines, with strong improvements in F1 and clear separation between normal and anomalous trajectories in latent space. The approach is practical for real-world deployment due to its offline nature (no rewards or online environment access) and its ability to handle continuous state spaces and sequential dependencies. Future work will extend to continuous action spaces and hierarchical architectures to further enhance sequential reasoning and robustness.

Abstract

Anomaly detection in decision-making sequences is a challenging problem due to the complexity of normality representation learning and the sequential nature of the task. Most existing methods based on Reinforcement Learning (RL) are difficult to implement in the real world due to unrealistic assumptions, such as having access to environment dynamics, reward signals, and online interactions with the environment. To address these limitations, we propose an unsupervised method named Offline Imitation Learning based Anomaly Detection (OIL-AD), which detects anomalies in decision-making sequences using two extracted behaviour features: action optimality and sequential association. Our offline learning model is an adaptation of behavioural cloning with a transformer policy network, where we modify the training process to learn a Q function and a state value function from normal trajectories. We propose that the Q function and the state value function can provide sufficient information about agents' behavioural data, from which we derive two features for anomaly detection. The intuition behind our method is that the action optimality feature derived from the Q function can differentiate the optimal action from others at each local state, and the sequential association feature derived from the state value function has the potential to maintain the temporal correlations between decisions (state-action pairs). Our experiments show that OIL-AD can achieve outstanding online anomaly detection performance with up to 34.8% improvement in F1 score over comparable baselines.
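The two behaviour features described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of a mean Q-value gap as "action optimality", and the use of a Spearman rank correlation between time step and state value as "sequential association" are all assumptions made here for illustration, and the Q values and state values are hand-written stand-ins for the outputs of the learned transformer model.

```python
from typing import List, Sequence, Tuple


def ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / (vx * vy) ** 0.5


def behaviour_features(
    q_values: Sequence[Sequence[float]],  # hypothetical Q(s_t, .) per step
    actions: Sequence[int],               # index of the action taken per step
    state_values: Sequence[float],        # hypothetical V(s_t) per step
) -> Tuple[float, float]:
    # ASSUMPTION: action optimality = mean gap between the Q value of the
    # taken action and the best Q value at that state (0 means always optimal).
    gaps = [q[a] - max(q) for q, a in zip(q_values, actions)]
    action_optimality = sum(gaps) / len(gaps)
    # ASSUMPTION: sequential association = rank correlation between the
    # time index and the state value (1 means monotonically increasing).
    steps = list(range(len(state_values)))
    sequential_association = spearman(steps, state_values)
    return action_optimality, sequential_association


q = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
normal = behaviour_features(q, [1, 0, 1], [0.2, 0.5, 0.9])
anomalous = behaviour_features(q, [0, 1, 0], [0.9, 0.5, 0.2])
```

Under these assumptions, a trajectory that always takes the highest-valued action and whose state values rise over time sits in one corner of the two-dimensional feature space, while a trajectory that does neither lands far from it.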

Paper Structure (29 sections, 17 equations, 6 figures, 5 tables, 1 algorithm)

Figures (6)

  • Figure 1: A demonstration of our anomaly detection method. Each decision sequence in the detection window is transformed to a novel two-dimensional feature space: action optimality and sequential association, described in Section V.D Behaviour Features for Anomaly Detection.
  • Figure 2: Method overview. In the training stage, the model is updated based on the action loss and the monotonicity loss. These two training objectives correspond to the two features extracted for detection: action optimality and sequential association, respectively.
  • Figure 3: Examples of generated anomalous trajectories from the Chengdu dataset in a $50 \times 50$ grid map. The black arrows indicate normal behaviour and the red arrows indicate anomalous behaviour. $S$ and $D$ represent source and destination respectively.
  • Figure 4: Generated state values and Q values of one normal trajectory, one policy anomalous trajectory and one perturbed anomalous trajectory. Examples are from the Chengdu dataset.
  • Figure 5: 2-D latent space of three datasets. 3000 normal features (blue) are randomly sampled from the training dataset. 60 policy anomalous features (red) and 60 perturbed anomalous features (orange) are randomly sampled from anomalous trajectories. The visualization is consistent with the results in the main results table.
  • ...and 1 more figure
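
The detection stage that Figures 1 and 5 depict, where each decision sequence becomes a point in a two-dimensional feature space bounded by an Isolation Forest, can be sketched with scikit-learn. The feature values below are synthetic placeholders, not data from the paper, and the hyperparameters are library defaults rather than the authors' settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# HYPOTHETICAL normal features: column 0 stands in for action optimality,
# column 1 for sequential association. Values are synthetic.
normal_features = np.column_stack([
    rng.normal(-0.05, 0.03, size=500),
    rng.normal(0.90, 0.05, size=500),
])

# Fit the detector on features from normal trajectories only (unsupervised).
detector = IsolationForest(n_estimators=100, random_state=0)
detector.fit(normal_features)

# Two sequences from a detection window: one resembling the training data,
# one far from it in both features.
window = np.array([
    [-0.04, 0.92],
    [-0.80, -0.60],
])

labels = detector.predict(window)            # +1 = inlier, -1 = outlier
scores = detector.decision_function(window)  # lower = more anomalous
```

Because the forest is fitted only on normal features, no anomaly labels are needed at training time, which matches the unsupervised setting the abstract describes.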

Theorems & Definitions (2)

  • proof
  • proof