Pi-Transformer: A prior-informed dual-attention model for multivariate time-series anomaly detection

Sepehr Maleki, Negar Pourmoazemi

Abstract

Anomalies in multivariate time series often arise from temporal context and cross-channel coordination rather than isolated outliers. We present Pi-Transformer (Prior-Informed Transformer), a transformer with two attention pathways: data-driven series attention and a smoothly evolving prior attention that encodes temporal invariants such as scale-related self-similarity and phase synchrony. The prior provides an amplitude-insensitive temporal reference that calibrates reconstruction error. During training, we pair a reconstruction objective with a divergence term that encourages agreement between the two attentions while keeping them meaningfully distinct. The prior is regularised to evolve smoothly and is lightly distilled towards dataset-level statistics. At inference, the model combines an alignment-weighted reconstruction signal (Energy) with a mismatch signal that highlights timing and phase disruptions, and fuses them into a single score for detection. Across five benchmarks (SMD, MSL, SMAP, SWaT, and PSM), Pi-Transformer achieves state-of-the-art or highly competitive F1, with particular strength on timing and phase-breaking anomalies. Case analyses show complementary behaviour of the two streams and interpretable detections around regime changes. Embedding prior attention into transformer scoring yields a calibrated and robust approach to anomaly detection in complex multivariate systems.

Paper Structure

This paper contains 12 sections, 22 equations, 9 figures, 7 tables, 2 algorithms.

Figures (9)

  • Figure 1: The Pi-Transformer architecture.
  • Figure 2: Canonical anomaly types. Each column shows (top to bottom): observed series with ground-truth anomalies (red), series–prior mismatch $\Delta_i$ (raw), Energy $e_i=w_ir_i$ (raw), and the fused score $f_i=\max(\widetilde{e}_i,\widetilde{d}_i)$ (unit-scaled). A single per-dataset threshold $\eta_{\mathrm{thr}}$ (dashed) is applied across all types. Orange shading marks detected regions where $f_i>\eta_{\mathrm{thr}}$. Amplitude or shape anomalies (point, contextual, collective) drive Energy spikes with low $\Delta$, while timing and phase anomalies (seasonal, trend) elevate $\Delta$ at breakpoints even when Energy is muted.
  • Figure 3: Illustrative mechanism around a phase-breaking anomaly. (a) Time series with anomalous interval (red). (b–e) Prior vs. series attentions in nominal and anomalous regimes (zoomed). (f) Series–prior mismatch $\Delta_i$, (g) Energy $e_i=w_ir_i$, and (h) fused score $f_i=\max(\widetilde{e}_i,\widetilde{d}_i)$. $\Delta$ spikes at the onset (alignment collapse), while Energy peaks immediately before or after the onset, where error is high but alignment is non-zero. The fused score rises as the anomaly enters the rolling window, providing robust detection.
  • Figure 4: PSM example (input window). Twenty-five observed channels within a representative window from the PSM dataset. Two ground-truth anomalous segments are shaded. Several channels show clear deviations aligned with the later regime, while others remain near baseline.
  • Figure 5: PSM example (attention maps). Prior and series attention under a clean window (left) and an anomalous window (middle), with the signed difference (anomaly$-$clean; right). Top row: prior attention. Bottom row: series attention. The anomalous regime induces a clearer reallocation in the series pathway, while the prior remains comparatively stable.
  • ...and 4 more figures
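
The scoring rule stated in the captions of Figures 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes the alignment weights $w_i$, reconstruction errors $r_i$, and mismatch values $\Delta_i$ are already computed, and it interprets "unit-scaled" as min-max scaling to $[0, 1]$, which the captions do not specify. The function names `unit_scale`, `fused_score`, and `detect` are hypothetical.

```python
from typing import List, Sequence


def unit_scale(values: Sequence[float]) -> List[float]:
    """Min-max scale to [0, 1] (assumed meaning of "unit-scaled").

    A constant series is mapped to all zeros to avoid division by zero.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fused_score(
    w: Sequence[float], r: Sequence[float], delta: Sequence[float]
) -> List[float]:
    """Compute f_i = max(e~_i, d~_i), where e_i = w_i * r_i is Energy."""
    energy = [wi * ri for wi, ri in zip(w, r)]
    e_scaled = unit_scale(energy)  # e~_i
    d_scaled = unit_scale(delta)   # d~_i
    return [max(e, d) for e, d in zip(e_scaled, d_scaled)]


def detect(scores: Sequence[float], eta_thr: float) -> List[bool]:
    """Flag time steps where the fused score exceeds the threshold."""
    return [s > eta_thr for s in scores]


# Toy inputs: step 1 has high reconstruction error (amplitude anomaly),
# step 2 has high series-prior mismatch but muted Energy (timing anomaly).
w = [1.0, 1.0, 0.5, 1.0]
r = [0.1, 0.9, 0.2, 0.1]
delta = [0.0, 0.1, 1.0, 0.0]
flags = detect(fused_score(w, r, delta), eta_thr=0.5)
```

With these toy inputs, the Energy term flags step 1 and the mismatch term flags step 2. Taking the maximum lets either stream trigger a detection on its own, which matches the complementary behaviour the captions describe.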