Event prediction and causality inference despite incomplete information

Harrison Lam, Yuanjie Chen, Noboru Kanazawa, Mohammad Chowdhury, Anna Battista, Stephan Waldert

TL;DR

The paper tackles predicting and explaining events in sequences when the triggering pattern is unknown and may be non-consecutive or obscured by hidden states. It combines analytical derivations of trigger-probability bounds with simulations and an attention-based ML model to identify triggers from sequences, enabling event prediction even with incomplete information. Key contributions include generalized formulas for the probability of trigger presence ($p(n)$, $P(n)$, and their extensions $p^g(n)$, $P^g(n)$), window-size and data-size prescriptions, and a deep learning pipeline that uses multiple embeddings and attention to extract the true trigger. The work provides a principled framework for complexity assessment, data planning, and interactive probing for root-cause analysis across domains such as genomics, hardware/software verification, and financial time series.

Abstract

We explored the challenge of predicting and explaining the occurrence of events within sequences of data points. We focused in particular on scenarios in which the unknown triggers causing events may consist of non-consecutive, masked, noisy data points. This scenario is akin to an agent tasked with learning to predict and explain the occurrence of events without understanding the underlying processes or having access to crucial information. Such scenarios are encountered across various fields, such as genomics, hardware and software verification, and financial time series prediction. We combined analytical, simulation, and machine learning (ML) approaches to investigate, quantify, and provide solutions to this challenge. We deduced and validated equations generally applicable to any variation of the underlying challenge. Using these equations, we (1) described how the level of complexity changes with various parameters (e.g., number of apparent and hidden states, trigger length, confidence) and (2) quantified the data needed to successfully train an ML model. We then (3) proved that our ML solution learns and subsequently identifies unknown triggers and predicts the occurrence of events. If the complexity of the challenge is too high, our ML solution can identify trigger candidates that can be used to interactively probe the system under investigation and determine the true trigger considerably more efficiently than brute-force methods. By sharing our findings, we aim to assist others grappling with similar challenges by enabling them to estimate the complexity of their problem and the data required, and by offering a solution.
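To make the idea of a non-consecutive trigger concrete, the following is a minimal sketch of how one might estimate, by simulation, the probability that a trigger appears in order (gaps allowed) somewhere in a random window of length n. It is an illustration under stated assumptions, not the paper's method: states are assumed to be drawn independently and uniformly, hidden states are ignored, and the three-symbol alphabet and all function names are hypothetical. Under exactly these assumptions there is a standard binomial closed form, which the sketch uses as a cross-check; it is not claimed to coincide with the paper's $p(n)$ or $P(n)$.

```python
import random
from math import comb


def contains_subsequence(seq, trigger):
    """Return True if `trigger` occurs in `seq` in order, with gaps allowed."""
    it = iter(seq)
    # `symbol in it` consumes the iterator up to the first match,
    # so each trigger symbol must be found after the previous one.
    return all(symbol in it for symbol in trigger)


def simulate_trigger_probability(trigger, alphabet, n, trials=20000, seed=0):
    """Monte Carlo estimate for i.i.d. uniform sequences of length n (assumption)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        seq = [rng.choice(alphabet) for _ in range(n)]
        hits += contains_subsequence(seq, trigger)
    return hits / trials


def binomial_trigger_probability(trigger_length, alphabet_size, n):
    """Closed form under the same assumptions: P(Binomial(n, 1/k) >= l).

    Greedy matching advances by one trigger symbol with probability 1/k
    at every position, independently of which symbol is needed next.
    """
    q = 1.0 / alphabet_size
    return sum(
        comb(n, j) * q**j * (1 - q) ** (n - j)
        for j in range(trigger_length, n + 1)
    )


if __name__ == "__main__":
    alphabet = ["L", "S", "X"]  # hypothetical apparent states
    for n in (3, 10, 20):
        simulated = simulate_trigger_probability("LLS", alphabet, n)
        closed_form = binomial_trigger_probability(3, len(alphabet), n)
        print(n, round(simulated, 3), round(closed_form, 3))
```

In this simplified setting the closed form depends only on the trigger length and the number of states, not on which symbols make up the trigger. Adding hidden states or noise would require extending both the simulation and the formula.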

Paper Structure (28 sections, 15 equations, 8 figures, 1 table)

Figures (8)

  • Figure 1: Left: $p(n)$ (any apparent state and any particular hidden state) vs. simulations: numerical confirmation that $p(n)$ is correct and that simulations return the same results for different types of triggers. Right: $P(n)$ (any apparent state and any hidden but same state) vs. simulations: numerical confirmation that $P_t(n)$ can serve as an estimate as well as a lower bound of $P(n)$. The observation is consistent with Equation \ref{eq:4}.
  • Figure 2: The probabilities $p(n)$ and $P(n)$ against different sequence lengths. For the same sequence length, $P(n)\geq p(n)$.
  • Figure 3: Left: the probability of 'LLS' occurring in a random sequence Q, plotted against $n$. Right: the generalised formula, with the window length fixed at $n=50$ and with $h=4$ and $l=3$. Even with a large window length, the sequence quickly becomes very difficult to solve.
  • Figure 4: The model architecture for trigger identification.
  • Figure 5: Results of applying the ML model to all four types of trigger sequences (scenarios in Section \ref{ml_experiment}). The y-axis shows how often the model paid the highest attention to each potential trigger sequence indicated on the x-axis. The highest counts were always obtained for the actual trigger sequence.
  • ...and 3 more figures