Event prediction and causality inference despite incomplete information
Harrison Lam, Yuanjie Chen, Noboru Kanazawa, Mohammad Chowdhury, Anna Battista, Stephan Waldert
TL;DR
The paper tackles predicting and explaining events in sequences when the triggering pattern is unknown and may be non-consecutive or obscured by hidden states. It combines analytical derivations of trigger-probability bounds with simulations and an attention-based ML model to identify triggers from sequences, enabling event prediction even with incomplete information. Key contributions include generalized formulas for the probability of trigger presence ($p(n)$, $P(n)$, and their extensions $p^g(n)$, $P^g(n)$), window-size and data-size prescriptions, and a deep learning pipeline that uses multiple embeddings and attention to extract the true trigger. The work provides a principled framework for complexity assessment, data planning, and interactive probing for root-cause analysis across domains such as genomics, hardware/software verification, and financial time series.
Abstract
We explored the challenge of predicting and explaining the occurrence of events within sequences of data points. Our focus was particularly on scenarios in which the unknown triggers causing events may consist of non-consecutive, masked, or noisy data points. This scenario is akin to an agent tasked with learning to predict and explain the occurrence of events without understanding the underlying processes or having access to crucial information. Such scenarios are encountered across various fields, such as genomics, hardware and software verification, and financial time series prediction. We combined analytical, simulation, and machine learning (ML) approaches to investigate, quantify, and provide solutions to this challenge. We deduced and validated equations generally applicable to any variation of the underlying challenge. Using these equations, we (1) described how the level of complexity changes with various parameters (e.g., number of apparent and hidden states, trigger length, and confidence) and (2) quantified the data needed to successfully train an ML model. We then (3) showed that our ML solution learns and subsequently identifies unknown triggers and predicts the occurrence of events. If the complexity of the challenge is too high, our ML solution can identify trigger candidates that can be used to interactively probe the system under investigation and determine the true trigger considerably more efficiently than brute-force methods. By sharing our findings, we aim to assist others grappling with similar challenges by enabling them to estimate the complexity of their problem and the data required, and by offering a solution.
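To make the setting concrete, the following is a minimal toy sketch of a non-consecutive trigger, not the method or the equations of the paper. It assumes a trigger can be written as a mapping from relative offsets to required symbols, that an event fires at the position of the last trigger element, and that symbols are drawn uniformly at random; the names `find_events` and `chance_hit_rate` are hypothetical. The second function estimates by simulation how often a random window of length `n` contains the trigger purely by chance.

```python
import random


def find_events(seq, trigger):
    """Return the indices at which an event fires.

    `trigger` maps relative offsets to required symbols, e.g. {0: "A", 2: "C"}
    means "A", then any symbol, then "C" (a non-consecutive trigger).
    Assumption: the event is placed at the last trigger position.
    """
    span = max(trigger)
    events = []
    for start in range(len(seq) - span):
        if all(seq[start + off] == sym for off, sym in trigger.items()):
            events.append(start + span)
    return events


def chance_hit_rate(trigger, alphabet, n, trials=20000, seed=0):
    """Monte Carlo estimate of the fraction of uniformly random windows of
    length n that contain the trigger at least once (toy assumption:
    independent, uniformly distributed symbols)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        window = [rng.choice(alphabet) for _ in range(n)]
        if find_events(window, trigger):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    trigger = {0: "A", 2: "C"}
    print(find_events(list("ABCAXC"), trigger))
    for n in (3, 10, 30):
        print(n, chance_hit_rate(trigger, "ABCD", n))
```

In this toy setup, the chance of a spurious match grows with the window length, which illustrates why window size and the amount of data matter when trying to isolate the true trigger.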
