Markovletics: Methods and A Novel Application for Learning Continuous-Time Markov Chain Mixtures

Fabian Spaeh, Charalampos E. Tsourakakis

TL;DR

This work addresses learning mixtures of continuous-time Markov chains (CTMCs) from trail data, a challenging problem due to discretization and latent mixture assignment. It introduces a versatile three-stage framework—discretization, soft clustering, and recovery—that can operate with continuous or discretized observations and adapts to trail-length regimes. The authors provide theoretical guidance on discretization and sample complexity, develop three practical algorithmic instantiations (GKV-ST, dEM, KTT), and validate them on synthetic data plus real-world Last.fm and NBA datasets, including a novel CTMC-based Markovletics application for analyzing basketball strategies. The results highlight regime-specific performance, scalability, and robustness, and the work contributes a reproducible toolkit for CTMC mixture learning with potential impact in social data, finance, and sports analytics.

Abstract

Sequential data naturally arises from user engagement on digital platforms such as social media, music streaming services, and web navigation, capturing evolving user preferences and behaviors through continuous information streams. A notable open problem in stochastic processes is learning mixtures of continuous-time Markov chains (CTMCs). While there has been progress in learning mixtures of discrete-time Markov chains with recovery guarantees [GKV16,ST23,KTT2023], the continuous-time setting poses unique, unexplored challenges. Interest in CTMC mixtures stems from their potential to model intricate continuous-time stochastic processes prevalent in fields including social media, finance, and biology. In this study, we introduce a novel framework for exploring CTMCs, emphasizing how the length of the observed trails and the mixture parameters determine problem regimes, each of which demands specific algorithms. Through thorough experimentation, we examine the impact of discretizing continuous-time trails on the learnability of the continuous-time mixture, given that these processes are often observed via discrete, resource-demanding observations. Our comparative analysis with leading methods explores sample complexity and the trade-off between the number of trails and their lengths, offering crucial insights for method selection in different problem instances. We apply our algorithms to an extensive collection of Last.fm's user-generated trails spanning three years, demonstrating their capability to differentiate diverse user preferences. We pioneer the use of CTMC mixtures on a basketball passing dataset to unveil intricate offensive tactics of NBA teams. This underscores the practical utility and versatility of our proposed framework. All results presented in this study are replicable, and we provide the implementations to facilitate reproducibility.

Paper Structure

This paper contains 43 sections, 11 theorems, 58 equations, 9 figures, 1 table, 1 algorithm.

Key Result

Lemma 1

Let $0<\epsilon_{\mathrm h}<1$ and fix a state $y$ and chain $\ell \in [L]$. With $c^\ell_y=\Omega(\epsilon_{\mathrm h}^{-2}\log(Ln))$ transitions, our estimator $\hat{q}^\ell_y$ for the holding time satisfies $|\hat{q}_y^\ell-q_y^\ell|\le\epsilon_{\mathrm h} q_y^\ell$ with high probability.
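
To make the statement of Lemma 1 concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that $q_y^\ell$ is the exit rate of an exponentially distributed holding time in state $y$, and that the estimator is the standard maximum-likelihood one (number of observed transitions divided by total time held); the summary above does not spell out the paper's estimator. The function names, the constant `8.0` hidden in the $\Omega(\cdot)$, and the values of `true_rate`, `eps`, `num_chains`, and `num_states` are all hypothetical choices for illustration.

```python
import math
import random


def estimate_exit_rate(holding_times):
    """MLE of an exponential rate: number of observations / total time held.

    Assumption: this stands in for the estimator \\hat{q}_y^l of Lemma 1.
    """
    total = sum(holding_times)
    if total <= 0:
        raise ValueError("need a positive total holding time")
    return len(holding_times) / total


def transitions_needed(eps, num_chains, num_states, constant=8.0):
    """Illustrative transition count c = constant * eps^-2 * log(L * n).

    The constant is an assumption; the lemma only states the asymptotic order.
    """
    return math.ceil(constant * math.log(num_chains * num_states) / eps ** 2)


rng = random.Random(0)          # fixed seed for reproducibility
true_rate = 2.5                 # hypothetical q_y^l
eps = 0.1                       # target relative error eps_h
c = transitions_needed(eps, num_chains=2, num_states=10)

# Simulate c holding times in state y of chain l and estimate the rate.
samples = [rng.expovariate(true_rate) for _ in range(c)]
q_hat = estimate_exit_rate(samples)
rel_err = abs(q_hat - true_rate) / true_rate

print(f"transitions used: {c}")
print(f"estimated rate:   {q_hat:.4f}")
print(f"relative error:   {rel_err:.4f} (target {eps})")
```

The relative standard deviation of this estimator shrinks roughly like $1/\sqrt{c}$, which is why the required number of transitions scales with $\epsilon_{\mathrm h}^{-2}$; the $\log(Ln)$ factor is what would allow a union bound over all $L$ chains and $n$ states.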

Figures (9)

  • Figure 1: Recovery error across different trail lengths: The plot illustrates two distinct scenarios: (a) A large number of transitions with shorter trails, and (b) a small number of transitions with long trails.
  • Figure 3: Running times for varying number of states $n$ (a), chains $L$ (b), and varying trail length (c).
  • Figure 5: Recovery error for different discretization rates $\tau$: (a) 20 samples with 25-length trails and (b) 100 samples with 200-length trails.
  • Figure 7: Sample complexity for a varying number of samples (a) and a varying discretization rate $\tau$ (b).
  • Figure 9: Classification error and assignment entropy on the Last.fm dataset.
  • ...and 4 more figures

Theorems & Definitions (18)

  • Lemma 1
  • Definition 1: Bad transition
  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • proof
  • Lemma 2
  • ...and 8 more