Inference of dynamic hypergraph representations in temporal interaction data

Alec Kirkley

TL;DR

This work addresses how to represent temporal interaction data between two item categories as a sequence of temporal hypergraph snapshots, with the temporal windows selected automatically using the minimum description length (MDL) principle. It proposes a three-part encoding with total description length $\mathcal{L}_{\text{total}}(\mathcal{X},\bm{\tau}) = \mathcal{L}_1 + \mathcal{L}_2 + \mathcal{L}_3$ and solves for MDL-optimal hypergraph snapshots with an exact dynamic programming algorithm and a fast greedy method. Demonstrations on synthetic data show recovery of planted hypergraph structure under noise, and applications to NYC FourSquare checkins reveal interpretable patterns of human mobility and activity localization. The approach provides a principled, data-driven framework for nonparametric summarization of higher-order temporal interactions, with potential extensions to additional structural regularities and Bayesian formulations.
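
As a rough illustration of how an exact dynamic program can search over contiguous temporal windows, the sketch below segments $T$ timesteps so as to minimize a sum of per-bin costs. The function names and, importantly, the per-bin cost `bin_cost` are assumptions made for this example: the cost is a simple entropy-style stand-in, not the paper's $\mathcal{L}_{\text{total}}$, whose terms are not reproduced in this summary.

```python
import math
from collections import Counter

def bin_cost(events):
    """Stand-in code length (in nats) for one bin. NOT the paper's MDL objective."""
    n = len(events)
    if n == 0:
        return 0.0
    counts = Counter((s, d) for s, d, _ in events)
    # Cost of naming the (source, destination) pair of every event in the bin...
    entropy_nats = -sum(c * math.log(c / n) for c in counts.values())
    # ...plus a crude per-pair charge so that extra bins are not free (assumption).
    return entropy_nats + len(counts) * math.log(n + 1)

def optimal_contiguous_bins(events_by_step, cost=bin_cost):
    """Exact DP over T timesteps: returns (total cost, list of (start, end) step ranges)."""
    T = len(events_by_step)
    best = [0.0] + [math.inf] * T  # best[j]: minimum cost of segmenting steps [0, j)
    prev = [0] * (T + 1)           # prev[j]: start of the last bin in that optimum
    for j in range(1, T + 1):
        pooled = []
        for i in range(j - 1, -1, -1):
            pooled = events_by_step[i] + pooled  # events in steps [i, j)
            candidate = best[i] + cost(pooled)
            if candidate < best[j]:
                best[j], prev[j] = candidate, i
    bounds, j = [], T
    while j > 0:  # backtrack from the final step to recover the bin boundaries
        bounds.append((prev[j], j))
        j = prev[j]
    return best[T], bounds[::-1]

# Toy data: three steps of activity at venue "x", then three steps at venue "y".
steps = [[("a", "x", t), ("b", "x", t)] for t in range(3)]
steps += [[("c", "y", t), ("d", "y", t)] for t in range(3, 6)]
total, bounds = optimal_contiguous_bins(steps)
print(bounds)
```

Because the total cost is additive over bins, the best segmentation of the first $j$ steps can be built from the best segmentation of a shorter prefix, which is what makes the search exact. This naive version re-evaluates the cost for every candidate window, so it is meant for reading rather than for large $T$.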

Abstract

A range of systems across the social and natural sciences generate datasets consisting of interactions between two distinct categories of items at various instants in time. Online shopping, for example, generates purchasing events of the form (user, product, time of purchase), and mutualistic interactions in plant-pollinator systems generate pollination events of the form (insect, plant, time of pollination). These datasets can be meaningfully modeled as temporal hypergraph snapshots in which multiple items within one category (e.g. online shoppers) share a hyperedge if they interacted with a common item in the other category (e.g. purchased the same product) within a given time window, allowing for the application of hypergraph analysis techniques. However, it is often unclear how to choose the number and duration of these temporal snapshots, which have a strong influence on the final hypergraph representations. Here we propose a principled nonparametric solution to this problem by extracting temporal hypergraph snapshots that optimally capture structural regularities in temporal event data according to the minimum description length principle. We demonstrate our methods on real and synthetic datasets, finding that they can recover planted artificial hypergraph structure in the presence of considerable noise and reveal meaningful activity fluctuations in human mobility data.

Paper Structure (10 sections, 14 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: Diagram of hypergraph binning method. (a) Data set $\mathcal{X}$, consisting of $N=10$ events $\bm{x}_i=(s_i,d_i,t_i)$ involving a "source" $s_i\in\mathcal{S}$ and "destination" $d_i\in \mathcal{D}$ interacting at time $t_i$. $\mathcal{X}$ may, for example, be used to examine co-location patterns from user-location data or co-purchasing patterns among consumers in recommendation systems analysis. Time is discretized into $T$ timesteps to allow for data compression at a desired temporal resolution $\Delta t=(t_N-t_1)/T$. (b) Hypergraphs $\mathcal{G}=\{\bm{G}^{(1)},\bm{G}^{(2)}\}$ extracted from partitioning the events $\mathcal{X}$ into $K=2$ clusters $\mathcal{C}=\{\{\bm{x}_1,...,\bm{x}_6\},\{\bm{x}_7,...,\bm{x}_{10}\}\}$ with localized activity patterns. The inferred weighted hypergraphs $\bm{G}^{(k)}$ are shown in both their incidence (bipartite) representation and their standard representation, with sources $s$ mapped to nodes and destinations $d$ mapped to hyperedges. (c) Three-stage information transmission process used to design a minimum description length objective (Eq. \ref{eq:Ltotal}) to infer the hypergraphs $\mathcal{G}$ from event data $\mathcal{X}$. The data $\mathcal{X}$ is transmitted at increasing levels of granularity, and the optimal hypergraphs $\mathcal{G}$ (constructed using clusters $\mathcal{C}$ of events) are selected as those that minimize the description length of the transmission process.
  • Figure 2: Synthetic reconstruction performance. (a) Average inverse compression ratio (Eq. \ref{eq:compratio}) versus the logarithm of the planted level of cluster heterogeneity $\gamma$, for $N\in\{200,500,1000\}$ (line colors in red, blue, and yellow respectively). The exact dynamic programming algorithm results are shown with solid lines and circular markers, while the greedy algorithm results are shown with dotted lines and triangular markers. (b) Reconstruction accuracy, as quantified by the contiguity-corrected AMI (CCAMI, Eq. \ref{eq:CCAMI}), over the same set of experiments. Averages for each panel are taken over 30 simulations with the parameters $\{S,D,K,T\}$ described in Sec. \ref{sec:reconstruction}, and error bars represent 3 standard errors in the mean.
  • Figure 3: Reconstruction parameter sensitivities. (a) Average run time (in seconds) of reconstruction experiments (Fig. \ref{fig:reconstruction1}) versus number of time steps $T$, for both algorithms described in Sec. \ref{sec:optimization}. The performance of the exact dynamic programming algorithm is shown on the left axis, while that of the greedy algorithm is shown on the right axis. Regression lines of the form $\log (\text{Runtime})=\beta_1 \log (T) + \beta_2$, labeled with their least-squares estimates for the exponent $\hat{\beta}_1$, are shown as dotted lines. (b) Inverse compression ratio $\eta$ (Eq. \ref{eq:compratio}) versus $T$ for the experiments conducted at different values of $N$. Averages for each panel are taken over 30 simulations with each combination of the parameters $\{S,D,K,\gamma\}$ described in Sec. \ref{sec:reconstruction} (the averages in panel (a) also allow $N$ to vary), and error bars represent 3 standard errors in the mean.
  • Figure 4: FourSquare checkins in NYC neighborhoods. The dataset, which aggregated checkins from April 2012 to February 2013 in New York City \cite{yang2014modelingfoursquareNYC}, consists of events $\bm{x}_i=(s_i,d_i,t_i)\in \mathcal{X}$ that denote a FourSquare checkin by a user $s_i$ at venue $d_i$ at time $t_i$. (a) Inferred hypergraphs for the Bay Terrace neighborhood, for which our method resulted in the highest level of data compression ($\eta = 0.68$). The hypergraphs are ordered chronologically left to right and shown in their incidence representation, with the width of edge $(s,d)$ proportional to the edge weight $G_{sd}^{(k)}$, which counts the number of events that contain user $s$ and venue $d$ in the time window. Source and destination nodes are scaled proportionally to their frequency of occurrence and labelled by unique user and venue ids respectively for each neighborhood. (b) Inferred hypergraph for Melrose, for which our method resulted in the lowest level of data compression ($\eta = 1$). (c) Histogram of the temporal event gap ratio $\alpha$ (Eq. \ref{eq:alpha}) for all neighborhoods with $K>1$. (d) Histogram of the edge Jensen-Shannon Divergence $\text{JSD}_{\text{Edges}}$ (Eq. \ref{eq:JSDnorm}) for all neighborhoods with $K>1$, with mean indicated using the dotted black line. (e) Fraction of all events (blue) and inferred temporal bin boundaries (red) that took place within each month, across all neighborhoods.
  • Figure 5: FourSquare checkins across all of NYC. (a) Binnings obtained when applying the exact dynamic programming method (top plot) and greedy agglomerative method (second plot) of Sec. \ref{sec:optimization} to the set of checkins aggregated across all neighborhoods in NYC, with the number of checkins for each day of the study plotted as a solid black line underneath. Colors distinguish the $K=4$ different temporal bins inferred by each of these algorithms. The bottom two plots show the partitions obtained by naïvely partitioning the events into $K=4$ time windows of equal duration and into time windows with an equal number of events (third and fourth plots respectively). (b) CCAMI matrix among all pairs of the four partitions shown in panel (a). (c) Table of summary statistics for the partitions in panel (a).
  • ...and 6 more figures
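
The construction described in the captions of Figures 1 and 4 can be sketched in a few lines: timestamps are discretized into $T$ steps of width $\Delta t=(t_N-t_1)/T$, events are grouped into contiguous bins, and each bin yields edge weights $G_{sd}^{(k)}$ counting the events that contain source $s$ and destination $d$. The helper names and the `boundaries` argument (step indices at which each bin starts) are assumptions made for this illustration, not the paper's interface.

```python
from collections import Counter, defaultdict

def timestep_index(t, t_first, t_last, T):
    """Map a timestamp to one of T equal-width steps of size (t_last - t_first) / T."""
    if t_last == t_first:
        return 0
    step = int((t - t_first) / ((t_last - t_first) / T))
    return min(step, T - 1)  # the latest event falls in the final step

def hypergraph_snapshots(events, boundaries, T):
    """events: list of (s, d, t). boundaries: sorted start steps of each bin, beginning with 0."""
    times = [t for _, _, t in events]
    t_first, t_last = min(times), max(times)
    weights = [Counter() for _ in boundaries]  # weights[k][(s, d)] plays the role of G_sd^(k)
    for s, d, t in events:
        step = timestep_index(t, t_first, t_last, T)
        k = max(i for i, b in enumerate(boundaries) if b <= step)
        weights[k][(s, d)] += 1
    hyperedges = []
    for w in weights:
        members = defaultdict(set)  # destination d -> set of sources, i.e. one hyperedge
        for s, d in w:
            members[d].add(s)
        hyperedges.append(dict(members))
    return weights, hyperedges

# Toy checkin data: (user, venue, time), split into two bins starting at steps 0 and 5.
events = [("u1", "v1", 0.0), ("u2", "v1", 1.0), ("u1", "v1", 2.0),
          ("u3", "v2", 8.0), ("u1", "v2", 10.0)]
weights, hyperedges = hypergraph_snapshots(events, boundaries=[0, 5], T=10)
print(weights[0][("u1", "v1")], hyperedges[1])
```

Here each destination becomes a hyperedge over the sources that interacted with it in the bin, matching the mapping of sources to nodes and destinations to hyperedges in Figure 1(b); the counts are kept separately so the weighted incidence representation is also available.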