A framework for statistical modelling of the extremes of longitudinal data, applied to elite swimming

Jess Spearing, Jonathan Tawn, David Irons, Tim Paulden

TL;DR

This work develops a principled framework for extreme-value analysis of longitudinal data, in which many short time series are observed at irregular times. A common generalised Pareto distribution (GPD) tail above a high threshold $u$ governs extremes across subjects, while a latent Gaussian process models within-subject temporal structure and between-subject heterogeneity through subject-specific attributes. Inference is Bayesian, which accommodates per-subject parameters and latent-space dependence and enables predictions of future extreme events and of the probability of breaking records, as demonstrated with elite swimmers in the men’s 100m breaststroke. The framework accommodates both asymptotic dependence and asymptotic independence in the extremes, and provides a path to quantifying career trajectories and predictive rankings for individuals in dynamic competitive populations.
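
For reference, the GPD tail above the threshold $u$ is usually written in the following standard form. The symbols $\sigma_u$ (scale) and $\xi$ (shape) are conventional notation assumed here, not notation taken from the summary; for swim-times, where smaller values are the extreme ones, the form would presumably be applied to negated times.

$$
\Pr(X > x \mid X > u) = \left[ 1 + \xi \, \frac{x - u}{\sigma_u} \right]_+^{-1/\xi}, \qquad x > u, \quad \sigma_u > 0,
$$

where $[z]_+ = \max(z, 0)$, and the case $\xi = 0$ is read as the limit $\exp\{-(x - u)/\sigma_u\}$. When $\xi < 0$ the distribution has a finite upper end point $u - \sigma_u/\xi$.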

Abstract

We develop methods, based on extreme value theory, for analysing observations in the tails of longitudinal data, i.e., a data set consisting of a large number of short time series, which are typically irregularly and non-simultaneously sampled, yet have some commonality in the structure of each series and exhibit independence between time series. Extreme value theory has not been considered previously for the unique features of longitudinal data. Across time series the data are assumed to follow a common generalised Pareto distribution, above a high threshold. To account for temporal dependence of such data we require a model to describe (i) the variation between the different time series properties, (ii) the changes in distribution over time, and (iii) the temporal dependence within each series. Our methodology has the flexibility to capture both asymptotic dependence and asymptotic independence, with this characteristic determined by the data. Bayesian inference is used given the need for inference of parameters that are unique to each time series. Our novel methodology is illustrated through the analysis of data from elite swimmers in the men's 100m breaststroke. Unlike previous analyses of personal-best data in this event, we are able to make inference about the careers of individual swimmers - such as the probability an individual will break the world record or swim the fastest time next year.
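
To make the common generalised Pareto tail concrete, here is a minimal Python sketch of how a threshold model turns into a record-breaking probability. It is not the authors' model: it ignores the between-swimmer variation, the changes over time and the within-series dependence described above. It assumes that excesses are measured as $u$ minus the swim-time, so that faster swims are larger excesses. The function names and the values of `U`, `SIGMA`, `XI` and `TARGET` are hypothetical, chosen only for illustration.

```python
import math
import random


def gpd_survival(y, sigma, xi):
    """P(Y > y) for a GPD excess Y >= 0 with scale sigma > 0 and shape xi."""
    if y <= 0:
        return 1.0
    if abs(xi) < 1e-12:  # exponential limit as xi -> 0
        return math.exp(-y / sigma)
    base = 1.0 + xi * y / sigma
    if base <= 0.0:  # beyond the finite upper end point (only when xi < 0)
        return 0.0
    return base ** (-1.0 / xi)


def gpd_sample(sigma, xi, rng):
    """Draw one GPD excess by inverting the survival function."""
    p = 1.0 - rng.random()  # uniform on (0, 1], used as a survival probability
    if abs(xi) < 1e-12:
        return -sigma * math.log(p)
    return sigma * (p ** (-xi) - 1.0) / xi


def prob_faster_than(target, u, sigma, xi):
    """P(time < target | time < u), treating u - time as a GPD excess."""
    return gpd_survival(u - target, sigma, xi)


def ultimate_time(u, sigma, xi):
    """Fastest time the model allows; finite only when xi < 0."""
    return u + sigma / xi if xi < 0 else -math.inf


# Hypothetical illustrative values, not estimates from the paper.
U, SIGMA, XI = 61.0, 1.0, -0.2
TARGET = 58.0  # a made-up target time in seconds

rng = random.Random(1)
simulated_times = [U - gpd_sample(SIGMA, XI, rng) for _ in range(5)]

print("P(faster than target | below threshold):",
      prob_faster_than(TARGET, U, SIGMA, XI))
print("Ultimate time implied by the model:", ultimate_time(U, SIGMA, XI))
print("Simulated sub-threshold swim-times:", simulated_times)
```

With a negative shape parameter the sketch yields a finite fastest attainable time. In the full framework, the corresponding quantities would be averaged over posterior draws of swimmer-specific parameters rather than computed from fixed values.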

Paper Structure

This paper contains 29 sections, 34 equations, and 6 figures.

Figures (6)

  • Figure 1: Swim-times (in seconds) plotted against the dates on which they were achieved, for the men's 100m breaststroke (long course) event. All competition best performances are shown for five swimmers over time. The dashed line indicates the threshold $u$.
  • Figure 2: Subject-specific posterior inferences. Left and middle: for the top 10 swimmers, the posteriors of their attributes $\alpha_i$ (left) and peak ages $\tau_i$ (middle). The colours identify swimmers as defined in Figure \ref{fig:individuals} (left). Right: the posterior mean and $95\%$ HPDI for the subject-specific asymptotic independence measure $\bar{\chi}_{i,\tau}$ against time lag $\tau$ in days.
  • Figure 3: Left: the posterior distributions of the expected next record swim-time (blue) and the ultimate swim-time (orange) for the men's 100m breaststroke, in seconds; the black vertical line marks Peaty's current record time. Right: posterior mean (solid line) and 95% HPDI (dashed lines) of the rate $\lambda_r(t)$ at which swims by elite swimmers beat Peaty's current record in year $t$.
  • Figure 4: Within-subject diagnostics for six top swimmers: observed swim dates and performances in seconds (black dots), and posterior predictive samples (coloured dots) for the dates of their past swims and for simulated future swim dates. The horizontal line is the threshold $u$; the vertical lines are the posterior mean and 95% HPDI for the peak age $\tau_i$.
  • Figure 5: Left: predictive probability, for the 10 most likely swimmers, that each will be the next swimmer in $\mathcal{I}^c$ to beat the current world record. Middle: the posterior distribution, for each swimmer, of the time at which they are the first of the swimmers in $\mathcal{I}^c$ to beat the current record. Right: the posterior distributions of the expected PBs over all future times (vertical lines show current PBs). Swimmers are identified by the colours in the left panel.
  • ...and 1 more figure