Data-Efficient Learning of Anomalous Diffusion with Wavelet Representations: Enabling Direct Learning from Experimental Trajectories

Gongyi Wang, Yu Zhang, Zihan Huang

TL;DR

The paper tackles data scarcity in anomalous-diffusion analysis by introducing a wavelet-based representation that maps each trajectory to a multi-channel stack of wavelet modulus scalograms. Combined with vision models, this representation enables efficient learning directly from experimental data and reduces reliance on large simulated datasets. On simulated benchmarks and on real SPT data from F-actin networks, the approach improves diffusion-exponent regression as well as diffusion-model and mesh-size classification. For diffusion-exponent prediction on experimental trajectories, a model trained on 1200 experimental tracks achieves lower errors than deep learning models trained on $10^6$ simulated trajectories. The authors also identify interpretable scale fingerprints in the wavelet spectra that reflect the underlying diffusion mechanisms, offering avenues for segmentation and unsupervised discovery in complex transport systems.

Abstract

Machine learning (ML) has become a versatile tool for analyzing anomalous diffusion trajectories, yet most existing pipelines are trained on large collections of simulated data. In contrast, experimental trajectories, such as those from single-particle tracking (SPT), are typically scarce and may differ substantially from the idealized models used for simulation, leading to degradation or even breakdown of performance when ML methods are applied to real data. To address this mismatch, we introduce a wavelet-based representation of anomalous diffusion that enables data-efficient learning directly from experimental recordings. This representation is constructed by applying six complementary wavelet families to each trajectory and combining the resulting wavelet modulus scalograms. We first evaluate the wavelet representation on simulated trajectories from the andi-datasets benchmark, where it clearly outperforms both feature-based and trajectory-based methods with as few as 1000 training trajectories and still retains an advantage on large training sets. We then use this representation to learn directly from experimental SPT trajectories of fluorescent beads diffusing in F-actin networks, where the wavelet representation remains superior to existing alternatives for both diffusion-exponent regression and mesh-size classification. In particular, when predicting the diffusion exponents of experimental trajectories, a model trained on 1200 experimental tracks using the wavelet representation achieves significantly lower errors than state-of-the-art deep learning models trained purely on $10^6$ simulated trajectories. We associate this data efficiency with the emergence of distinct scale fingerprints disentangling underlying diffusion mechanisms in the wavelet spectra.
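The construction summarized in the abstract (and detailed in the Figure 2 caption: per-component standardization, continuous wavelet transforms on a 24-scale grid, modulus scalograms resampled to $256\times256$ and stacked as channels) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: only two mother wavelets (real Morlet and Mexican hat) stand in for the six families, which are not all named here; the geometric scale grid from 1 to 32 is a placeholder for the paper's scale set; bilinear resampling is assumed; and all function names are hypothetical.

```python
import numpy as np

def real_morlet(t, w0=5.0):
    # Real Morlet wavelet: Gaussian-windowed cosine (w0 is an assumed default).
    return np.exp(-0.5 * t**2) * np.cos(w0 * t)

def mexican_hat(t):
    # Mexican hat (Ricker) wavelet, used here only as a stand-in second family.
    return (1.0 - t**2) * np.exp(-0.5 * t**2)

def standardize(x):
    # Zero mean, unit variance; constant signals are only centered.
    x = np.asarray(x, dtype=float)
    s = x.std()
    return (x - x.mean()) / s if s > 0 else x - x.mean()

def cwt_modulus(x, scales, wavelet):
    # Modulus scalogram |W(a, b)| via direct convolution with the scaled wavelet.
    # Both wavelets above are even, so convolution equals correlation.
    L = len(x)
    out = np.empty((len(scales), L))
    for i, a in enumerate(scales):
        half = int(np.ceil(4 * a))              # support of +-4 scale units
        t = np.arange(-half, half + 1) / a
        kernel = wavelet(t) / np.sqrt(a)
        full = np.convolve(x, kernel, mode="full")
        out[i] = np.abs(full[half:half + L])    # crop back to the signal length
    return out

def resize_bilinear(img, size=256):
    # Separable linear interpolation to a (size, size) grid.
    rows, cols = img.shape
    r = np.linspace(0, rows - 1, size)
    c = np.linspace(0, cols - 1, size)
    tmp = np.stack([np.interp(c, np.arange(cols), row) for row in img])
    return np.stack([np.interp(r, np.arange(rows), col) for col in tmp.T], axis=1)

def wavelet_representation(traj, scales, wavelets, size=256):
    # Returns an array of shape (len(wavelets) * d, size, size).
    traj = np.asarray(traj, dtype=float)
    if traj.ndim == 1:
        traj = traj[:, None]
    channels = []
    for k in range(traj.shape[1]):              # split into 1D components
        x = standardize(traj[:, k])
        for w in wavelets:
            channels.append(resize_bilinear(cwt_modulus(x, scales, w), size))
    return np.stack(channels)

# Usage: a 2D Brownian trajectory of length 100 with an assumed 24-scale grid.
rng = np.random.default_rng(0)
traj = np.cumsum(rng.standard_normal((100, 2)), axis=0)
scales = np.geomspace(1.0, 32.0, 24)
rep = wavelet_representation(traj, scales, [real_morlet, mexican_hat])
print(rep.shape)  # (4, 256, 256): 2 wavelets x 2 spatial components
```

With six wavelet families and a $d$-dimensional trajectory, the same loop would produce the $6d$-channel tensor described in the Figure 2 caption, which can then be passed to a vision model.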

Paper Structure

This paper contains 18 sections, 5 equations, 16 figures, 2 tables.

Figures (16)

  • Figure 1: Representative example of the continuous wavelet transform applied to a 1D trajectory. (a) Standardized 1D trajectory $x(t)$ of length $L=100$. (b) Corresponding wavelet modulus scalogram $|W_\psi(a, b)|$ computed with the real Morlet wavelet over 24 discrete scales.
  • Figure 2: Schematic construction of the wavelet representation. A multidimensional trajectory is first split into its 1D components, which are standardized. Subsequently, continuous wavelet transforms with six mother wavelets on the 24-scale grid of Eq. \ref{scaleset} produce wavelet modulus scalograms, which are resampled to a common $256\times256$ resolution and concatenated into a $6d$-channel tensor. This wavelet representation is then used as the input to downstream supervised (vision) models.
  • Figure 3: (a) Schematic illustration of F-actin networks with mesh sizes $\xi \approx 225, 250, 300, 550, 625,$ and $750~\mathrm{nm}$ (top row), together with representative 2D bead trajectories recorded under each condition (bottom row). (b) Distributions of the diffusion exponent $\alpha$ estimated from TA-MSD fits for all trajectories at each mesh size; histograms show the counts of trajectories versus $\alpha$, and the red dashed line in each panel marks the mean exponent for that condition.
  • Figure 4: Performance comparison of wavelet-, feature-, and trajectory-based representations on 2D simulated trajectories of length $L=100$. (a)-(b) Heatmaps displaying the performance on the test set across varying training set sizes $N_{\rm train}$. (a) MAE for diffusion-exponent regression (blue scale; darker indicates lower error). (b) Micro-averaged F1 scores for diffusion-model classification (red scale; darker indicates higher F1). (c)-(d) Detailed performance comparison at the small-data regime $N_{\rm train}=1000$, highlighting a clear advantage of wavelet-based models over both feature-based and trajectory-based alternatives.
  • Figure 5: Detailed performance analysis on simulated trajectories in the small-data regime ($N_{\rm train}=1000$). (a$_1$, a$_2$) Boxplots of test MAE and F1 scores over 50 independent training subsets. Wavelet- and feature-based models exhibit high stability (tight distributions), contrasting with the high variance of trajectory-based models. (b$_1$, b$_2$) Dependence on trajectory length $L$; the wavelet representation outperforms the baselines across all lengths. (c$_1$, c$_2$) MAE and F1 scores for 1D, 2D, and 3D trajectories, confirming consistent performance rankings across spatial dimensions. (d$_1$, d$_2$) Performance versus signal-to-noise ratio (SNR); the wavelet representation demonstrates higher noise resilience compared to feature- and trajectory-based alternatives.
  • ...and 11 more figures
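The Figure 3(b) caption refers to diffusion exponents $\alpha$ estimated from TA-MSD fits. As a rough illustration of what such a fit involves, the sketch below computes the time-averaged mean squared displacement of a single trajectory and fits $\mathrm{MSD}(\tau) \propto \tau^{\alpha}$ by linear regression in log-log space. The number of fitted lags (`max_lag=10`), the unit time step, and the function names are assumptions made for this example; the fitting protocol actually used in the paper is not stated here.

```python
import numpy as np

def ta_msd(traj, max_lag):
    # Time-averaged MSD of one trajectory for lags 1..max_lag.
    traj = np.asarray(traj, dtype=float)
    if traj.ndim == 1:
        traj = traj[:, None]
    lags = np.arange(1, max_lag + 1)
    msd = np.array([
        np.mean(np.sum((traj[lag:] - traj[:-lag]) ** 2, axis=1))
        for lag in lags
    ])
    return lags, msd

def fit_alpha(traj, max_lag=10, dt=1.0):
    # Slope of log(MSD) versus log(lag time) gives the diffusion exponent.
    lags, msd = ta_msd(traj, max_lag)
    slope, _ = np.polyfit(np.log(lags * dt), np.log(msd), 1)
    return slope

# Usage: Brownian motion should give alpha close to 1,
# and straight-line (ballistic) motion gives alpha = 2.
rng = np.random.default_rng(0)
brownian = np.cumsum(rng.standard_normal((5000, 2)), axis=0)
ballistic = np.arange(200.0)[:, None] * np.array([1.0, 2.0])
alpha_bm = fit_alpha(brownian)
alpha_ballistic = fit_alpha(ballistic)
```

For short or noisy experimental tracks, the estimate is sensitive to the choice of lags, which is one reason a learned estimator may be preferred over a direct fit.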