Data driven discovery of human mobility models

Hao Guo; Weiyu Zhang; Junjie Yang; Yuanqiao Hou; Lei Dong; Yu Liu

TL;DR

This work tackles the lack of principled analytic mobility models by applying symbolic regression to multi-country mobility data, distilling interpretable expressions directly from observations. By modeling allocation weights and balancing predictive accuracy with expression complexity, the approach recovers the classic exponential-decay gravity model and reveals novel forms such as exponential-power-law distance decay, with interpretations grounded in maximum entropy. The study demonstrates geographic heterogeneity, robustness to noise, and a clear path to extending mobility models beyond traditional forms, providing a systematic framework for discovering mathematical structures in social phenomena from data. The combination of data-driven discovery and entropy-based interpretation offers a powerful tool for understanding and predicting human mobility at multiple scales.

Abstract

Human mobility is a fundamental aspect of social behavior, with broad applications in transportation, urban planning, and epidemic modeling. However, for decades new mathematical formulas to model mobility phenomena have been scarce and usually discovered by analogy to physical processes, such as the gravity model and the radiation model. These sporadic discoveries are often thought to rely on intuition and luck in fitting empirical data. Here, we propose a systematic approach that leverages symbolic regression to automatically discover interpretable models from human mobility data. Our approach finds several well-known formulas, such as the distance decay effect and classical gravity models, as well as previously unknown ones, such as an exponential-power-law decay that can be explained by the maximum entropy principle. By relaxing the constraints on the complexity of model expressions, we further show how key variables of human mobility are progressively incorporated into the model, making this framework a powerful tool for revealing the underlying mathematical structures of complex social phenomena directly from observational data.

Paper Structure (8 sections, 6 equations, 11 figures, 8 tables)


Research design
Discovering symbolic models of mobility flows
Geographical heterogeneity of mobility models
Stability of Symbolic Regression under noise
Empirical data.
Simulated data.
Symbolic Regression.
Maximum entropy approach to the gravity model.

Figures (11)

Figure 1: The analytical framework of mobility model distillation. (a) Mobility flow of Guangdong, China. The flow volume $F_{ij}$ from origin $i$ to destination $j$ is the response variable. (b) The explanatory variables include the workplace population $w$, the residential population $r$, geographic distance $d_{ij}$, and intervening opportunities $s_w,s_r$, calculated with workplace and residential population, respectively. (c) The overall workflow to automatically distill models from mobility data. Starting with response variables and explanatory variables along with seven common operators, we use a genetic-programming-based SR program to search for appropriate model forms. In each iteration, the SR program generates models for allocation weight function $f$. For each origin $i$, the total outflow $O_i$ is allocated to each destination based on the corresponding allocation weights $f_{ij}$. The MSE between the allocated flow $\hat{F}_{ij}$ and the actual flow $F_{ij}$ is then calculated and fed back into the SR program for expression optimization.
Figure 1: Examples for calculating the model complexity with binary expression trees. (a) Under the allocation weight setting, the gravity model with power-law decay $f_{ij}=m_j/d_{ij}^\beta$ is expressed as a tree with 5 nodes and hence has a complexity of 5. (b) The radiation model $f_{ij}=\frac{m_j}{(m_i+s_{ij})(m_i+s_{ij}+m_j)}$ has a complexity of 11.
Figure 2: SR results on mobility flow data. (a-c) Pareto frontiers of SR models on the Guangdong, England, and US datasets. As flow magnitudes vary across datasets, we normalize the RMSE by that of the simplest gravity model ($m_j/d_{ij}$). The accuracy and complexity of six existing models are marked with crosses (some existing models are not shown because their errors exceed the range of the y-axis). Expressions with complexity 1 are excluded, as a single variable or constant is not sufficient to model the mobility flow. Across all datasets and complexity levels, the SR models consistently outperform or match the existing models in accuracy. (d) Expressions of the Pareto optimal SR models. The notations follow Fig. \ref{fig:framework}, except that $d$ is short for $d_{ij}$ and $w,r$ for $w_j, r_j$ (as origin populations are not present in these formulas). To align with existing models, some expressions are not given in the form with the lowest complexity. The distilled models are classified based on the captured effects governing human mobility. (e) Expressions of the considered existing models. GMZipf: Zipf's simple gravity model [Zipf46]; GMPow/GMExp: gravity model with power-law/exponential decay [Wil70]; OPS: opportunity priority selection model [LiYa19]; RM: radiation model [SGM12]; IO: Schneider's intervening opportunity model [Sch59].
Figure 2: SR results on Guangdong (county level) and Beijing-Tianjin-Hebei (BTH) data. (a-b) Pareto frontiers of SR models on Guangdong and BTH data. As in Fig. \ref{fig:srmain}, the RMSEs are normalized by that of the simplest gravity model ($m_j/d_{ij}$), and the crosses represent the accuracy and complexity of six existing models (Fig. \ref{fig:srmain}e). (c) Expressions of the Pareto optimal SR models. The notations follow Fig. \ref{fig:framework}, except that $d$ is short for $d_{ij}$. The distilled models are classified based on the captured effects governing human mobility. A decay function of intervening opportunity is viewed as a special form of distance decay.
Figure 3: Spatial heterogeneity of the mobility model across the US. (a) The distance distribution of commuting flows, grouped by geographic region. The predicted flows are from the complexity 5 SR model fitted on each subset, grouped by origin and destination region. For inter-region flows, each subplot shows outflows from one region, and the line color corresponds to the destination region. (b) SR models at complexity 5 for commuting flows. Each row/column represents the region containing the residence/workplace. The notations in formulas are the same as in Fig. \ref{fig:framework}. The accuracy of each model is compared with whichever of the power-law and exponential gravity models produces the lower error on that subset of flow data. A red/blue cell indicates that the accuracy of the SR model is better/worse (measured with RMSE); in such cases, the gravity model is given in parentheses. A grey cell means the SR model is identical to the gravity model.
...and 6 more figures
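
The allocation step in the framework caption and the node-count complexity in the expression-tree caption can be illustrated with a small sketch. It is not the authors' code. It assumes that allocation is proportional, $\hat{F}_{ij} = O_i f_{ij} / \sum_k f_{ik}$, and that self-flows receive zero weight; the captions do not spell out either detail. The function names, the toy populations and distances, and the exponent `beta = 2.0` are all hypothetical.

```python
import numpy as np

def allocate_flows(weights, outflow):
    """Split each origin's total outflow O_i across destinations in
    proportion to the allocation weights f_ij (assumed normalization)."""
    weights = np.asarray(weights, dtype=float)
    outflow = np.asarray(outflow, dtype=float)
    row_sums = weights.sum(axis=1, keepdims=True)
    # Rows whose weights sum to zero receive no allocated flow.
    shares = np.divide(weights, row_sums,
                       out=np.zeros_like(weights), where=row_sums > 0)
    return outflow[:, None] * shares

def mse(predicted, observed):
    """Mean squared error between allocated and observed flows."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(observed, dtype=float)
    return float(np.mean(diff ** 2))

def complexity(tree):
    """Count nodes of an expression tree written as nested tuples
    (operator, child, child); any non-tuple value is a leaf."""
    if isinstance(tree, tuple):
        return 1 + sum(complexity(child) for child in tree[1:])
    return 1

# Toy data (hypothetical): three locations with populations m and distances d.
m = np.array([100.0, 200.0, 300.0])
d = np.array([[1.0, 10.0, 20.0],
              [10.0, 1.0, 15.0],
              [20.0, 15.0, 1.0]])
np.fill_diagonal(d, np.inf)          # infinite self-distance gives zero self-weight
beta = 2.0                           # hypothetical decay exponent
weights = m[None, :] / d ** beta     # f_ij = m_j / d_ij^beta
outflow = np.array([50.0, 80.0, 120.0])
predicted = allocate_flows(weights, outflow)

# Expression trees for the two models named in the complexity caption.
power_law_gravity = ("/", "m_j", ("^", "d_ij", "beta"))
radiation = ("/", "m_j",
             ("*", ("+", "m_i", "s_ij"),
                   ("+", ("+", "m_i", "s_ij"), "m_j")))

print(complexity(power_law_gravity), complexity(radiation))
print(predicted.sum(axis=1))         # each row sums back to its origin's outflow
```

Because the weights are normalized per origin, only their relative sizes matter. Any factor that depends solely on the origin cancels out, which is consistent with the caption's remark that origin populations do not appear in the distilled formulas.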
