Unsupervised embedding of trajectories captures the latent structure of scientific migration

Dakota Murray, Jisung Yoon, Sadamori Kojaku, Rodrigo Costas, Woo-Sung Jung, Staša Milojević, Yong-Yeol Ahn

Abstract

Human migration and mobility drive major societal phenomena including epidemics, economies, innovation, and the diffusion of ideas. Although human mobility and migration have been heavily constrained by geographic distance throughout history, advances and globalization are making other factors such as language and culture increasingly important. Advances in neural embedding models, originally designed for natural language, provide an opportunity to tame this complexity and open new avenues for the study of migration. Here, we demonstrate the ability of the model word2vec to encode nuanced relationships between discrete locations from migration trajectories, producing an accurate, dense, continuous, and meaningful vector-space representation. The resulting representation provides a functional distance between locations, as well as a digital double that can be distributed, re-used, and itself interrogated to understand the many dimensions of migration. We show that the unique power of word2vec to encode migration patterns stems from its mathematical equivalence with the gravity model of mobility. Focusing on the case of scientific migration, we apply word2vec to a database of three million migration trajectories of scientists derived from the affiliations listed on their publication records. Using techniques that leverage its semantic structure, we demonstrate that embeddings can learn the rich structure that underpins scientific migration, such as cultural, linguistic, and prestige relationships at multiple levels of granularity. Our results provide a theoretical foundation and methodological framework for using neural embeddings to represent and understand migration both within and beyond science.
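As a rough illustration of the gravity model mentioned in the abstract, the sketch below computes an expected flux between two locations from their "masses" (for example, researcher counts) and a distance between them. The function names, the parameters `alpha` and `c`, and the two decay forms (power-law and exponential) are assumptions made for illustration; the abstract does not specify which functional form or distance measure the authors use, and cosine distance between embedding vectors is only one plausible choice of functional distance.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two vectors.

    Assumed here as one possible 'functional distance' between
    location embeddings; the source does not fix this choice.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def gravity_flux(mass_i, mass_j, distance, decay="power", alpha=1.0, c=1.0):
    """Expected flux under a generic gravity model: c * m_i * m_j * f(d).

    decay="power" uses f(d) = d ** (-alpha);
    decay="exp"   uses f(d) = exp(-alpha * d).
    Both forms and all parameter names are illustrative assumptions.
    """
    if decay == "power":
        f = distance ** (-alpha)
    elif decay == "exp":
        f = math.exp(-alpha * distance)
    else:
        raise ValueError("decay must be 'power' or 'exp'")
    return c * mass_i * mass_j * f
```

With a geographic distance one might call `gravity_flux(m_i, m_j, d_geo, decay="power")`, and with embedding vectors `gravity_flux(m_i, m_j, cosine_distance(v_i, v_j), decay="exp")`; in practice `alpha` and `c` would be fitted to observed flux rather than set by hand.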

Paper Structure

This paper contains 34 sections, 31 equations, 38 figures, 6 tables.

Figures (38)

  • Figure 1: Neural embedding provides a functional distance that improves the predictive power of the gravity model of migration across three distinct human trajectory datasets. a. A unique identifier is assigned to each organization and they are assembled into an affiliation trajectory ordered by year of publication (top). If an author lists multiple organization affiliations within the same year, we shuffle the order within that year in each training iteration (bottom, see Supporting Information). b. Embedding distance better explains the expected flux of global scientific migration than does geographic distance (c). The red line is the line of best fit. Black dots are mean flux across binned distances. 99% confidence intervals are plotted for the mean flux in each bin. Correlation is calculated on the data in the log-log scale ($p < 0.0001$ across all fits). The lightness of each hex bin indicates the frequency of organization pairs within it. d. Predictions of flux between institutions made using embedding distance outperform those made using geographic distance (e). Box-plots show the distribution of actual flux for binned values of predicted flux. Box color corresponds to the degree to which the distribution overlaps with $y = x$. "RMSE" is the root-mean-squared error between the actual and predicted values. Embedding distance also consistently produces a powerful functional distance for U.S. flight itineraries and Korean accommodation reservations (see Supporting Information).
  • Figure 2: Projection of embedding space reveals complex multi-scale structure of organizations. a. UMAP projection \cite{mcinnes2018umap} of the embedding space reveals country-level clustering. Each point corresponds to an organization and its size indicates the average annual number of mobile and non-mobile authors affiliated with that organization from 2008 to 2019. Color indicates the region. The separation of organizations in Quebec and the rest of Canada is highlighted. b. Zooming into (re-projecting) the area containing countries in Western, South, and Southeast Asia shows a geographic and cultural gradient of country clusters. c. Similarly, zooming into the area containing organizations in Spain, Portugal, South, and Central America shows clustering by most widely-spoken majority language group: Spanish and Portuguese. d. Doing the same for organizations in the United States reveals geographic clustering based on state, roughly grouped by Census Bureau-designated regions. e. Zooming in further on Massachusetts reveals clusters based on urban center (Boston, Worcester), organizational sector (hospitals vs. universities), and university systems and prestige (UMass system vs. Harvard, MIT).
  • Figure 3: Geography, then language, conditions international migration. a. Hierarchically clustered similarity matrix of country vectors aggregated as the mean of all organization vectors within countries with at least 25 organizations. Color of matrix cells corresponds to the cosine similarity between country vectors. Color of country names corresponds to their cluster. Color of three cell columns separated from the matrix corresponds to, from left to right, the region of the country, the language family \cite{ethnologue}, and the dominant language. b. Element-centric cluster similarity \cite{gates2019element} reveals the factors dictating hierarchical clustering (see Methods). Region better explains the grouping of country vectors at higher levels of the clustering. Language family, and then the most widely-spoken language, better explain the fine-grained grouping of countries.
  • Figure 4: Embedding captures latent geography and prestige hierarchy. a. Comparison between the ranking of organizations in the Times ranking and the embedding ranking derived using SemAxis. Unfilled points are the top and bottom five universities used to span the axis. Even when considering only a total of ten organization vectors, the estimate of the Spearman's rank correlation between the embedding and Times ranking is $\rho = 0.73$ ($n = 145$, $p < 0.0001$), which increases when more top-and-bottom ranked universities are included (Fig. \ref{fig:supp:semaxis_compare}). b. The Times ranking is correlated with the Leiden Ranking of U.S. universities with Spearman's $\rho = 0.87$ and $p < 0.001$. c-f. Illustration of SemAxis projection along two axes: the latent geographic axis, from California to Massachusetts (left to right), and the prestige axis. Shown for U.S. universities (c), regional and liberal arts colleges (d), research institutes (e), and government organizations (f). Full organization names are listed in Table \ref{table:supp:orglabels}.
  • Figure 5: Size of organization embedding vectors captures prestige and size of organizations. a. Size (L2 norm) of organization embedding vectors compared to the number of researchers for U.S. universities. Color indicates the rank of the university from the Times ranking, with 1 being the highest ranked university. Uncolored points are universities not listed on the Times ranking. A concave shape emerges, wherein larger universities tend to be more distant from the origin (large L2 norm); however, the more prestigious universities tend to have smaller L2 norms. b. We find a similar concave-curve pattern across many countries such as the United States, China, Australia, Brazil, and others (inset, and Fig. \ref{fig:concave30}). Some countries exhibit variants of this pattern, such as Egypt, which is missing the right side of the curve. The loess regression lines are shown for each selected country, and for the aggregate of remaining countries, with ribbons mapping to the 99% confidence intervals based on a normal distribution. Loess lines are also shown for organizations in Australia, Brazil, and Egypt (inset).
  • ...and 33 more figures
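
The SemAxis projection referred to in the Figure 4 caption can be sketched roughly as follows: an axis is spanned by subtracting the mean vector of one pole (for example, the bottom-ranked universities) from the mean vector of the other pole (the top-ranked universities), and each organization is scored by the cosine similarity between its vector and that axis. This is a minimal pure-Python sketch of the general SemAxis idea, not the authors' code; the function names and the toy two-dimensional vectors in the usage note are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semaxis_score(vector, top_vectors, bottom_vectors):
    """Score a vector along the axis from the bottom pole to the top pole.

    The axis is mean(top_vectors) - mean(bottom_vectors); the score is
    the cosine similarity between `vector` and that axis, so it lies
    in [-1, 1], with higher values closer to the top pole.
    """
    top = mean_vector(top_vectors)
    bottom = mean_vector(bottom_vectors)
    axis = [t - b for t, b in zip(top, bottom)]
    return cosine_similarity(vector, axis)
```

For instance, with hypothetical poles `top_vectors=[[1, 0], [1, 0]]` and `bottom_vectors=[[-1, 0]]`, a vector pointing toward the top pole scores near 1, an orthogonal vector scores near 0, and a vector pointing toward the bottom pole scores near -1. The same `mean_vector` helper also illustrates how a country vector could be formed as the mean of its organization vectors, as described in the Figure 3 caption.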