Using Text-Based Life Trajectories from Swedish Register Data to Predict Residential Mobility with Pretrained Transformers

Philipp Stark, Alexandros Sopasakis, Ola Hall, Markus Grillitsch

TL;DR

This paper tackles two challenges in long-running administrative data, high-cardinality categorical variables and coding schemes that shift over time, by converting Swedish register codes into natural-language life trajectories. It then evaluates a range of NLP models, including compact transformers, on predicting residential mobility from 2001–2013 trajectories, with mobility outcomes observed in 2013–2017. It finds that textual representations preserve meaningful information and that transformer-based models yield robust predictive performance even under class imbalance. The study shows that textual life trajectories can outperform static baselines and that small, efficient models can achieve competitive results, offering a scalable framework for longitudinal social-science analysis. Overall, the approach enables more flexible, semantically rich modeling of life-course pathways and provides a rigorous testbed for sequence-modeling methods in harmonized register data.

Abstract

We transform large-scale Swedish register data into textual life trajectories to address two long-standing challenges in data analysis: high cardinality of categorical variables and inconsistencies in coding schemes over time. Leveraging this uniquely comprehensive population register, we convert register data from 6.9 million individuals (2001-2013) into semantically rich texts and predict individuals' residential mobility in later years (2013-2017). These life trajectories combine demographic information with annual changes in residence, work, education, income, and family circumstances, allowing us to assess how effectively such sequences support longitudinal prediction. We compare multiple NLP architectures (including LSTM, DistilBERT, BERT, and Qwen) and find that sequential and transformer-based models capture temporal and semantic structure more effectively than baseline models. The results show that textualized register data preserves meaningful information about individual pathways and supports complex, scalable modeling. Because few countries maintain longitudinal microdata with comparable coverage and precision, this dataset enables analyses and methodological tests that would be difficult or impossible elsewhere, offering a rigorous testbed for developing and evaluating new sequence-modeling approaches. Overall, our findings demonstrate that combining semantically rich register data with modern language models can substantially advance longitudinal analysis in social sciences.
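
To make the idea of a textual life trajectory concrete, the following is a minimal sketch of how coded yearly records might be turned into sentences. It is not the authors' pipeline: the field names, the placeholder codes (`M01`, `E1`, ...), the label tables, and the sentence templates are all assumptions made for illustration, and only two of the domains mentioned above (residence and education) are covered.

```python
from typing import Dict, List, Optional

# Hypothetical lookup tables. The codes and labels are placeholders,
# not values from the actual Swedish register codebooks.
MUNICIPALITY_LABELS: Dict[str, str] = {
    "M01": "Municipality A",
    "M02": "Municipality B",
}
EDUCATION_LABELS: Dict[str, str] = {
    "E1": "upper secondary education",
    "E2": "a university degree",
}

Record = Dict[str, object]  # assumed keys: "year", "municipality", "education"


def describe_year(record: Record, previous: Optional[Record]) -> str:
    """Render one yearly record as a sentence, mentioning only changes
    relative to the previous year (everything is stated in the first year)."""
    parts: List[str] = []

    home = MUNICIPALITY_LABELS.get(str(record["municipality"]), "an unknown municipality")
    if previous is None:
        parts.append(f"lived in {home}")
    elif record["municipality"] != previous["municipality"]:
        parts.append(f"moved to {home}")

    education = EDUCATION_LABELS.get(str(record["education"]), "an unknown education level")
    if previous is None or record["education"] != previous["education"]:
        parts.append(f"held {education}")

    if not parts:
        parts.append("had no recorded changes")
    return f"In {record['year']}, the person " + " and ".join(parts) + "."


def build_trajectory(records: List[Record]) -> str:
    """Sort records by year and join the yearly sentences into one text."""
    sentences: List[str] = []
    previous: Optional[Record] = None
    for record in sorted(records, key=lambda r: r["year"]):
        sentences.append(describe_year(record, previous))
        previous = record
    return " ".join(sentences)


if __name__ == "__main__":
    example = [
        {"year": 2002, "municipality": "M02", "education": "E1"},
        {"year": 2001, "municipality": "M01", "education": "E1"},
    ]
    print(build_trajectory(example))
```

Because each code is replaced by its label before modeling, a change in the coding scheme would only require updating the lookup table, and the resulting text could be passed to any standard tokenizer.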

Paper Structure

This paper contains 10 sections, 1 figure, and 5 tables.

Figures (1)

  • Figure 1: t-SNE visualization (perplexity: 10) of a random subsample of 50,000 life-trajectory embeddings from the test set, created using the Qwen3 4B massive text embedding model. Points are colored by residential mobility status (mobility vs. no mobility).