Human Mobility Datasets Enriched With Contextual and Social Dimensions
Chiara Pugliese, Francesco Lettich, Guido Rocchietti, Chiara Renso, Fabio Pinelli
TL;DR
The paper addresses the scarcity of publicly available, semantically enriched human mobility data by releasing two datasets built from real GPS traces (Paris and New York) and enriched with stops, POIs, weather, inferred transportation modes, and synthetic LLM-generated social media posts. It provides a reproducible pipeline that outputs both tabular and RDF representations to support semantic reasoning and FAIR data practices. The work details data collection, preprocessing, semantic enrichment, and synthetic data generation, and supports tasks ranging from behavior modeling to LLM-based multimodal mobility analysis. By combining real movement data with synthetic text and semantic web representations, the resource supports interpretable, multimodal urban analytics and can serve as a benchmark for future mobility and knowledge-management research.
Abstract
In this resource paper, we present two publicly available datasets of semantically enriched human trajectories, together with the pipeline to build them. The trajectories are publicly available GPS traces retrieved from OpenStreetMap. Each dataset includes contextual layers such as stops, moves, points of interest (POIs), inferred transportation modes, and weather data. A novel semantic feature is the inclusion of synthetic, realistic social media posts generated by Large Language Models (LLMs), enabling multimodal and semantic mobility analysis. The datasets are available in both tabular and Resource Description Framework (RDF) formats, supporting semantic reasoning and FAIR data practices. They cover two structurally distinct, large cities: Paris and New York. Our open-source, reproducible pipeline allows for dataset customization, while the datasets support research tasks such as behavior modeling, mobility prediction, knowledge graph construction, and LLM-based applications. To our knowledge, our resource is the first to combine real-world movement, structured semantic enrichment, LLM-generated text, and semantic web compatibility in a reusable framework.
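To make the relationship between the tabular and RDF representations concrete, the following is a minimal sketch of how one enriched stop record could be serialized as RDF in N-Triples syntax. It is not the authors' pipeline: the namespace `http://example.org/mobility/`, the column names (`stop_id`, `user_id`, `poi`, `weather`, `post`, and so on), the property names, and the sample values are all hypothetical and chosen only for illustration. The published datasets may follow a different schema and vocabulary.

```python
# Illustrative only: the namespace, column names, and property names below are
# assumptions, not the schema used by the published datasets.
BASE = "http://example.org/mobility/"
XSD = "http://www.w3.org/2001/XMLSchema#"
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def escape_literal(text: str) -> str:
    """Escape characters that may not appear raw in an N-Triples string literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def literal(value, datatype=None) -> str:
    """Format a value as an N-Triples literal, optionally typed with an XSD datatype."""
    lexical = f'"{escape_literal(str(value))}"'
    return f"{lexical}^^<{XSD}{datatype}>" if datatype else lexical


def stop_to_ntriples(row: dict) -> list[str]:
    """Turn one tabular stop record (a dict) into a list of N-Triples lines."""
    subject = f"<{BASE}stop/{row['stop_id']}>"
    triples = [
        (subject, RDF_TYPE, f"<{BASE}Stop>"),
        (subject, f"<{BASE}ofUser>", f"<{BASE}user/{row['user_id']}>"),
        (subject, f"<{BASE}lat>", literal(row["lat"], "double")),
        (subject, f"<{BASE}lon>", literal(row["lon"], "double")),
        (subject, f"<{BASE}startTime>", literal(row["start"], "dateTime")),
        (subject, f"<{BASE}endTime>", literal(row["end"], "dateTime")),
        (subject, f"<{BASE}nearPOI>", literal(row["poi"])),
        (subject, f"<{BASE}weather>", literal(row["weather"])),
        (subject, f"<{BASE}post>", literal(row["post"])),
    ]
    return [f"{s} {p} {o} ." for s, p, o in triples]


# A made-up record standing in for one row of the tabular dataset.
example_row = {
    "stop_id": "s1",
    "user_id": "u42",
    "lat": 48.8584,
    "lon": 2.2945,
    "start": "2023-05-01T09:00:00",
    "end": "2023-05-01T09:30:00",
    "poi": "Tour Eiffel",
    "weather": "rain",
    "post": 'Long queue, but the view is "worth it"!',
}

if __name__ == "__main__":
    for line in stop_to_ntriples(example_row):
        print(line)
```

Each column of the tabular record becomes one triple about the stop, which is what allows the same information to be loaded into a triple store and queried alongside other linked data. A real pipeline would more likely rely on an RDF library and an established ontology than on hand-built strings.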
