HoWDe: a validated algorithm for Home and Work location Detection

Sílvia De Sojo, Lorenzo Lucchini, Ollin D. Langle-Chimal, Samuel P. Fraiberger, Laura Alessandretti

TL;DR

HoWDe addresses the lack of robust, reproducible home/work detection from smartphone GPS data by introducing an open-source, modular pipeline that explicitly handles missing data and varying sampling rates. Validation on two ground-truth datasets (D1 and D2) shows up to 97.3% home and 88.1% work detection accuracy in D1, and substantial but lower accuracy in D2 due to pandemic-related mobility changes. The approach uses a sliding-window, fraction-based scheme with a small set of interpretable parameters, enabling robust detection across demographics and geographies and allowing downstream analyses of employment rates and commuting patterns. By providing validated code and privacy-preserving data practices, HoWDe promotes standardization, comparability, and responsible data sharing in human mobility research.

Abstract

Smartphone location data have become a key resource for understanding urban mobility, yet extracting actionable insights requires robust and reproducible preprocessing pipelines. A central step is the identification of individuals' home and work locations, which underpins analyses of commuting, employment, accessibility, and socioeconomic patterns. However, existing approaches are often ad hoc, data-specific, and difficult to reproduce, limiting comparability across studies and datasets. We introduce HoWDe, an open-source software library for detecting home and work locations from large-scale mobility data. HoWDe implements a transparent, modular pipeline explicitly designed to handle missing data, heterogeneous sampling rates, and differences in data sparsity across individuals. The code lets users tune a small set of interpretable parameters, so that the algorithm can be adapted to diverse applications and datasets. Using two unique ground-truth datasets comprising 5,099 individuals across 68 countries, we show that HoWDe achieves home and work detection accuracies of up to 97% and 88%, respectively, with consistent performance across demographic groups and geographic contexts. We further demonstrate how parameter settings propagate to downstream metrics such as employment estimates and commuting flows, highlighting the importance of transparent methodological choices. By providing a validated, documented, and easily deployable pipeline, HoWDe supports scalable in-house preprocessing and facilitates the sharing of privacy-preserving mobility datasets. Our software and evaluation benchmarks establish methodological standards that enhance the robustness and reproducibility of human mobility research at urban and national scales.

Paper Structure

This paper contains 26 sections, 1 equation, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Behavioral visit profiles to home and work locations. a. Fraction of visits per hour of the day to the annotated home locations in dataset D1, for weekdays (black) and weekends (grey) across individuals. The shaded orange area highlights the period with the highest probability of being at home. b. Fraction of visits per hour of the day to the annotated work locations in dataset D1, for weekdays and weekends. The shaded blue area highlights the period with the highest probability of being at work during weekdays. c. User-day profiles by hours at home (orange), at other locations (grey), and in periods without location data (light grey) during weekdays. d. Hours at work (blue), at other locations (grey), or without data (light grey), per hour of the day across days with data for weekdays. e-f. User-day profiles during weekends for hours at home (orange) and at work (blue), respectively. The user-day profiles are sorted by cluster, hour of the first home visit, total time spent at home, and hours without data. Clusters are separated by black vertical lines and labelled with Roman numerals; the fraction of days per cluster is noted on the x-axis.
  • Figure 2: Schematics of the HoWDe algorithm workflow for home/work location detection. a. Network of visited locations. The diameter represents the number of visits. Stops visited during night hours are shown in orange, and those visited during work hours in blue. b. Sequences of locations visited hourly. The pie charts display the allocation of visits across locations during the night hours (orange) and work hours (blue) on each day. Note that on Saturday, 6/01, the night-hours data were discarded because too few hours had available data (exemplifying the temporal coverage filter tuned by the parameter $C_{hours}$). In this minimal example, no location appears during both night and work hours; however, this may happen in real data. c. For each day $t$, we illustrate a sliding window of $3$ consecutive days centered on $t$. For work detection, the sliding window excludes weekends. d. Aggregation step for the window centered on Friday, 5th. For each location, we compute the average fraction of time it was visited during night hours (orange) and during work hours (blue). Additionally, we compute the fraction of days the location is visited at least once during work hours. Locations are then sorted in descending order by these fractions, and the top ones are selected as the estimated Home (L1) and Work (L2) locations for Friday.
  • Figure 3: Home and Work validation across configurations. a. Detected accuracy of the home location detection for datasets D1 (first column) and D2 (second column). Each marker represents a different configuration of the minimum fraction of hours an individual must be at home per night ($f_{hours, H}$). For dataset D1, we explore how the results change as the sliding window size increases (x-axis), while for dataset D2, we focus on a single sliding window (comprising the entire data period). b. Fraction of not detected home locations for dataset D1 (first column) and dataset D2 (second column). The legend is shared with panel a. c. Detected accuracy of the work location detection for datasets D1 (first column) and D2 (second column). Each marker represents a different configuration of the minimum fraction of hours an individual must be at work within the typical business hours ($f_{hours, W}$), and the minimum fraction of weekdays on which an individual must visit the potential work location ($f_{days, W}$). For dataset D1, we explore how results change with increasing sliding window sizes, while for dataset D2, we keep the same sliding window size (covering the entire period). d. Fraction of not detected work locations for dataset D1 (first column) and dataset D2 (second column). The legend is shared with panel c.
  • Figure 4: Home and work detection accuracy and fraction of not detected locations across demographics for dataset D1. a. Detected accuracy of home location detection across demographic groups. Dark orange illustrates the results for the maximum configuration, while light orange illustrates the minimum. Error bars indicate bootstrapped error estimates (applies to all panels). b. Fraction of not detected home locations (the configuration color coding is shared with panel a). c. Detected accuracy of work location detection across demographic groups. Dark blue illustrates the results for the maximum configuration, while light blue illustrates the minimum. d. Fraction of not detected work locations (the configuration color coding is shared with panel c).
  • Figure 5: Applications for home and work detection. a. Employment rate by province, as reported (left), measured using the configuration with the minimum user loss (middle), and measured with the configuration with the highest detected accuracy (right). b. Difference between the minimum (light teal) and maximum (dark teal) configurations by country, shown as the correlation between the reported values and the selected configuration (top) and as the relative error (bottom). c. Difference in commuting distance (km) between urban (dashed) and rural (dotted) areas by configuration. d. Difference in commuting distance (km) against the median and standard error of the population density bins (people/km$^2$).
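
The sliding-window, fraction-based scheme described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration and not the HoWDe library's API: the function names, the input layout (one list of 24 hourly location ids per date), and the night and work hour ranges are all assumptions made for the example. The arguments `c_hours`, `f_hours`, and `f_days` loosely mirror $C_{hours}$, $f_{hours, H}$ / $f_{hours, W}$, and $f_{days, W}$. As simplifications, night hours are taken from a single calendar day, fractions are computed over all hours in the range (not only observed ones), and the work window is a calendar window with weekends filtered out.

```python
from collections import Counter
from datetime import date, timedelta

# Assumed hour ranges; the captions do not state the exact values.
NIGHT_HOURS = [22, 23, 0, 1, 2, 3, 4, 5]
WORK_HOURS = list(range(9, 17))


def daily_fractions(hourly, hours, c_hours):
    """Fraction of the selected hours spent at each location on one day.

    `hourly` holds 24 location ids (None = no data). Returns None when
    fewer than a fraction `c_hours` of the selected hours have data
    (a stand-in for the temporal coverage filter).
    """
    observed = [hourly[h] for h in hours if hourly[h] is not None]
    if len(observed) < c_hours * len(hours):
        return None
    return {loc: n / len(hours) for loc, n in Counter(observed).items()}


def detect(days, t, hours, window, c_hours, f_hours,
           f_days=0.0, weekdays_only=False):
    """Top-ranked location for day `t`, or None if nothing qualifies."""
    half = window // 2
    candidates = [t + timedelta(days=k) for k in range(-half, half + 1)]
    if weekdays_only:  # work detection skips Saturdays and Sundays
        candidates = [d for d in candidates if d.weekday() < 5]
    profiles = [daily_fractions(days[d], hours, c_hours)
                for d in candidates if d in days]
    profiles = [p for p in profiles if p is not None]
    if not profiles:
        return None
    scores = {}
    for loc in {l for p in profiles for l in p}:
        # Average fraction of hours at `loc` across the window's days.
        mean_frac = sum(p.get(loc, 0.0) for p in profiles) / len(profiles)
        # Fraction of days on which `loc` was visited at least once.
        day_frac = sum(loc in p for p in profiles) / len(profiles)
        if mean_frac >= f_hours and day_frac >= f_days:
            scores[loc] = (mean_frac, day_frac)
    return max(scores, key=scores.get) if scores else None


def make_day(night_loc, work_loc):
    """Toy day: one location at night, one during work hours."""
    day = [None] * 24
    for h in NIGHT_HOURS:
        day[h] = night_loc
    for h in WORK_HOURS:
        day[h] = work_loc
    return day


days = {
    date(2024, 1, 3): make_day("A", "B"),  # Wednesday
    date(2024, 1, 4): make_day("A", "B"),  # Thursday
    date(2024, 1, 5): make_day("A", "C"),  # Friday
}
t = date(2024, 1, 4)
home = detect(days, t, NIGHT_HOURS, window=3, c_hours=0.5, f_hours=0.5)
work = detect(days, t, WORK_HOURS, window=3, c_hours=0.5, f_hours=0.5,
              f_days=0.5, weekdays_only=True)
```

In this toy input, location "A" fills every night hour in the window, and "B" covers two of the three weekdays during work hours, so they are the only candidates that pass the thresholds. Raising `f_hours` or `f_days` trades more undetected locations for stricter evidence, which is the kind of trade-off the Figure 3 configurations explore.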