An experimental study of existing tools for outlier detection and cleaning in trajectories
Mariana M Garcez Duarte, Mahmoud Sakr
TL;DR
This paper addresses the detection and cleaning of outlier points within individual trajectories, evaluating ten open-source tools as they are offered to end users. It introduces a ground-truth construction method based on cross-sensor consistency (speed and bearing) to assess detection accuracy, and reports runtime and accuracy on four large multisensor datasets. The study surveys state-of-the-art algorithms across five method families and maps them to practical tools, enabling informed tool selection for trajectory preprocessing. Empirical results show that MoveTk delivers the most reliable balance of recall and precision with low runtime, while Scikit-learn and MEOS offer robust alternatives depending on the dataset. The work also contributes a scalable ground-truth framework and makes its datasets and code public to support reproducibility.
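To make the idea of cross-sensor consistency concrete, the following is a minimal sketch, not the authors' procedure. It assumes each point carries a timestamp `t` (seconds), a position (`lat`, `lon`), and a sensor-reported `speed` (m/s); these field names, the tolerance `tol_mps`, and the rule "flag an interior point when both adjacent segments disagree with the reported speed" are all illustrative assumptions. Bearing is omitted for brevity.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))


def segment_disagrees(p, q, tol_mps):
    """True if the position-derived speed on segment p->q differs from the
    mean sensor-reported speed of its endpoints by more than tol_mps."""
    dt = q["t"] - p["t"]
    if dt <= 0:
        return False  # no usable time difference: treat as no evidence
    derived = haversine_m(p["lat"], p["lon"], q["lat"], q["lon"]) / dt
    reported = (p["speed"] + q["speed"]) / 2
    return abs(derived - reported) > tol_mps


def label_speed_inconsistent(points, tol_mps=5.0):
    """Hypothetical labelling rule: an interior point is marked as an outlier
    when BOTH of its adjacent segments disagree with the speed sensor.
    Endpoints are never marked in this sketch."""
    labels = [False] * len(points)
    for i in range(1, len(points) - 1):
        labels[i] = (
            segment_disagrees(points[i - 1], points[i], tol_mps)
            and segment_disagrees(points[i], points[i + 1], tol_mps)
        )
    return labels
```

Requiring both adjacent segments to disagree avoids blaming the neighbours of a displaced point, since each neighbour still has one consistent segment.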
Abstract
Outlier detection and cleaning are essential steps in data preprocessing to ensure the integrity and validity of data analyses. This paper focuses on outlier points within individual trajectories, i.e., points that deviate significantly from the rest of a single trajectory. We experiment with ten open-source libraries to comprehensively evaluate the available tools, comparing their efficiency and accuracy in identifying and cleaning outliers. The experiment considers the libraries as they are offered to end users, so that the results reflect real-world use. We compare existing outlier detection libraries, introduce a method for establishing ground truth, and aim to guide users in choosing the most appropriate tool for their specific outlier detection needs. Furthermore, we survey the state-of-the-art algorithms for outlier detection and classify them into five types: statistic-based methods, sliding window algorithms, clustering-based methods, graph-based methods, and heuristic-based methods. Our research provides insights into the performance of these libraries and contributes to the development of data preprocessing and outlier detection methodologies.
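As an illustration of what a heuristic-based method from the classification above can look like, here is a minimal sketch of a maximum-speed filter. It is not taken from any of the evaluated libraries. The function name `speed_filter`, the planar `(t, x, y)` point layout in seconds and metres, and the threshold `max_speed_mps` are assumptions made for the example.

```python
import math


def speed_filter(points, max_speed_mps):
    """Drop points that would require an implausible speed to reach.

    points: list of (t, x, y) tuples, t in seconds, x and y in metres.
    Returns (kept_points, outlier_indices).
    Assumption: the first point is trusted and always kept.
    """
    if not points:
        return [], []
    kept = [points[0]]
    outliers = []
    for i in range(1, len(points)):
        t, x, y = points[i]
        t0, x0, y0 = kept[-1]  # compare against the last accepted point
        dt = t - t0
        if dt <= 0:
            outliers.append(i)  # non-increasing timestamp: treat as invalid
            continue
        speed = math.hypot(x - x0, y - y0) / dt
        if speed > max_speed_mps:
            outliers.append(i)
        else:
            kept.append(points[i])
    return kept, outliers


# Usage: a straight track at 10 m/s with one displaced point at index 2.
track = [(0, 0, 0), (1, 10, 0), (2, 20, 1000), (3, 30, 0), (4, 40, 0)]
cleaned, flagged = speed_filter(track, max_speed_mps=50)
```

Measuring speed from the last accepted point, rather than from the immediately preceding raw point, prevents a single spike from also discarding the valid point that follows it. The trade-off is that an erroneous first point would contaminate everything after it.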
