Using LSDB to enable large-scale catalog distribution, cross-matching, and analytics
Neven Caplar, Wilson Beebe, Doug Branton, Sandro Campos, Andrew Connolly, Melissa DeLucchi, Derek Jones, Mario Juric, Jeremy Kubica, Konstantin Malanchev, Rachel Mandelbaum, Sean McGuire
TL;DR
The paper tackles the challenge of Rubin Observatory's unprecedented data volumes by introducing HATS, a healpix-based hierarchical tiling scheme for scalable, partitioned catalog storage, and LSDB, a Dask-enabled, Pandas-like analysis framework that supports spatial queries, crossmatching, and time-series analysis. HATS uses Parquet storage, metadata harmonization, and a 64-bit global index to enable efficient parallel processing of large sky catalogs, with incremental updates and compatibility enhancements. LSDB demonstrates end-user analytics on datasets like ZTF and Pan-STARRS, including time-domain pipelines built around nested-pandas representations that compactly store light curves and enable large-scale analyses. The work is augmented by active collaboration with major data centers and emphasis on cloud deployment and IVOA standardization, aiming to provide bulk access to Rubin, Euclid, and Roman data and to broaden community adoption through lsdb.io.
Abstract
The Vera C. Rubin Observatory will generate an unprecedented volume of data, including approximately 60 petabytes of raw data and around 30 trillion observed sources, posing a significant challenge for large-scale and end-user scientific analysis. As part of the LINCC Frameworks Project we are addressing these challenges with the development of the HATS (Hierarchical Adaptive Tiling Scheme) format and analysis package LSDB. HATS partitions data adaptively using a hierarchical tiling system to balance the file sizes, enabling efficient parallel analysis. Recent updates include improved metadata consistency, support for incremental updates, and enhanced compatibility with evolving datasets. LSDB complements HATS by providing a scalable, user-friendly interface for large catalog analysis, integrating spatial queries, crossmatching, and time-series tools while utilizing Dask for parallelization. We have successfully demonstrated the use of these tools with datasets such as ZTF and Pan-STARRS data releases on both cluster and cloud environments. We are deeply involved in several ongoing collaborations to ensure alignment with community needs, with future plans for IVOA standardization and support for upcoming Rubin, Euclid and Roman data. We provide our code and materials at lsdb.io.
