Table of Contents
Fetching ...

Using LSDB to enable large-scale catalog distribution, cross-matching, and analytics

Neven Caplar, Wilson Beebe, Doug Branton, Sandro Campos, Andrew Connolly, Melissa DeLucchi, Derek Jones, Mario Juric, Jeremy Kubica, Konstantin Malanchev, Rachel Mandelbaum, Sean McGuire

TL;DR

The paper tackles the challenge of Rubin Observatory's unprecedented data volumes by introducing HATS, a healpix-based hierarchical tiling scheme for scalable, partitioned catalog storage, and LSDB, a Dask-enabled, Pandas-like analysis framework that supports spatial queries, crossmatching, and time-series analysis. HATS uses Parquet storage, metadata harmonization, and a 64-bit global index to enable efficient parallel processing of large sky catalogs, with incremental updates and compatibility enhancements. LSDB demonstrates end-user analytics on datasets like ZTF and Pan-STARRS, including time-domain pipelines built around nested-pandas representations that compactly store light curves and enable large-scale analyses. The work is augmented by active collaboration with major data centers and emphasis on cloud deployment and IVOA standardization, aiming to provide bulk access to Rubin, Euclid, and Roman data and to broaden community adoption through lsdb.io.

Abstract

The Vera C. Rubin Observatory will generate an unprecedented volume of data, including approximately 60 petabytes of raw data and around 30 trillion observed sources, posing a significant challenge for large-scale and end-user scientific analysis. As part of the LINCC Frameworks Project we are addressing these challenges with the development of the HATS (Hierarchical Adaptive Tiling Scheme) format and analysis package LSDB. HATS partitions data adaptively using a hierarchical tiling system to balance the file sizes, enabling efficient parallel analysis. Recent updates include improved metadata consistency, support for incremental updates, and enhanced compatibility with evolving datasets. LSDB complements HATS by providing a scalable, user-friendly interface for large catalog analysis, integrating spatial queries, crossmatching, and time-series tools while utilizing Dask for parallelization. We have successfully demonstrated the use of these tools with datasets such as ZTF and Pan-STARRS data releases on both cluster and cloud environments. We are deeply involved in several ongoing collaborations to ensure alignment with community needs, with future plans for IVOA standardization and support for upcoming Rubin, Euclid and Roman data. We provide our code and materials at lsdb.io.

Using LSDB to enable large-scale catalog distribution, cross-matching, and analytics

TL;DR

The paper tackles the challenge of Rubin Observatory's unprecedented data volumes by introducing HATS, a healpix-based hierarchical tiling scheme for scalable, partitioned catalog storage, and LSDB, a Dask-enabled, Pandas-like analysis framework that supports spatial queries, crossmatching, and time-series analysis. HATS uses Parquet storage, metadata harmonization, and a 64-bit global index to enable efficient parallel processing of large sky catalogs, with incremental updates and compatibility enhancements. LSDB demonstrates end-user analytics on datasets like ZTF and Pan-STARRS, including time-domain pipelines built around nested-pandas representations that compactly store light curves and enable large-scale analyses. The work is augmented by active collaboration with major data centers and emphasis on cloud deployment and IVOA standardization, aiming to provide bulk access to Rubin, Euclid, and Roman data and to broaden community adoption through lsdb.io.

Abstract

The Vera C. Rubin Observatory will generate an unprecedented volume of data, including approximately 60 petabytes of raw data and around 30 trillion observed sources, posing a significant challenge for large-scale and end-user scientific analysis. As part of the LINCC Frameworks Project we are addressing these challenges with the development of the HATS (Hierarchical Adaptive Tiling Scheme) format and analysis package LSDB. HATS partitions data adaptively using a hierarchical tiling system to balance the file sizes, enabling efficient parallel analysis. Recent updates include improved metadata consistency, support for incremental updates, and enhanced compatibility with evolving datasets. LSDB complements HATS by providing a scalable, user-friendly interface for large catalog analysis, integrating spatial queries, crossmatching, and time-series tools while utilizing Dask for parallelization. We have successfully demonstrated the use of these tools with datasets such as ZTF and Pan-STARRS data releases on both cluster and cloud environments. We are deeply involved in several ongoing collaborations to ensure alignment with community needs, with future plans for IVOA standardization and support for upcoming Rubin, Euclid and Roman data. We provide our code and materials at lsdb.io.
Paper Structure (5 sections, 2 figures)

This paper contains 5 sections, 2 figures.

Figures (2)

  • Figure 1: Partitioning of the ZTF data release 14 in the HATS scheme. Left panel: Partitioning of the object table. Right panel: Partitioning of the source table. The legend shows the healpix order of each pixel. In this Figure, "source" corresponds to each detection, while "object" is determined after associating nearby sources that likely correspond to the same astrophysical object. As such, there are more sources than objects in the catalog, necessitating deeper (higher value of healpix order) partitioning to get a similar size of all final pixels in memory.
  • Figure 2: Targeted API for the LSDB code. This example shows a hypothetical scientific analysis of selecting starts from the GAIA survey with high proper motion, cross-matched to ZTF data release. We then extract lightcurves, classify them and choose the likely RR Lyrae variable stars, before representing their distribution on the sky. LSDB aims to make such whole-sky analysis achievable and easy for science end-users.