TDLight: A Framework for Incremental Light Curve Management and Smart Classification

Xinghang Yu, Ce Yu, Zeguang Shao, Bin Yang

TL;DR

The paper addresses the data management bottlenecks of time-domain astronomy, where expanding light-curve volumes and the need for timely analysis clash with offline, batch-processing pipelines. It presents TDLight, a unified framework that repurposes the industrial IoT database TDengine with a one-table-per-source storage model and HEALPix indexing, integrated with the LEAVES classifier for incremental, trigger-based classification. Key contributions include high ingestion throughput (up to 954,000 rows s^-1 archival and 541,000 rows s^-1 streaming), fast cone-search performance (~50–100 ms), and validated early classification accuracy (>85% at 50% data) plus a mechanism to flag high-value candidates. The work provides a Dockerized deployment and web interface to enable practical adoption, accelerating follow-up for time-critical events and informing the design of next-generation time-domain pipelines.

Abstract

With the exponential growth of time-domain surveys, the volume of light curves has increased rapidly. However, many survey projects, such as Gaia, still rely on offline batch-processing workflows in which data are calibrated, merged, and released only after an observing phase is completed. This latency delays scientific analysis and causes many high-value transient events to be buried in archival data, missing the window for timely follow-up. While existing alert brokers handle heterogeneous data streams, it remains difficult to deploy a unified framework that combines high-performance incremental storage with real-time classification on local infrastructure. To address this challenge, we propose TDLight, a scalable system that adapts the time-series database TDengine (a high-performance IoT database) for astronomical data using a one-table-per-source schema. This architecture supports high-throughput ingestion, achieving 954,000 rows s^-1 for archived data and 541,000 rows s^-1 for incremental streams, while Hierarchical Equal Area isoLatitude Pixelization (HEALPix) indexing enables efficient cone-search queries. Building on this storage layer, we integrate the pre-trained hierarchical Random Forest classifier from the LEAVES framework to construct an incremental classification pipeline. Using the LEAVES dataset, we simulate data accumulation and evaluate a trigger-based strategy that performs early classification at specific observational milestones. In addition, by monitoring the evolution of classification probabilities, the system identifies "high-value candidates": sources that show high early confidence but later undergo significant label shifts. TDLight is released as an open-source Dockerized environment, providing a deployable infrastructure for next-generation time-domain surveys.
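The trigger-based strategy and the high-value candidate rule described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the paper's implementation: the milestone fractions in `MILESTONES`, the `Snapshot` record, the `min_early_conf` threshold, and the simplification that a "significant label shift" means the latest top label differs from an earlier confident one. It also assumes the expected light-curve length is known, as it would be in a simulation over a fixed dataset.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

# Hypothetical milestones: fractions of the expected light curve at which
# a classification is triggered. The paper's actual values are not stated here.
MILESTONES = (0.3, 0.5, 0.7, 1.0)


@dataclass
class Snapshot:
    """Classifier output recorded at one milestone (illustrative structure)."""
    fraction: float            # fraction of the light curve observed so far
    probs: Dict[str, float]    # class name -> predicted probability

    @property
    def label(self) -> str:
        # Most probable class at this milestone.
        return max(self.probs, key=self.probs.get)

    @property
    def confidence(self) -> float:
        return self.probs[self.label]


def due_milestones(n_seen: int, n_expected: int, fired: Set[float]) -> List[float]:
    """Return milestones that have been reached but not yet triggered."""
    frac = n_seen / n_expected
    return [m for m in MILESTONES if m <= frac and m not in fired]


def is_high_value(history: List[Snapshot], min_early_conf: float = 0.8) -> bool:
    """Flag a source whose earlier confident label differs from its latest label.

    Assumed rule: some earlier snapshot had confidence >= min_early_conf
    and a top label different from the most recent one.
    """
    if len(history) < 2:
        return False
    latest = history[-1]
    return any(
        s.confidence >= min_early_conf and s.label != latest.label
        for s in history[:-1]
    )
```

For example, a source classified as `Mira` with probability 0.9 at 30% completeness and as `Rotational` at 100% would be flagged, whereas a source whose label stays `Mira` throughout would not. A production rule would probably also weigh the size of the probability change rather than the top label alone.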

Paper Structure

This paper contains 18 sections, 6 figures.

Figures (6)

  • Figure 1: Schematic representation of the TDLight system architecture. The framework consists of three layers: the Storage Layer (bottom) for data persistence; the Software Layer (middle) containing the core logic for ingestion, retrieval, and classification; and the User Interface Layer (top) for visualization and interaction. Arrows indicate the data flow and functional calls between modules.
  • Figure 2: Supertable schema in TDLight. Each child table represents a single astronomical object. Static metadata (Tags) including healpix_id, ra, dec, and object_class are indexed for fast filtering. Time-varying observations (Columns) like ts, mag, and flux are stored sequentially and sorted by timestamp.
  • Figure 3: Data ingestion performance benchmarks. Left panel: Archival mode throughput scaling with thread count when loading the full Gaia DR2 dataset (48.2 million records). The system achieves 954,000 rows s$^{-1}$ at 64 threads. Right panel: Streaming mode throughput measured during the ingestion of the 280 synthetic epoch catalogs. The dynamic SQL aggregation strategy sustains 541,000 rows s$^{-1}$ with sub-100 ms latency, demonstrating the system's capability to handle high-frequency alert streams.
  • Figure 4: Cone search performance comparison on the Gaia DR2 benchmark dataset. Blue curve (HEALPix-indexed): Query time remains nearly constant at $\sim$50--100 ms across all search radii, limited only by disk I/O for the relevant HEALPix pixels. Red curve (Full table scan): Query time scales linearly with search area. The indexing strategy provides 300--1000$\times$ speedup depending on cone radius, making real-time spatial queries feasible for massive catalogs.
  • Figure 5: Classification performance as a function of light curve completeness. The plot demonstrates that classification accuracy (F1-score) increases monotonically with the fraction of observed data points. Different variable classes exhibit distinct early-recognition thresholds: high-amplitude periodic variables (e.g., Mira) reach high accuracy with only 30% of data, while low-amplitude variables (e.g., rotational variables) require more complete light curves. The overall accuracy exceeds 89% at 50% data completeness, validating the feasibility of early classification in streaming scenarios.
  • ...and 1 more figure
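
The supertable layout in the Figure 2 caption (one child table per object, static tags, time-ordered columns) could be written in TDengine SQL roughly as follows. This is a sketch under stated assumptions, not the schema shipped with TDLight: the supertable name `lightcurves`, the `src_<id>` child-table naming, and all column and tag data types are hypothetical. Only the tag and column names come from the caption. The helpers merely build SQL strings, so the example runs without a database connection.

```python
from typing import Iterable, Tuple


def create_supertable_sql(stable: str = "lightcurves") -> str:
    """Build a CREATE STABLE statement mirroring the Figure 2 caption.

    Data types are assumptions. Identifiers may need backtick quoting if
    they collide with reserved words in a given TDengine version.
    """
    return (
        f"CREATE STABLE IF NOT EXISTS {stable} "
        "(ts TIMESTAMP, mag DOUBLE, flux DOUBLE) "
        "TAGS (healpix_id BIGINT, ra DOUBLE, dec DOUBLE, object_class NCHAR(32))"
    )


def insert_sql(
    stable: str,
    source_id: int,
    healpix_id: int,
    ra: float,
    dec: float,
    object_class: str,
    rows: Iterable[Tuple[int, float, float]],
) -> str:
    """Build an INSERT that creates the per-source child table if needed.

    Each row is (epoch milliseconds, mag, flux). String formatting is used
    for illustration only; real code should use parameter binding.
    """
    values = " ".join(f"({ts}, {mag}, {flux})" for ts, mag, flux in rows)
    return (
        f"INSERT INTO src_{source_id} USING {stable} "
        f"TAGS ({healpix_id}, {ra}, {dec}, '{object_class}') "
        f"VALUES {values}"
    )
```

Because `healpix_id` is a tag, a cone search can first restrict the query to child tables whose pixel overlaps the search region and only then read observations, which is consistent with the roughly constant query times reported in Figure 4.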