Text embedding models can be great data engineers

Iman Kazemian, Paritosh Ramanan, Murat Yildirim

TL;DR

This work introduces ADEPT, a framework that automates data engineering for time-series classification: it converts raw time-series representations into text embeddings and refines them with a variational information bottleneck (VIB) before passing them to a Transformer-based classifier. Because it operates directly on serialized, text-like representations, ADEPT is robust to missing data and irregular timestamps, while the VIB improves representation quality and generalization. Across science, healthcare, finance, and IoT domains, ADEPT v2.0 consistently matches or surpasses domain-specific baselines, often with substantial gains, demonstrating that pretrained text embeddings can serve as effective, low-overhead data engineers for time-series analytics. The approach offers a scalable path to turnkey, high-performance predictive models in diverse data environments.

Abstract

Data engineering pipelines are essential, albeit costly, components of predictive analytics frameworks, requiring significant engineering time and domain expertise for tasks such as data ingestion, preprocessing, feature extraction, and feature engineering. In this paper, we propose ADEPT, an automated data engineering pipeline via text embeddings. At the core of the ADEPT framework is a simple yet powerful idea: the entropy of embeddings corresponding to a textually dense, raw-format representation of time series can be intuitively viewed as equivalent (or in many cases superior) to that of numerically dense vector representations obtained by data engineering pipelines. Consequently, ADEPT uses a two-step approach that (i) leverages text embeddings to represent the diverse data sources, and (ii) constructs a variational information bottleneck criterion to mitigate entropy variance in text embeddings of time series data. ADEPT provides an end-to-end automated implementation of predictive models that offers superior predictive performance despite issues such as missing data, ill-formed records, improper or corrupted data formats, and irregular timestamps. Through exhaustive experiments, we show that ADEPT outperforms the best existing benchmarks on a diverse set of datasets from large-scale applications across healthcare, finance, science, and the industrial internet of things. Our results show that ADEPT can potentially leapfrog many conventional data pipeline steps, thereby paving the way for efficient and scalable automation pathways for diverse data science applications.
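
The two-step idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not ADEPT's actual implementation: the serialization format in `serialize_series`, the function names, and the choice of a diagonal-Gaussian encoder with a standard-normal prior (the common VIB formulation) are all hypothetical. The text-embedding call itself is omitted; in a real pipeline, `mu` and `log_var` would come from learned layers applied to the embedding of the serialized text.

```python
import math
import random


def serialize_series(timestamps, values):
    """Render a raw series as plain text (hypothetical format).

    Irregular timestamps are written as-is, and missing readings (None)
    are kept as the literal token 'NA' instead of being imputed.
    """
    parts = []
    for t, v in zip(timestamps, values):
        parts.append(f"{t}: {'NA' if v is None else v}")
    return "; ".join(parts)


def vib_sample(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def vib_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def vib_loss(cross_entropy, mu, log_var, beta=1e-3):
    """Classification loss plus the beta-weighted compression penalty."""
    return cross_entropy + beta * vib_kl(mu, log_var)


if __name__ == "__main__":
    text = serialize_series([0, 1.5, 4], [2.0, None, 3.1])
    print(text)  # this string would be passed to a text embedding model

    rng = random.Random(0)
    mu, log_var = [0.2, -0.1], [-1.0, -2.0]  # stand-ins for encoder outputs
    z = vib_sample(mu, log_var, rng)         # bottlenecked representation
    print(len(z), vib_loss(0.7, mu, log_var))
```

The weight `beta` (an assumed hyperparameter name) trades classification accuracy against compression: larger values push the latent distribution toward the prior and discard more of the embedding's variability.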

Paper Structure

This paper contains 32 sections, 12 equations, 11 figures, 6 tables, and 2 algorithms.

Figures (11)

  • Figure 1: Comparison of the model and benchmark.
  • Figure 2: Illustration of the ADEPT v2.0 framework.
  • Figure 3: t-SNE projection of segment embeddings (ADEPT v2.0) across different applications.
  • Figure 4: 3D t-SNE projection of 1536-dim segment embeddings from the PLAsTiCC-2018 LSST dataset, colored by transient class. Left: raw text embeddings; Right: embeddings after VIB filtering.
  • Figure 5: Normalized confusion matrix for the IB-filtered pipeline on the PLAsTiCC-2018 LSST dataset. Rows correspond to true classes and columns to predicted classes; cell intensity indicates per-class recall.
  • ...and 6 more figures