A Survey of Pipeline Tools for Data Engineering

Anthony Mbata; Yaji Sripada; Mingjun Zhong

A Survey of Pipeline Tools for Data Engineering

Anthony Mbata, Yaji Sripada, Mingjun Zhong

TL;DR

The paper surveys data engineering pipeline tools across ETL/ELT, data integration, orchestration, and ML deployment categories, combining a methodological review with hands-on IDEAL dataset case studies. It demonstrates how tools address data wrangling, quality, and feature engineering tasks through practical pipelines using Spark, SSIS, Airflow, and TFX. A key finding is that open-source orchestrators like Apache Airflow provide scalable, extensible workflows, while Spark and SSIS offer strong ingestion and transformation capabilities, and TFX enables end-to-end ML pipelines. The work guides practitioners in selecting and composing pipelines, emphasizing reproducibility and integration through shared code and demonstrations.

Abstract

Currently, a variety of pipeline tools are available for use in data engineering. Data scientists can use these tools to resolve data wrangling issues associated with data and accomplish some data engineering tasks from data ingestion through data preparation to utilization as input for machine learning (ML). Some of these tools have essential built-in components or can be combined with other tools to perform desired data engineering operations. While some tools are wholly or partly commercial, several open-source tools are available to perform expert-level data engineering tasks. This survey examines the broad categories and examples of pipeline tools based on their design and data engineering intentions. These categories are Extract Transform Load/Extract Load Transform (ETL/ELT), pipelines for Data Integration, Ingestion, and Transformation, Data Pipeline Orchestration and Workflow Management, and Machine Learning Pipelines. The survey also provides a broad outline of the utilization with examples within these broad groups and finally, a discussion is presented with case studies indicating the usage of pipeline tools for data engineering. The studies present some first-user application experiences with sample data, some complexities of the applied pipeline, and a summary note of approaches to using these tools to prepare data for machine learning.

A Survey of Pipeline Tools for Data Engineering

TL;DR

Abstract

Paper Structure (18 sections, 7 figures, 4 tables)

This paper contains 18 sections, 7 figures, 4 tables.

Introduction
Data Engineering Pipelines
Methodology
Clusters of Data Pipelines
ETL/ELT Data Pipeline
Integration, Ingestion, and Transformation
Orchestration and Workflow Management
Machine Learning and Model Deployment
Case Study with IDEAL Household Energy Dataset
Evaluating some framework tools
Pipeline with Apache Spark
Pipeline with Microsoft SSIS
Pipeline with Apache Airflow
Pipeline with TFX (TensorFlow Extended)
Discussion of Trends in Features
...and 3 more sections

Figures (7)

Figure 1: An ideal pipeline for data engineering (adapted from nazabal2020data). The schematic points to data sourcing from multiple sources whether sourced in batches or streaming data. The data processing aspects cover the activities that occur between receiving the data and outputting the same for further use. Data processing aims to resolve all the listed data wrangling issues, however, in practice, some are resolved rather than all. Once completed the data is stored for further use or immediately consumed for visualisation and machine learning.
Figure 2: An indication of some data organisation, data integration and feature engineering capabilities of Apache Spark. In A the initial reading of the household sensor csv data with spark.read.csv shows the absence of feature names. In BStructType and StructField were used to create dataframe schema and feature names, while the data types where assigned with TimestampType() and IntegerType(). The withColumn() method was used to create a new feature csv_filename populated with the source csv filenames obtained from the use of regexp_extract and input_file_name(). Finally, these dataframes were integrated by .union.
Figure 3: Apache Spark pipeline implemented to ingest and process the IDEAL Household Energy Dataset. The import of libraries and initiation of the Spark session prepared the pipeline before downloading the data. StructField and StrucType prepare the structure of the data in preparation for ingestion with spark.read.csv. Afterward, data transformation was performed using a combination of withColumn, regexp_extract, and input_file_name. The resulting dataframes were integrated with .union to consolidate the data. The processes marked with asterisks help to view the data for better understanding without affecting the structure. Data cleaning was performed with dropDuplicates, columns renamed using withColumnRenamed, and column data altered using withColumn, when and otherwise. Some columns were split and parts were extracted. Dataframes were joined with .join, while the transformation of the data was with StringIndexer, fit, and transform to apply the index. Datetime parts were obtained before using appropriate JDBC connection and table names to write or read the dataframe into the PostgreSQL database. To write the dataframe with .write.format, .mode,.option and .save methods and later read the table using .read.format, .option and .load methods. As machine learning input, the dataframe was transformed using VectorAssembler to input the feature values into one single-named feature.
Figure 4: Microsoft SSIS Pipeline shows the processing of the metadata of the IDEAL Household Energy dataset. Connection to the "appliance" and "home" data was established with SSIS Source Assistant. Both data were sorted and later integrated with the Merge Join component of SSIS. Some data transformation and standardization were achieved with SSIS Data Conversion before selecting some columns of interest with Copy Column, these were loaded into an SQL database with the SQL Server Destination component.
Figure 5: Airflow Pipeline from data ingestion, processing, and storage. The IDEAL Household Energy Dataset sensor data for different homes was already in a PostgreSQL database. The extract_combine_tables_from_pgsql task comprises Python codes to read different sensor data from homeid 308, slice the data across a time interval, create features, assign datatypes, integrate the data, and push the data to a processing location. The split_df_for_processing task continues with the preprocessing of the data by performing some feature engineering. The split_compound_columns task processes the data by splitting, renaming, and reordering features. The combine_df_send_to_postgres_json task combines the processed data and stores the data in a local repository as CSV and JSON, and saves the final dataset in a PostgreSQL database.
...and 2 more figures

A Survey of Pipeline Tools for Data Engineering

TL;DR

Abstract

A Survey of Pipeline Tools for Data Engineering

Authors

TL;DR

Abstract

Table of Contents

Figures (7)