Energy Consumption of Dataframe Libraries for End-to-End Deep Learning Pipelines: A Comparative Analysis

Punit Kumar, Asif Imran, Tevfik Kosar

TL;DR

This study examines the energy implications of using Pandas, Polars, and Dask within end-to-end deep learning pipelines. It embeds each dataframe backend into representative ML and DL workloads and performs unified energy profiling, using perf stat for CPU metrics and pynvml for GPU metrics, across multiple datasets (Insurance, ML-1M, COCO) and models. The main contributions are the first end-to-end decomposition of CPU and GPU energy for these backends in training and inference, along with practical guidance: Polars generally offers better CPU-energy efficiency on larger workloads, Pandas remains competitive for small-to-moderate tasks, and Dask adds overhead unless the workload genuinely requires out-of-core or distributed processing. The results help practitioners choose a backend for energy-aware ML and support sustainable deployment decisions across diverse hardware settings.
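As a rough illustration of the CPU side of such profiling, the sketch below wraps a command in `perf stat` and reads the energy counters from its CSV output. The RAPL event names (`power/energy-pkg/`, `power/energy-ram/`), the system-wide `-a` flag, and the assumed CSV field order (value, unit, event, ...) are assumptions for illustration; the summary does not state which events or flags the authors used, and `measure_cpu_energy` is a hypothetical helper name.

```python
import subprocess
from typing import Dict, List


def parse_perf_energy(csv_text: str) -> Dict[str, float]:
    """Collect joule counters from `perf stat -x,` style output.

    Assumes each counter line starts with: value, unit, event name.
    Lines that do not report Joules (or have no numeric value) are skipped.
    """
    energy: Dict[str, float] = {}
    for line in csv_text.splitlines():
        fields = line.strip().split(",")
        if len(fields) < 3 or fields[1] != "Joules":
            continue
        try:
            energy[fields[2]] = float(fields[0])
        except ValueError:
            continue  # e.g. "<not supported>" in the value column
    return energy


def measure_cpu_energy(cmd: List[str]) -> Dict[str, float]:
    """Run `cmd` under perf stat and return energy per event in joules.

    Hypothetical helper: needs Linux, perf, RAPL support, and usually
    elevated privileges. The event list is an assumption.
    """
    perf_cmd = [
        "perf", "stat", "-a", "-x", ",",
        "-e", "power/energy-pkg/",
        "-e", "power/energy-ram/",
        "--",
    ] + cmd
    result = subprocess.run(perf_cmd, capture_output=True, text=True, check=True)
    # perf stat writes its counter summary to stderr by default.
    return parse_perf_energy(result.stderr)


# Parsing a captured sample, so the example runs without perf installed.
sample_output = (
    "12.50,Joules,power/energy-pkg/,1000123456,100.00,,\n"
    "3.25,Joules,power/energy-ram/,1000123456,100.00,,\n"
    "<not supported>,,power/energy-gpu/,0,100.00,,\n"
)
cpu_energy = parse_perf_energy(sample_output)
```

On a suitable machine, a call such as `measure_cpu_energy(["python", "train.py"])` (with `train.py` standing in for any pipeline script) would return the same kind of dictionary for a real run.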

Abstract

This paper presents a detailed comparative analysis of three major Python data manipulation libraries (Pandas, Polars, and Dask) when embedded within complete deep learning (DL) training and inference pipelines. The research fills a gap in the existing literature by studying how these libraries interact with substantial GPU workloads during critical phases such as data loading, preprocessing, and batch feeding. The authors measured key performance indicators, including runtime, memory usage, disk usage, and energy consumption (both CPU and GPU), across various machine learning models and datasets.
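To make the idea of per-phase measurement concrete, here is a minimal standard-library sketch that records wall-clock time and peak Python heap usage for one named pipeline stage. It is not the authors' profiler: `profile_stage` is a hypothetical helper, the toy rows are invented for illustration, and `tracemalloc` only sees allocations made through Python's allocator, so native buffers held by a dataframe library may be undercounted.

```python
import time
import tracemalloc
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def profile_stage(name: str, results: Dict[str, Dict[str, float]]) -> Iterator[None]:
    """Record elapsed seconds and peak traced bytes for one stage.

    Stages must not be nested: tracemalloc is started and stopped
    around each stage, so an inner stage would end the outer trace.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results[name] = {"seconds": elapsed, "peak_bytes": float(peak)}


stage_results: Dict[str, Dict[str, float]] = {}

# Toy stand-in for a preprocessing phase (illustrative data only).
with profile_stage("preprocess", stage_results):
    rows = [{"age": i % 90, "charges": i * 1.5} for i in range(10_000)]
    scaled = [row["charges"] / 15_000 for row in rows]
```

The same wrapper could surround the loading, preprocessing, and batch-feeding phases of each backend in turn, giving one comparable record per stage.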

Paper Structure

This paper contains 17 sections, 5 figures, 3 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Flowchart to highlight the experimental process of this study.
  • Figure 2: Overview of our energy profiling framework. The pipeline begins with loading datasets (text, image, or audio) using different tabular backends (Polars, Pandas, or Dask), followed by data augmentation and model training/inference. At each stage, CPU energy (measured via perf stat), GPU energy (via pynvml), and memory usage (RAM/VRAM) are tracked using our integrated energy profiler. This setup allows us to quantify the energy and memory efficiency of each dataloader under a consistent workload.
  • Figure 3: Energy consumption on ML-1M dataset.
  • Figure 4: Energy consumption on Wikitext dataset.
  • Figure 5: Energy consumption on Insurance dataset.
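
The GPU side of the profiler described in Figure 2 can be approximated by polling power draw and integrating it over time. The sketch below assumes a fixed polling interval, trapezoidal integration, and GPU index 0; none of these details are given in the captions, and `sample_gpu_power` and `integrate_energy_joules` are hypothetical helper names. The only pynvml calls used are `nvmlInit`, `nvmlDeviceGetHandleByIndex`, `nvmlDeviceGetPowerUsage` (which reports milliwatts), and `nvmlShutdown`.

```python
import time
from typing import Iterable, List, Tuple


def integrate_energy_joules(samples: Iterable[Tuple[float, float]]) -> float:
    """Trapezoidal integration of (timestamp_s, power_w) samples into joules."""
    points = sorted(samples)
    total = 0.0
    for (t0, p0), (t1, p1) in zip(points, points[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


def sample_gpu_power(
    duration_s: float, interval_s: float = 0.1, device_index: int = 0
) -> List[Tuple[float, float]]:
    """Poll GPU power for roughly `duration_s` seconds.

    Hypothetical helper: needs an NVIDIA GPU and the pynvml package.
    The import is lazy so the integration helper works without either.
    """
    import pynvml

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        samples: List[Tuple[float, float]] = []
        start = time.monotonic()
        while True:
            now = time.monotonic() - start
            # nvmlDeviceGetPowerUsage returns milliwatts; convert to watts.
            watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
            samples.append((now, watts))
            if now >= duration_s:
                break
            time.sleep(interval_s)
        return samples
    finally:
        pynvml.nvmlShutdown()


# Synthetic samples: 100 W for one second, then a ramp to 200 W.
gpu_energy = integrate_energy_joules([(0.0, 100.0), (1.0, 100.0), (2.0, 200.0)])
```

In a real run, the polling loop would typically execute in a background thread while the training or inference step runs, and the resulting samples would be passed to `integrate_energy_joules`. A shorter polling interval captures brief power spikes more faithfully at the cost of extra CPU work.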