Energy Consumption of Dataframe Libraries for End-to-End Deep Learning Pipelines: A Comparative Analysis
Punit Kumar, Asif Imran, Tevfik Kosar
TL;DR
This study examines the energy implications of using Pandas, Polars, and Dask within end-to-end deep learning pipelines. It embeds each dataframe backend into representative ML and DL workloads and performs unified energy profiling, using perf stat for CPU metrics and pynvml for GPU metrics, across multiple datasets (Insurance, ML-1M, COCO) and models. The main contributions are the first end-to-end, CPU-and-GPU energy decomposition of these backends in training and inference, along with practical guidance: Polars generally offers better CPU-energy efficiency on larger workloads, Pandas remains competitive for small-to-moderate tasks, and Dask adds overhead unless the workload truly requires out-of-core or distributed execution. The results inform backend choice for energy-aware ML and support sustainable deployment decisions across diverse hardware settings.
Abstract
This paper presents a detailed comparative analysis of the performance of three major Python data manipulation libraries (Pandas, Polars, and Dask) when embedded within complete deep learning (DL) training and inference pipelines. The research bridges a gap in existing literature by studying how these libraries interact with substantial GPU workloads during critical phases such as data loading, preprocessing, and batch feeding. The authors measure key performance indicators, including runtime, memory usage, disk usage, and energy consumption (both CPU and GPU), across various machine learning models and datasets.
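To illustrate how CPU and GPU energy of the kind described here might be collected with perf stat and pynvml, the following is a minimal sketch and not the authors' actual harness. The function names, the 0.1 s sampling interval, and the `power/energy-pkg/` RAPL event are all assumptions made for this example; the paper does not state which perf events or sampling rate were used, and the RAPL event is only available on hardware and kernels that expose it. GPU power is polled in milliwatts through NVML and integrated into joules with the trapezoidal rule.

```python
import time
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, power in watts)


def integrate_energy_joules(samples: Sequence[Sample]) -> float:
    """Trapezoidal integration of (time, power) samples into energy in joules."""
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    return energy


def sample_gpu_power(duration_s: float,
                     interval_s: float = 0.1,
                     gpu_index: int = 0) -> List[Sample]:
    """Poll GPU power draw through NVML for roughly `duration_s` seconds.

    Hypothetical helper: requires an NVIDIA GPU and the pynvml package,
    which is imported lazily so the rest of the module works without it.
    """
    import pynvml

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        samples: List[Sample] = []
        start = time.monotonic()
        while True:
            elapsed = time.monotonic() - start
            # NVML reports power in milliwatts; convert to watts.
            watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
            samples.append((elapsed, watts))
            if elapsed >= duration_s:
                break
            time.sleep(interval_s)
        return samples
    finally:
        pynvml.nvmlShutdown()


def build_perf_command(workload: Sequence[str]) -> List[str]:
    """Build a perf stat command line that wraps `workload`.

    Assumption: the CPU package energy counter is exposed as the
    `power/energy-pkg/` event (Intel RAPL), read system-wide with -a.
    """
    return ["perf", "stat", "-a", "-e", "power/energy-pkg/", "--", *workload]
```

On a machine with a supported GPU, `integrate_energy_joules(sample_gpu_power(5.0))` would estimate the GPU energy drawn over about five seconds, and the list returned by `build_perf_command(["python", "train.py"])` could be passed to `subprocess.run` to wrap a hypothetical training script. Polling-based integration only approximates energy, and its accuracy depends on the sampling interval.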
