What About the Data? A Mapping Study on Data Engineering for AI Systems

Petra Heck

TL;DR

This paper addresses the problem that AI systems depend on data, yet data engineering for AI has received limited attention. It uses a mapping study to analyze 25 peer-reviewed papers (2019–2023) on AI data engineering, categorizing them by life-cycle phases, proposed technical solutions, architectures, and lessons learned. The findings show a strong focus on data pipelines for training and production, with emerging but limited enterprise-wide data architectures and DataOps concepts. The work provides practitioners with a consolidated view of existing solutions and identifies research gaps, emphasizing the need for integrated, enterprise-scale data engineering frameworks and open-source tooling to advance AI engineering beyond model-centric approaches.

Abstract

AI systems cannot exist without data. Now that AI models (data science and AI) have matured and are readily available to apply in practice, most organizations struggle with the data infrastructure to do so. There is a growing need for data engineers who know how to prepare data for AI systems or who can set up enterprise-wide data architectures for analytical projects. But until now, the data engineering part of AI engineering has not received much attention, in favor of discussing the modeling part. In this paper we aim to change this by performing a mapping study on data engineering for AI systems, i.e., AI data engineering. We found 25 relevant papers published between January 2019 and June 2023 that explain AI data engineering activities. We identify which life cycle phases are covered, which technical solutions or architectures are proposed, and which lessons learned are presented. We end with an overall discussion of the papers, with implications for practitioners and researchers. This paper creates an overview of the body of knowledge on data engineering for AI. This overview is useful for practitioners to identify solutions and best practices, as well as for researchers to identify gaps.

Paper Structure (39 sections, 4 figures, 5 tables)