Advancing Polyglot Big Data Processing using the Hadoop ecosystem

Antony Seabra, Sergio Lifschitz

TL;DR

The paper addresses the challenge of analyzing heterogeneous big data by advocating a polyglot processing approach within the Hadoop ecosystem. It surveys core Hadoop components (HDFS, YARN, Spark, Hive, HBase) and polystore concepts (Calcite, data virtualization) that enable polyglot persistence and processing across diverse data models. Through use cases in healthcare, stock markets, social networks, and smart cities, it demonstrates how multiple data stores and engines can be orchestrated to achieve scalable real-time and advanced analytics. The work highlights the practical impact of polyglot Hadoop on data lakes and enterprise analytics, and outlines future directions, including benchmarking across domains and developing mediator-based architectures to simplify multi-store querying.

Abstract

This article explores the use of the Hadoop ecosystem as a polyglot big data processing platform, focusing on the integration of diverse computation and storage technologies and their potential advantages in certain computational contexts. It examines the potential of this ecosystem as a unified platform, highlighting its architectural foundations and their complementary strengths in distributed storage, processing efficiency, and real-time analytics. The article explores potential use cases within domains such as Smart Cities and Social Networks, illustrating how the platform's diverse components can be orchestrated in a polyglot manner and how these fields can benefit from the ecosystem's capabilities. Finally, the article concludes by outlining directions for future research, including specialized architectural aspects of the ecosystem that could advance the polyglot paradigm.

Paper Structure

This paper contains 20 sections and 11 figures.

Figures (11)

  • Figure 1: Hadoop ecosystem
  • Figure 2: HDFS architecture
  • Figure 3: Distributed Processing with MapReduce
  • Figure 4: Apache Spark Resilient Distributed Datasets
  • Figure 5: Apache Hive internal and external tables
  • ...and 6 more figures