A Novel Approach to Translate Structural Aggregation Queries to MapReduce Code

Ahmed M. Abdelmoniem, Sameh Abdulah, Walid Atwa

TL;DR

This work addresses the productivity gap in MapReduce development by introducing a translator that converts queries written in AQL, the high-level array query language of SciDB, into optimized MapReduce jobs. It focuses on three classes of structural aggregation (grid, sliding, and hierarchical/circular) and supports conventional and predefined aggregates, plus user-defined functions via a lightweight API. The system uses a two-stage pipeline: a query is first parsed into a query object, and MapReduce code is then generated from templates. Performance optimizations include array subsetting and in-mapper aggregation. In the reported experiments, the generated code outperforms hand-written code by up to 10.84x and scales on multi-node clusters. The results indicate that array-oriented queries can be translated into MapReduce while retaining high performance and extensibility for domain-specific aggregations.

Abstract

Data management applications are growing and require more attention, especially in the "big data" era. Thus, supporting such applications with novel and efficient algorithms that achieve higher performance is critical. Array database management systems are one way to support these applications, since they handle data represented in n-dimensional structures. For instance, systems like SciDB and RasDaMan can be powerful tools for achieving the required performance on large-scale problems with multidimensional data. Like their relational counterparts, these systems expose dedicated array query languages as the user interface. As a popular programming model, MapReduce enables large-scale data analysis, facilitates query processing, and is used as a DB engine. Nevertheless, one major obstacle is the low productivity of developing MapReduce applications. Unlike high-level declarative languages such as SQL, MapReduce jobs are written in a low-level descriptive language, often requiring massive programming effort and complicated debugging. This work presents a system that translates array queries expressed in the Array Query Language (AQL) of SciDB into MapReduce jobs. We focus on translating structural aggregations that are specific to arrays, including circular, grid, hierarchical, and sliding aggregations. Unlike traditional aggregations in relational DBs, these structural aggregations are designed explicitly for array manipulation. Thus, our work can be considered an array-view counterpart of existing SQL-to-MapReduce translators like HiveQL and YSmart. Our translator supports structural aggregations over arrays to cover a variety of array manipulations. It also supports user-defined aggregation functions with minimal user effort. We show that our translator can generate optimized MapReduce code that outperforms short handwritten code by up to 10.84x.
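To make the idea of grid aggregation with in-mapper aggregation more concrete, the following is a minimal in-memory sketch in Python. It is not the code the translator generates, and it does not use Hadoop. The function names (`grid_mapper`, `grid_reducer`, `run_grid_avg`), the `((row, col), value)` cell representation, and the choice of an average as the aggregate are all assumptions made for illustration. Each mapper keeps a local table of partial sums per grid tile and emits one pair per tile rather than one pair per cell, which is the general in-mapper aggregation pattern.

```python
from collections import defaultdict


def grid_mapper(cells, tile_shape):
    """Map one input split, combining values locally per grid tile.

    cells: iterable of ((row, col), value) pairs (hypothetical layout).
    tile_shape: (tile_height, tile_width) of each non-overlapping grid tile.
    Returns one (tile_key, (partial_sum, partial_count)) pair per tile seen.
    """
    tile_h, tile_w = tile_shape
    partial = defaultdict(lambda: [0.0, 0])  # tile key -> [sum, count]
    for (row, col), value in cells:
        key = (row // tile_h, col // tile_w)  # tile that owns this cell
        acc = partial[key]
        acc[0] += value
        acc[1] += 1
    # In-mapper aggregation: emit per-tile partials, not per-cell values.
    return [(key, (s, c)) for key, (s, c) in partial.items()]


def grid_reducer(key, partials):
    """Merge the partial (sum, count) pairs of one tile into its average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return key, total / count


def run_grid_avg(splits, tile_shape):
    """Simulate map, shuffle, and reduce over a list of input splits."""
    shuffled = defaultdict(list)
    for split in splits:
        for key, partial in grid_mapper(split, tile_shape):
            shuffled[key].append(partial)  # group partials by tile key
    return dict(grid_reducer(key, parts) for key, parts in shuffled.items())
```

For example, calling `run_grid_avg(splits, (2, 2))` on a 4x4 array divided into row-wise splits returns a dictionary that maps each of the four 2x2 tile indices to the average of its cells. Carrying `(sum, count)` pairs rather than local averages keeps the result correct when a tile's cells are spread across several splits.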

Paper Structure

This paper contains 21 sections, 15 figures, 3 tables, and 4 algorithms.

Figures (15)

  • Figure 1: Grid Aggregation (GA)
  • Figure 2: Sliding Aggregation (SA)
  • Figure 3: Hierarchical Aggregation (HA)
  • Figure 4: Circular Aggregation (CA)
  • Figure 5: The AQL to MapReduce Translator Components.
  • ...and 10 more figures