[Experiments & Analysis] Hash-Based vs. Sort-Based Group-By-Aggregate: A Focused Empirical Study [Extended Version]

Gaurav Vaghasiya, Shiva Jahangiri

TL;DR

This study empirically compares sort-based and hash-based Group-By-Aggregate (GBA) methods on Apache AsterixDB to map performance trade-offs across data sizes, group counts, and data types. The experiments reveal that sort-based GBA generally benefits large datasets or downstream operations requiring sorted input, whereas hash-based GBA performs best on small datasets or when the number of groups is limited, albeit with sensitivity to memory and hashing overhead. Data-type and string-length variations further modulate performance, with strings often increasing memory usage and sort complexity, favoring hash-based methods in some scenarios. The findings provide practical guidance for GBA optimization and motivate future adaptive or hybrid approaches that switch between hash- and sort-based strategies at runtime to maximize efficiency.

Abstract

Group-by-aggregate (GBA) queries are integral to data analysis, allowing users to group data by specific attributes and apply aggregate functions such as sum, average, and count. Database Management Systems (DBMSs) typically execute GBA queries using either sort- or hash-based methods, each with unique advantages and trade-offs. Sort-based approaches are efficient for large datasets but become computationally expensive due to record comparisons, especially in cases with a small number of groups. In contrast, hash-based approaches offer faster performance in general but require significant memory and can suffer from hash collisions when handling large numbers of groups or uneven data distributions. This paper presents a focused empirical study comparing these two approaches, analyzing their strengths and weaknesses across varying data sizes, datasets, and group counts using Apache AsterixDB. Our findings indicate that sort-based methods excel in scenarios with large datasets or when subsequent operations benefit from sorted data, whereas hash-based methods are advantageous for smaller datasets or scenarios with fewer groupings. Our results provide insights into the scenarios where each method excels, offering practical guidance for optimizing GBA query performance.
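The two execution strategies contrasted above can be illustrated with a minimal in-memory sketch. This is not AsterixDB's implementation: the function names `hash_group_sum` and `sort_group_sum` and the sample records are hypothetical, and spilling to disk, memory budgets, and parallelism are ignored entirely. The sketch only shows where each method spends its effort: hash table lookups in one case, record comparisons during sorting in the other.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def hash_group_sum(records):
    """Hash-based GBA (illustrative): a single pass that keeps one running
    aggregate per group in a hash table, so memory grows with group count."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)


def sort_group_sum(records):
    """Sort-based GBA (illustrative): sort by the grouping key, then
    aggregate each run of adjacent equal keys. Output is ordered by key."""
    ordered = sorted(records, key=itemgetter(0))
    return [
        (key, sum(value for _, value in run))
        for key, run in groupby(ordered, key=itemgetter(0))
    ]


# Hypothetical (group_key, value) records, equivalent to:
#   SELECT key, SUM(value) FROM records GROUP BY key
records = [("b", 2), ("a", 1), ("b", 3), ("c", 4), ("a", 5)]
hash_result = hash_group_sum(records)
sort_result = sort_group_sum(records)
```

Both functions compute the same aggregates. The sort-based variant additionally returns its groups in key order, which is the property that can benefit downstream operators requiring sorted input.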

Paper Structure

This paper contains 17 sections and 14 figures.

Figures (14)

  • Figure 1: AsterixDB's Architecture
  • Figure 2: Workflow of Sort-Based Group-By for GBA queries
  • Figure 3: Execution plan for GBA in AsterixDB
  • Figure 4: Workflow of Hash-Based Group-By for GBA queries
  • Figure 5: Expr. 1 - (a) TPC-H, (b) TPC-DS. *Q18 has 2 Group-By operators.
  • ...and 9 more figures