DuckDB on xNVMe

Marius Ottosen, Magnus Keinicke Parlo, Philippe Bonnet

TL;DR

This paper investigates direct NVMe SSD access for DuckDB by replacing the POSIX file I/O with xNVMe-based I/O to explore vertical integration with storage hardware. It compares several asynchronous I/O designs, including one-queue, multi-queue, and thread-owned queues, and demonstrates that thread-owned NVMe queues with passthrough offer the best performance improvements for large-scale scans. The results illustrate both the potential and the challenges of co-designing DuckDB's storage manager with NVMe devices, highlighting race-condition risks and the need for careful synchronization. The work points to promising directions for broader evaluation, hybrid storage stacks, and future DuckDB-SSD co-design to exploit raw device capabilities while preserving correctness guarantees.

Abstract

DuckDB is designed for portability. It is also designed to run anywhere, and possibly in contexts where it can be specialized for performance, e.g., as a cloud service or on a smart device. In this paper, we consider the way DuckDB interacts with local storage. Our long-term research question is whether and how SSDs could be co-designed with DuckDB. As a first step towards vertical integration of DuckDB and programmable SSDs, we consider whether and how DuckDB can access NVMe SSDs directly. By default, DuckDB relies on the POSIX file interface. In contrast, we rely on the xNVMe library and explore how it can be leveraged in DuckDB. We leverage the block-based nature of the DuckDB buffer manager to bypass the synchronous POSIX I/O interface, the file system, and the block manager. Instead, we directly issue asynchronous I/Os against the SSD logical block address space. Our preliminary experimental study compares different ways to manage asynchronous I/Os atop xNVMe. The speed-up we observe over the DuckDB baseline is significant, even for the simplest scan query over a TPC-H table. As expected, the speed-up increases with the scale factor, and the Linux NVMe passthru improves performance. Future work includes a more thorough experimental study, a flexible solution that combines raw NVMe access with the legacy POSIX file interface, as well as the co-design of DuckDB and SSDs.

Paper Structure

This paper contains 21 sections, 16 figures, 1 table.

Figures (16)

  • Figure 1: Example of deciding between either libaio or io_uring as the underlying I/O Interface for xNVMe.
  • Figure 2: A cleaned up UML Class Diagram of the DuckDB design/structure, only including the relevant parts to our project. Due to the complexity, we have left out the inner definitions of each class.
  • Figure 3: A high-level view of how DuckDB Blocks are managed as a single Database File, and finally stored somewhere on disk. In DuckDB, header blocks are 4096 B and data blocks are 256 KB. The logical block size on the SSD is 512 B, and the maximum data transfer size that we use for our mapping is 128 KB.
  • Figure 4: Queries are divided into tasks for the threads to execute concurrently. They are enqueued via a task scheduler, and they are picked up by the threads by popping from the queue in a FIFO manner. A query can consist of multiple operators. In that case, it is the specific operators that are split into tasks. This step is not shown in this figure.
  • Figure 5: Illustration of our second design for modifying DuckDB, cutting out the File Layer from Figure 3.
  • ...and 11 more figures
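
The sizes quoted in the Figure 3 caption imply a simple arithmetic mapping from a DuckDB data block to a range of logical block addresses (LBAs) on the SSD. The following is a minimal sketch of that arithmetic, not the authors' implementation. It assumes that the database starts at LBA 0 of the device and that three 4096 B header blocks precede the data blocks; the names `HEADER_BLOCKS`, `NvmeCommand`, and `data_block_commands` are hypothetical and introduced only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

LBA_SIZE = 512                    # logical block size on the SSD (Figure 3)
HEADER_BLOCK_SIZE = 4096          # DuckDB header block size (Figure 3)
DATA_BLOCK_SIZE = 256 * 1024      # DuckDB data block size (Figure 3)
MAX_TRANSFER_SIZE = 128 * 1024    # maximum data transfer size per command (Figure 3)
HEADER_BLOCKS = 3                 # ASSUMPTION: header blocks stored before the data blocks


@dataclass(frozen=True)
class NvmeCommand:
    """Hypothetical description of one read or write command."""
    slba: int  # starting logical block address
    nlb: int   # number of logical blocks, as a plain count
               # (the NLB field of a real NVMe command is zero-based)


def data_block_commands(block_id: int) -> list[NvmeCommand]:
    """Split one DuckDB data block into commands of at most MAX_TRANSFER_SIZE."""
    if block_id < 0:
        raise ValueError("block_id must be non-negative")
    # ASSUMPTION: the database occupies the device starting at byte offset 0.
    byte_offset = HEADER_BLOCKS * HEADER_BLOCK_SIZE + block_id * DATA_BLOCK_SIZE
    commands = []
    remaining = DATA_BLOCK_SIZE
    while remaining > 0:
        chunk = min(remaining, MAX_TRANSFER_SIZE)
        commands.append(NvmeCommand(slba=byte_offset // LBA_SIZE,
                                    nlb=chunk // LBA_SIZE))
        byte_offset += chunk
        remaining -= chunk
    return commands
```

Under these assumptions, each 256 KB data block becomes two 128 KB commands of 256 logical blocks each, so a scan over many blocks yields a stream of independent commands that could be submitted asynchronously.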