ByteHouse: A Cloud-Native OLAP Engine with Incremental Computation and Multi-Modal Retrieval

Yuxing Han, Yu Lin, Yifeng Dong, Xuanhe Zhou, Xindong Peng, Xinhui Tian, Zhiyuan You, Yingzhong Guo, Xi Chen, Weiping Qu, Tao Meng, Dayue Gao, Haoyu Wang, Liuxi Wei, Huanchen Zhang, Fan Wu

TL;DR

ByteHouse is a cloud-native data warehouse designed for real-time multimodal analytics on a shared-storage architecture. It integrates a unified storage stack with a self-describing Sniffer format, CrossCache caching, and a NexusFS compute-side filesystem, enabling low-latency point lookups and scalable scans. The execution engine supports analytic, batch, and incremental processing, augmented by a Cascades-based optimizer with history-based and AI-driven optimization, and by multi-modal indexing with hybrid search operators such as RANK_FUSION. Empirical results on public benchmarks and ByteDance workloads show substantial latency and throughput gains, positioning ByteHouse as a scalable platform for next-generation intelligent data services.

Abstract

With the rapid rise of intelligent data services, modern enterprises increasingly require efficient, multimodal, and cost-effective data analytics infrastructures. However, in ByteDance's production environments, existing systems fall short due to limitations such as I/O-inefficient multimodal storage, inflexible query optimization (e.g., failing to optimize multimodal access patterns), and performance degradation caused by resource disaggregation (e.g., loss of data locality in remote storage). To address these challenges, we introduce ByteHouse (https://bytehouse.cloud), a cloud-native data warehouse designed for real-time multimodal data analytics. The storage layer integrates a unified table engine that provides a two-tier logical abstraction and a physically consistent layout, an SSD-backed cluster-scale cache (CrossCache) that supports shared caching across compute nodes, and a virtual file system (NexusFS) that enables efficient local access on compute nodes. The compute layer supports analytical, batch, and incremental execution modes, with tailored optimizations for hybrid queries (e.g., runtime filtering over tiered vector indexes). The control layer coordinates global metadata and transactions, and features an effective optimizer enhanced by historical execution traces and AI-assisted plan selection. Evaluations on internal and standard workloads show that ByteHouse achieves significant efficiency improvements over existing systems.
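
The TL;DR names RANK_FUSION as a hybrid search operator but this page does not define it. As a rough illustration only, the sketch below assumes the operator behaves like reciprocal rank fusion (RRF), a common way to merge a vector-search ranking with a keyword-search ranking. The function name, the constant `k=60`, and the sample document ids are all hypothetical and are not taken from ByteHouse.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Assumption: this is generic RRF, not ByteHouse's actual operator.
    Each document scores sum(1 / (k + rank)) over the lists it appears
    in, with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a vector index and a text index.
vector_hits = ["d3", "d1", "d2"]
keyword_hits = ["d1", "d4", "d3"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Documents that rank well in both lists (here `d1` and `d3`) rise above documents that appear in only one, which is the behavior a hybrid search operator typically aims for.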

Paper Structure (35 sections, 4 equations, 10 figures)

Figures (10)

  • Figure 1: ByteHouse provides real-time multimodal data analytics for over 400 external services across ByteDance Cloud.
  • Figure 2: ByteHouse Architecture.
  • Figure 3: Sniffer File Format.
  • Figure 4: Adaptive Optimization by History Executions.
  • Figure 5: Example Hybrid Data Search in ByteHouse.
  • ...and 5 more figures