SemBench: A Benchmark for Semantic Query Processing Engines
Jiale Lao, Andreas Zimmerer, Olga Ovcharenko, Tianji Cong, Matthew Russo, Gerardo Vitagliano, Michael Cochez, Fatma Özcan, Gautam Gupta, Thibaud Hottelier, H. V. Jagadish, Kris Kissel, Sebastian Schelter, Andreas Kipf, Immanuel Trummer
TL;DR
SemBench provides a cost-aware benchmark for semantic query processing engines (SQPEs), which use LLMs to apply multimodal semantic operators on extended relational data. By evaluating five scenarios and 55 queries across text, image, and audio, it reveals how operator design, prompt engineering, and model choices shape latency, cost, and accuracy. The study benchmarks academic and industrial SQPEs (LOTUS, Palimpzest, ThalamusDB, BigQuery) and demonstrates substantial variation in performance, especially for complex joins and cross-modal tasks, while identifying practical optimization directions. The work establishes a platform with an online leaderboard to drive progress and highlights automated prompt strategies, operator fusion, and caching as future research directions.
Abstract
We present a benchmark targeting a novel class of systems: semantic query processing engines. These systems rely inherently on the generative and reasoning capabilities of state-of-the-art large language models (LLMs). They extend SQL with semantic operators, configured by natural language instructions, that are evaluated via LLMs and enable users to perform various operations on multimodal data. Our benchmark introduces diversity across three key dimensions: scenarios, modalities, and operators. Included are scenarios ranging from movie review analysis to medical question-answering. Within these scenarios, we cover different data modalities, including images, audio, and text. Finally, the queries involve a diverse set of operators, including semantic filters, joins, mappings, ranking, and classification operators. We evaluated our benchmark on three academic systems (LOTUS, Palimpzest, and ThalamusDB) and one industrial system, Google BigQuery. Although these results reflect a snapshot of systems under continuous development, our study offers crucial insights into their current strengths and weaknesses, illuminating promising directions for future research.
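To make the notion of a semantic operator concrete, the following is a minimal sketch of a semantic filter: a row is kept when a judge, given a natural language instruction, decides that the row satisfies it. Everything here is hypothetical and for illustration only: the names `sem_filter`, `CountingJudge`, and `keyword_judge` are not taken from any of the benchmarked systems, and the judge is a toy keyword check standing in for an LLM call. The call counter is an assumption meant to illustrate why per-row model invocations matter for cost.

```python
from typing import Callable

Row = dict[str, str]
# Hypothetical judge signature: (instruction, row) -> keep the row or not.
Judge = Callable[[str, Row], bool]


class CountingJudge:
    """Wraps a judge and counts invocations as a rough proxy for LLM cost."""

    def __init__(self, judge: Judge) -> None:
        self.judge = judge
        self.calls = 0

    def __call__(self, instruction: str, row: Row) -> bool:
        self.calls += 1
        return self.judge(instruction, row)


def sem_filter(rows: list[Row], instruction: str, judge: Judge) -> list[Row]:
    """Keep the rows for which the judge accepts the instruction."""
    return [row for row in rows if judge(instruction, row)]


def keyword_judge(instruction: str, row: Row) -> bool:
    # Toy stand-in for an LLM: ignores the instruction and looks for a keyword.
    return "great" in row["review"].lower()


reviews = [
    {"id": "1", "review": "A great film with a moving score."},
    {"id": "2", "review": "Dull and far too long."},
]

judge = CountingJudge(keyword_judge)
positive = sem_filter(reviews, "The review is positive.", judge)
positive_ids = [row["id"] for row in positive]
```

In this sketch the filter issues one judge call per input row, so `judge.calls` grows linearly with the table size. Under the same assumption, a naive semantic join would need one call per pair of rows.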
