Semantic SQL -- Combining and optimizing semantic predicates in SQL
Akash Mittal, Anshul Bheemreddy, Huili Tao
TL;DR
This paper tackles the challenge of querying unstructured data with both semantic understanding and precise structured predicates. It introduces Semantic SQL (SSQL), which extends SQL with a SEMANTIC predicate and jointly optimizes semantic and structured predicates. ML model results and metadata are stored in a relational database, while CLIP embeddings are kept in a vector store managed by FAISS; similarity in the embedding space is measured by $d(a,b)$ after normalization. Human-in-the-loop feedback calibrates the similarity threshold that determines which semantic results are returned, so that queries can return complete result sets where semantic-only approaches fall short. Experiments on COCO show that semantic-only queries fail on count and spatial queries, while the joint approach improves correctness. The work also provides open-source tooling for reproducibility. Overall, it advances practical analytics over multimodal data by integrating semantic and structured querying with feedback-driven thresholding.
Abstract
In recent years, advances in Machine Learning (ML) have driven a surge in unstructured data analysis and prompted diverse approaches for handling images, text documents, and videos. Using ML models, analysts can extract meaningful information from unstructured data and store it in relational databases, where SQL queries support further analysis. At the same time, vector databases have emerged that embed unstructured data to answer top-k queries efficiently from textual input. This paper introduces Semantic SQL (SSQL), a novel framework that combines these two approaches by allowing semantic queries inside SQL statements. Our approach extends SQL with dedicated keywords for specifying semantic queries alongside predicates on ML model results and metadata. Our experimental results show that semantic queries alone fail to answer count and spatial queries in more than 60% of the cases. Our proposed method jointly optimizes queries that contain both semantic predicates and predicates on structured tables, such as those generated by ML models or other metadata. Further, to improve the query results, we incorporate human-in-the-loop feedback to determine the optimal similarity score threshold for returning results.
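To make the idea of combining a semantic predicate with structured predicates concrete, the following is a minimal sketch and not the authors' implementation. It assumes cosine similarity over normalized vectors, replaces CLIP and FAISS with a small hand-written dictionary of toy embeddings, and uses SQLite for the structured side. The table `detections`, the functions `joint_query` and `calibrate_threshold`, and the calibration rule (take the lowest score a reviewer marked as relevant) are all hypothetical names and choices made for illustration; the sketch also does not reproduce SSQL's actual SEMANTIC syntax.

```python
import math
import sqlite3

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity, computed on normalized vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

# Toy stand-in for a vector store: image_id -> embedding (hypothetical values).
EMBEDDINGS = {
    1: [0.9, 0.1, 0.0],
    2: [0.8, 0.2, 0.1],
    3: [0.0, 0.1, 0.9],
    4: [0.7, 0.3, 0.0],
}

def build_db():
    """Create an in-memory table of per-image object counts (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE detections (image_id INTEGER, label TEXT, n INTEGER)")
    conn.executemany(
        "INSERT INTO detections VALUES (?, ?, ?)",
        [(1, "dog", 2), (2, "dog", 1), (3, "dog", 2), (4, "cat", 2)],
    )
    return conn

def semantic_only(query_vec, threshold):
    """Return image ids whose embedding is similar enough to the query."""
    return sorted(i for i, e in EMBEDDINGS.items() if cosine(e, query_vec) >= threshold)

def joint_query(conn, query_vec, threshold, label, min_count):
    """Apply the structured predicate in SQL, then the semantic threshold."""
    rows = conn.execute(
        "SELECT image_id FROM detections WHERE label = ? AND n >= ? ORDER BY image_id",
        (label, min_count),
    ).fetchall()
    return [i for (i,) in rows if cosine(EMBEDDINGS[i], query_vec) >= threshold]

def calibrate_threshold(scores_with_feedback):
    """Pick the lowest similarity score that a reviewer marked relevant.

    Assumed rule for illustration; falls back to 1.0 when nothing is relevant.
    """
    relevant = [score for score, ok in scores_with_feedback if ok]
    return min(relevant) if relevant else 1.0

if __name__ == "__main__":
    conn = build_db()
    query_vec = [1.0, 0.0, 0.0]  # stand-in for an embedded text query
    print(semantic_only(query_vec, 0.8))
    print(joint_query(conn, query_vec, 0.8, "dog", 2))
```

In this toy setup the semantic-only filter returns every image that resembles the query, including ones with the wrong object count or label, whereas the joint query keeps only images that also satisfy the count predicate. This mirrors, in miniature, the abstract's point that count queries need the structured side.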
