Transformer-based Ranking Approaches for Keyword Queries over Relational Databases
Paulo Martins, Altigran da Silva, Johny Moreira, Edleno de Moura
TL;DR
This work tackles the problem of ranking Query Matches (QMs) and Candidate Joining Networks (CJNs) in Relational Keyword Search (R-KwS) systems, extending Lathe with transformer-based ranking to handle schema-aware and ambiguous queries. By linearizing relational structures into text-like sequences and applying sentence-transformer models, the authors develop neural QM and CJN rankers, enhanced through data augmentation and task-specific fine-tuning. Experimental results on IMDb, MONDIAL, and Yelp show that fine-tuned transformer models substantially outperform the prior Bayesian baseline in MRR and recall, with multivalue aggregation further boosting CJN ranking. The proposed approach yields more context-aware, relevant results and demonstrates the viability of neural ranking for keyword-driven exploration of multi-table relational databases.
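The summary above reports results in MRR and recall. As a reminder of what these metrics measure, here is a minimal sketch, assuming each query has exactly one correct candidate (for instance, a single gold CJN). The function names and the toy data are illustrative and are not taken from the paper.

```python
from typing import Hashable, Sequence


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Hashable) -> float:
    """Return 1/position of the relevant item (1-based), or 0.0 if it is absent."""
    for position, item in enumerate(ranked, start=1):
        if item == relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]], relevants: Sequence[Hashable]
) -> float:
    """Average the reciprocal rank over all queries."""
    scores = [reciprocal_rank(r, g) for r, g in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0


def recall_at_k(
    rankings: Sequence[Sequence[Hashable]], relevants: Sequence[Hashable], k: int
) -> float:
    """Fraction of queries whose relevant item appears in the top k.

    Assumption: one relevant item per query, so per-query recall is 0 or 1.
    """
    if not rankings:
        return 0.0
    hits = sum(1 for r, g in zip(rankings, relevants) if g in r[:k])
    return hits / len(rankings)


# Toy data (hypothetical): three queries, the correct candidate is always "a".
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
relevants = ["a", "a", "a"]
mrr = mean_reciprocal_rank(rankings, relevants)  # (1 + 1/2 + 1/3) / 3
r_at_2 = recall_at_k(rankings, relevants, k=2)   # 2 of 3 queries hit in top 2
```

Under the single-answer assumption, recall at k is simply the share of queries for which the correct candidate is ranked within the first k positions.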
Abstract
Relational Keyword Search (R-KwS) systems enable non-expert users to explore and retrieve information from relational databases without requiring schema knowledge or query-language proficiency. Although numerous R-KwS methods have been proposed, most still focus on queries that refer only to attribute values, or primarily address performance, and so provide limited support for queries that reference schema elements. We previously introduced Lathe, a system that accommodates schema-based keyword queries and employs an eager evaluation strategy to filter out spurious Candidate Joining Networks (CJNs). However, Lathe still struggles to rank CJNs accurately when queries are ambiguous. In this work, we propose a new transformer-based ranking approach that provides a more context-aware evaluation of Query Matches (QMs) and CJNs. Our solution introduces a linearization process that converts relational structures into textual sequences suitable for transformer models. It also includes a data augmentation strategy aimed at handling diverse and ambiguous queries more effectively. Experimental results comparing our transformer-based ranking to Lathe's original Bayesian method show significant improvements in recall and R@k, demonstrating the effectiveness of our neural approach in delivering the most relevant query results.
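To make the linearization idea concrete, the following is a rough sketch of how a candidate joining network might be flattened into a text sequence and then ranked against a keyword query. Everything here is an assumption for illustration: the `KeywordMatch` structure, the `|` and `[JOIN]` separators, and the table and attribute names are hypothetical, and the token-overlap scorer is only a stand-in for the sentence-transformer similarity the paper uses. The actual serialization format is not specified in this summary.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class KeywordMatch:
    """Hypothetical node of a CJN: a table plus the keywords mapped to it."""
    table: str
    # attribute name -> query keywords matched against that attribute's values
    value_keywords: Dict[str, List[str]] = field(default_factory=dict)
    # attribute names that were matched by a schema keyword
    schema_keywords: List[str] = field(default_factory=list)


def linearize_match(match: KeywordMatch) -> str:
    """Flatten one node into text (the separator choice is an assumption)."""
    parts = [match.table]
    parts += [f"{attr} (schema)" for attr in match.schema_keywords]
    parts += [f"{attr}: {' '.join(words)}" for attr, words in match.value_keywords.items()]
    return " | ".join(parts)


def linearize_cjn(matches: List[KeywordMatch]) -> str:
    """Flatten a path-shaped CJN into a single sequence."""
    return " [JOIN] ".join(linearize_match(m) for m in matches)


def token_overlap(query: str, text: str) -> float:
    """Placeholder scorer (Jaccard over word tokens), NOT the paper's model."""
    q = set(re.findall(r"\w+", query.lower()))
    t = set(re.findall(r"\w+", text.lower()))
    return len(q & t) / len(q | t) if q | t else 0.0


def rank_candidates(
    query: str,
    candidates: List[List[KeywordMatch]],
    score: Callable[[str, str], float] = token_overlap,
) -> List[Tuple[str, float]]:
    """Linearize each candidate, score it against the query, sort descending."""
    scored = [(text, score(query, text)) for text in map(linearize_cjn, candidates)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidates for the keyword query "godfather brando".
cjn_a = [
    KeywordMatch("movie", {"title": ["godfather"]}),
    KeywordMatch("casting"),
    KeywordMatch("person", {"name": ["brando"]}),
]
cjn_b = [KeywordMatch("person", {"name": ["godfather", "brando"]})]
ranked = rank_candidates("godfather brando", [cjn_a, cjn_b])
```

In a neural setting, the `score` argument would be replaced by a function that embeds the query and the linearized candidate with a fine-tuned sentence-transformer model and compares the two embeddings. The lexical scorer is used here only so that the sketch runs without external dependencies, and it is not expected to reproduce the ranking quality reported above.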
