GNN: Graph Neural Network and Large Language Model for Data Discovery
Thomas Hoang
TL;DR
The paper addresses data discovery for datasets with mixed numerical and textual attributes, where users struggle to specify explicit utility functions. It proposes GNN, a multimodal framework that combines PLOD, Graph Neural Networks, and Large Language Models to learn a unified utility $u(\boldsymbol{x}) = \sum_{j=1}^{m} \beta_{j,\text{num}} x_j^{\text{num}} + \sum_{k=1}^{n} \beta_{k,\text{text}} x_k^{\text{text}} + \epsilon$, then constructs a synthetic utility $u_{\text{syn}}$ and refines it to a real utility $u_{\text{real}}$ to identify optimal data subsets. Experimental results on Boston and Kaggle housing datasets show that GNN achieves higher precision and stability than baselines, with competitive runtimes, illustrating the benefit of integrating GNNs and LLMs for multimodal data discovery. The work demonstrates the practical potential of multimodal utility learning for analytics and suggests future work on runtime optimization, extension to additional data types, and integration with advanced storage for scalability.
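As a rough illustration of the unified utility above, the following minimal sketch evaluates $u(\boldsymbol{x})$ for a single record. It assumes that each textual attribute has already been mapped to a numeric score (for example by a language model), which the summary does not specify; the function name `linear_utility`, its parameters, and the sample values are hypothetical and not taken from the paper.

```python
from typing import Sequence


def linear_utility(
    x_num: Sequence[float],
    x_text: Sequence[float],
    beta_num: Sequence[float],
    beta_text: Sequence[float],
    eps: float = 0.0,
) -> float:
    """Evaluate u(x) = sum_j beta_num[j]*x_num[j] + sum_k beta_text[k]*x_text[k] + eps.

    Assumption: x_text holds numeric scores for the textual attributes;
    how those scores are produced is outside the scope of this sketch.
    """
    if len(x_num) != len(beta_num) or len(x_text) != len(beta_text):
        raise ValueError("each weight vector must match its feature vector in length")
    num_part = sum(b * x for b, x in zip(beta_num, x_num))
    text_part = sum(b * x for b, x in zip(beta_text, x_text))
    return num_part + text_part + eps


# Hypothetical record: two numerical attributes and one text-derived score.
u = linear_utility(x_num=[3.0, 1.5], x_text=[0.5], beta_num=[0.5, 2.0], beta_text=[1.0])
print(u)
```

Records can then be ranked by this score; in the framework described here the weights would be learned rather than supplied by the user.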
Abstract
Our algorithm, GNN: Graph Neural Network and Large Language Model for Data Discovery, inherits the benefits of \cite{hoang2024plod} (PLOD: Predictive Learning Optimal Data Discovery) and \cite{Hoang2024BODBO} (BOD: Blindly Optimal Data Discovery): it requires neither a predefined utility function nor human input for attribute ranking, which avoids a time-consuming interactive loop. Beyond these previous works, GNN leverages graph neural networks and large language models to interpret text-valued attributes that PLOD and BOD cannot handle, making outcome prediction more reliable. GNN can be seen as an extension of PLOD that captures the user's preferences over text values as well as numerical values, which makes it promising for data science and analytics applications.
