GNN: Graph Neural Network and Large Language Model for Data Discovery

Thomas Hoang

TL;DR

The paper addresses data discovery for datasets with mixed numerical and textual attributes, where users struggle to specify explicit utility functions. It proposes GNN, a multimodal framework that combines PLOD, Graph Neural Networks, and Large Language Models to learn a unified utility $u(\boldsymbol{x}) = \sum_{j=1}^{m} \beta_{j,\text{num}} x_j^{\text{num}} + \sum_{k=1}^{n} \beta_{k,\text{text}} x_k^{\text{text}} + \epsilon$, then constructs a synthetic utility $u_{\text{syn}}$ and refines it to a real utility $u_{\text{real}}$ to identify optimal data subsets. Experimental results on Boston and Kaggle housing datasets show that GNN achieves higher precision and stability than baselines, with competitive runtimes, illustrating the benefit of integrating GNNs and LLMs for multimodal data discovery. The work demonstrates the practical potential of multimodal utility learning for analytics and suggests future work on runtime optimization, extension to additional data types, and integration with advanced storage for scalability.
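As a rough illustration of the linear utility form above, the following sketch fits the weights $\beta$ by ordinary least squares and ranks tuples by their utility score. It is not the paper's pipeline: the least-squares fit stands in for the GNN and LLM components, the textual feature is assumed to be already encoded as a numeric score (for example by some embedding step), and the names `fit_weights`, `utility`, `top_k`, and all toy values are hypothetical.

```python
import numpy as np

def utility(x_num, x_text, beta_num, beta_text):
    """Deterministic part of u(x): weighted sum of numeric and text features."""
    return float(np.dot(beta_num, x_num) + np.dot(beta_text, x_text))

def fit_weights(X_num, X_text, u_observed):
    """Estimate (beta_num, beta_text) by least squares on stacked features."""
    X = np.hstack([X_num, X_text])
    beta, *_ = np.linalg.lstsq(X, u_observed, rcond=None)
    m = X_num.shape[1]  # number of numeric attributes
    return beta[:m], beta[m:]

def top_k(X_num, X_text, beta_num, beta_text, k):
    """Indices of the k tuples with the highest utility, best first."""
    scores = X_num @ beta_num + X_text @ beta_text
    return np.argsort(-scores)[:k]

# Toy data (assumed): 4 tuples, 2 numeric attributes, 1 encoded text attribute.
X_num = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
X_text = np.array([[0.5], [0.1], [0.9], [0.3]])

# Assumed "true" weights used to generate noise-free utilities (epsilon = 0).
true_beta_num = np.array([2.0, -1.0])
true_beta_text = np.array([3.0])
u_observed = X_num @ true_beta_num + X_text @ true_beta_text

beta_num_hat, beta_text_hat = fit_weights(X_num, X_text, u_observed)
best = top_k(X_num, X_text, beta_num_hat, beta_text_hat, k=2)
```

With noise-free utilities and linearly independent feature columns, the fit recovers the generating weights, so the ranking matches the one induced by the assumed true utility. In a real setting the noise term $\epsilon$ and the quality of the text encoding would both affect the recovered weights.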

Abstract

Our algorithm, GNN (Graph Neural Network and Large Language Model for Data Discovery), inherits the benefits of PLOD (Predictive Learning Optimal Data Discovery) \cite{hoang2024plod} and BOD (Blindly Optimal Data Discovery) \cite{Hoang2024BODBO}: it requires neither a predefined utility function nor human input for attribute ranking, which avoids a time-consuming interactive loop. Beyond these previous works, GNN leverages graph neural networks and large language models to interpret text-valued attributes that PLOD and MOD cannot handle, making outcome prediction more reliable. GNN can be seen as an extension of PLOD that captures the user's preferences over both numerical and textual values, which makes it promising for data science and analytics.

Paper Structure (18 sections, 7 equations, 4 figures, 7 tables, 1 algorithm)

Introduction
Related Studies
Problem Definition
Model Representation
Goal
Numerical Data Processing
Textual Data Processing
Synthetic Utility Function
Real Utility Function
Final goal
Experimental Analysis
Runtime Comparison
Precision Comparison Across Algorithms
Precision Comparison Between PLOD and GNN Models
Precision and Stability Comparison Using Boston Housing Data \cite{nair2021boston}
...and 3 more sections

Figures (4)

Figure 1: Flowchart of GNN: Graph Neural Networks and Large Language Models for Data Discovery
Figure 2: Runtime comparison with changes in number of tuples.
Figure 3: Precision comparison with changes in number of tuples.
Figure 4: Precision comparison with changes in number of tuples.
