IGDA: Interactive Graph Discovery through Large Language Model Agents
Alex Havrilla, David Alvarez-Melis, Nicolo Fusi
TL;DR
IGDA is an LLM-based framework for interactive graph discovery that relies on semantic metadata rather than numerical data. It interleaves uncertainty-driven selection of edge experiments with local updates to neighboring, unselected edges, using a budget of $I$ edge experiments over $R$ rounds to minimize $d(\hat{G}_R, G^*)$ and maximize $F1(G^*, \hat{G}_R)$. Across eight real-world graphs, IGDA often outperforms all baselines, including a state-of-the-art numerical method for interactive graph discovery, and a series of ablations dissects the contribution of each pipeline component. A memorization test on a new (as of July 2024) causal graph on protein transcription factors finds strong performance in a setting where the graph cannot have been memorized from the LLM's training data. IGDA is thus a useful complement to existing numerical causal discovery techniques, drawing on semantic metadata and interactive feedback.
Abstract
Large language models ($\textbf{LLMs}$) have emerged as a powerful method for discovery. Instead of utilizing numerical data, LLMs utilize associated variable $\textit{semantic metadata}$ to predict variable relationships. Simultaneously, LLMs demonstrate impressive abilities to act as black-box optimizers when given an objective $f$ and a sequence of trials. We study LLMs at the intersection of these two capabilities by applying them to the task of $\textit{interactive graph discovery}$: given a ground truth graph $G^*$ capturing variable relationships and a budget of $I$ edge experiments over $R$ rounds, minimize the distance between the predicted graph $\hat{G}_R$ and $G^*$ at the end of the $R$-th round. To solve this task we propose $\textbf{IGDA}$, an LLM-based pipeline incorporating two key components: (1) an LLM uncertainty-driven method for edge experiment selection, and (2) a local graph update strategy that uses binary feedback from experiments to improve predictions for unselected neighboring edges. Experiments on eight different real-world graphs show that our approach often outperforms all baselines, including a state-of-the-art numerical method for interactive graph discovery. Further, we conduct a rigorous series of ablations dissecting the impact of each pipeline component. Finally, to assess the impact of memorization, we apply our interactive graph discovery strategy to a complex, new (as of July 2024) causal graph on protein transcription factors, finding strong performance in a setting where memorization is impossible. Overall, our results show IGDA to be a powerful method for graph discovery, complementary to existing numerically driven approaches.
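The interactive loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several pieces are assumptions made for illustration only: the LLM is replaced by a plain dictionary of edge confidences, uncertainty is measured as closeness of a confidence to 0.5, the budget is read as $I$ experiments per round, the distance is a Hamming distance over directed edge sets, and the local update is a toy rule (`reverse_rule`) standing in for the LLM-based neighbor update. All function names are hypothetical.

```python
# Toy sketch of an interactive graph discovery loop.
# ASSUMPTIONS (not from the paper): confidences come from a dict instead of
# an LLM, uncertainty = |p - 0.5|, the budget applies per round, and the
# local update is a simple hand-written rule.

def predicted_graph(conf, threshold=0.5):
    """Edges whose confidence exceeds the threshold."""
    return {e for e, p in conf.items() if p > threshold}

def hamming_distance(pred_edges, true_edges):
    """Number of directed edges on which the two edge sets disagree."""
    return len(pred_edges ^ true_edges)

def select_uncertain(conf, tested, budget):
    """Pick up to `budget` untested edges with confidence closest to 0.5."""
    candidates = [e for e in conf if e not in tested]
    candidates.sort(key=lambda e: (abs(conf[e] - 0.5), e))
    return candidates[:budget]

def neighbors(edge, conf, tested):
    """Untested edges sharing at least one endpoint with `edge`."""
    endpoints = set(edge)
    return [e for e in conf
            if e not in tested and e != edge and endpoints & set(e)]

def interactive_discovery(conf, oracle, rounds, budget, local_update):
    """Run `rounds` rounds of experiments; return (predicted edges, tested edges)."""
    conf = dict(conf)  # do not mutate the caller's confidences
    tested = set()
    for _ in range(rounds):
        batch = select_uncertain(conf, tested, budget)
        # Edge experiments: binary feedback from the oracle fixes the confidence.
        for edge in batch:
            conf[edge] = 1.0 if oracle(edge) else 0.0
            tested.add(edge)
        # Local update: revise untested neighbors of each tested edge.
        for edge in batch:
            exists = conf[edge] == 1.0
            for nb in neighbors(edge, conf, tested):
                conf[nb] = local_update(nb, conf[nb], edge, exists)
    return predicted_graph(conf), tested

def reverse_rule(nb, p, tested_edge, exists):
    """Toy stand-in for the LLM update: a confirmed edge rules out its reverse."""
    return 0.0 if exists and nb == tested_edge[::-1] else p

# Hypothetical three-variable example.
true_edges = {("A", "B"), ("B", "C")}
initial_conf = {
    ("A", "B"): 0.6, ("B", "A"): 0.3,
    ("B", "C"): 0.45, ("C", "B"): 0.2,
    ("A", "C"): 0.75, ("C", "A"): 0.1,
}

def oracle(edge):
    """Ground-truth experiment: does this edge exist in the true graph?"""
    return edge in true_edges

before = hamming_distance(predicted_graph(initial_conf), true_edges)
pred, tested = interactive_discovery(
    initial_conf, oracle, rounds=3, budget=1, local_update=reverse_rule
)
after = hamming_distance(pred, true_edges)
print(before, after)
```

In this toy run the experiments are spent on the edges whose confidences lie nearest 0.5, and each confirmed edge also lowers the confidence of its untested reverse. In the pipeline described above, both the confidences and the neighbor updates would come from LLM prompts rather than from fixed numbers and a hand-written rule.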
