PASemiQA: Plan-Assisted Agent for Question Answering on Semi-Structured Data with Text and Relational Information
Hansi Yang, Qi Zhang, Wei Jiang, Jianguo Li
TL;DR
PASemiQA tackles QA on semi-structured data by combining text and relational information through a two-stage, plan-guided approach. A planning module generates informative node sets $\mathcal{V}_q$ and relation paths $\{z_i\}$, which guide an LLM-based Graph Traversing Agent as it extracts evidence and produces answers. The learning signal aligns plan generation with ground-truth paths via a KL divergence between $Q(z|q,a^*,\mathcal{G})$ and $P_\theta(z|q)$, using instruction-tuned LLMs for path generation. Across the STaRK datasets (Amazon, MAG, PrimeKG), PASemiQA delivers state-of-the-art Hit@1 scores with competitive latency, demonstrating improved accuracy and reliability for QA on semi-structured data.
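To make the KL-based learning signal concrete, the following is a hedged sketch rather than the paper's stated derivation. It assumes the divergence is taken as $D_{\mathrm{KL}}(Q \,\|\, P_\theta)$ and that $Q(z|q,a^*,\mathcal{G})$ is approximated by a uniform distribution over a set $\mathcal{Z}^*$ of relation paths in $\mathcal{G}$ that lead to the ground-truth answer $a^*$. Both the direction of the divergence and the symbol $\mathcal{Z}^*$ are assumptions introduced here for illustration.

```latex
\begin{aligned}
D_{\mathrm{KL}}\big(Q(z|q,a^*,\mathcal{G}) \,\|\, P_\theta(z|q)\big)
  &= \sum_{z \in \mathcal{Z}^*} \frac{1}{|\mathcal{Z}^*|}
     \Big[ \log \frac{1}{|\mathcal{Z}^*|} - \log P_\theta(z|q) \Big] \\
  &= -\frac{1}{|\mathcal{Z}^*|} \sum_{z \in \mathcal{Z}^*} \log P_\theta(z|q)
     \;-\; \log |\mathcal{Z}^*| .
\end{aligned}
```

Under these assumptions the term $\log |\mathcal{Z}^*|$ does not depend on $\theta$, so minimizing the divergence amounts to maximizing the average log-likelihood that the planner assigns to the valid paths, which is the form an instruction-tuning loss would typically take.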
Abstract
Large language models (LLMs) have shown impressive abilities in answering questions across various domains, but they often hallucinate on questions that require professional and up-to-date knowledge. To address this limitation, retrieval-augmented generation (RAG) techniques have been proposed, which retrieve relevant information from external sources to ground the model's responses. However, existing RAG methods typically focus on a single type of external data, such as vectorized text databases or knowledge graphs, and cannot adequately handle real-world questions on semi-structured data containing both text and relational information. To bridge this gap, we introduce PASemiQA, a novel approach that jointly leverages text and relational information in semi-structured data to answer questions. PASemiQA first generates a plan that identifies the text and relational information in the semi-structured data relevant to the question, and then uses an LLM agent to traverse the data and extract the necessary information. Our empirical results demonstrate the effectiveness of PASemiQA across different semi-structured datasets from various domains, showcasing its potential to improve the accuracy and reliability of question answering systems on semi-structured data.
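The plan-then-traverse idea described above can be illustrated with a small toy sketch. This is not the authors' implementation: the `Node` class, the `follow_path` and `collect_evidence` helpers, the example graph, and the hard-coded plan are all hypothetical, and the LLM planner and LLM agent are replaced by a fixed plan and a deterministic walk. It only shows how a plan made of start nodes and a relation path can select which node texts are gathered as evidence.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class Node:
    """A semi-structured node: free text plus typed edges (hypothetical schema)."""
    text: str
    edges: Dict[str, List[str]] = field(default_factory=dict)  # relation -> neighbor ids


def follow_path(graph: Dict[str, Node], start_nodes: Iterable[str],
                relation_path: List[str]) -> Set[str]:
    """Walk the relation path hop by hop, starting from the planned node set."""
    frontier = set(start_nodes)
    for relation in relation_path:
        frontier = {
            neighbor
            for node_id in frontier
            for neighbor in graph[node_id].edges.get(relation, [])
        }
    return frontier


def collect_evidence(graph: Dict[str, Node], node_ids: Iterable[str]) -> List[str]:
    """Gather the text attached to the reached nodes, in a stable order."""
    return [graph[node_id].text for node_id in sorted(node_ids)]


# Toy graph (made-up data): papers link to authors, authors link to institutions.
graph = {
    "paper_1": Node("Paper on graph retrieval.", {"written_by": ["author_1"]}),
    "paper_2": Node("Paper on protein folding.", {"written_by": ["author_2"]}),
    "author_1": Node("Author A, works on retrieval.", {"affiliated_with": ["inst_1"]}),
    "author_2": Node("Author B, works on biology.", {"affiliated_with": ["inst_2"]}),
    "inst_1": Node("Institute X."),
    "inst_2": Node("Institute Y."),
}

# A plan as a planner might emit it (assumed format): start nodes plus a relation path.
plan = {"start_nodes": ["paper_1"], "relation_path": ["written_by", "affiliated_with"]}

reached = follow_path(graph, plan["start_nodes"], plan["relation_path"])
evidence = collect_evidence(graph, reached)
```

In a full system the collected evidence would be passed to an LLM together with the question; here the sketch stops at evidence collection so that it stays self-contained.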
