KaPQA: Knowledge-Augmented Product Question-Answering
Swetha Eppalapally, Daksh Dangi, Chaithra Bhat, Ankita Gupta, Ruiyi Zhang, Shubham Agarwal, Karishma Bagga, Seunghyun Yoon, Nedim Lipka, Ryan A. Rossi, Franck Dernoncourt
TL;DR
KaPQA addresses the lack of domain-specific QA benchmarks by introducing two Photoshop/Acrobat HelpX datasets and a knowledge-driven RAG-QA framework that uses knowledge-base triples to reformulate queries. The approach aims to improve retrieval and long-form answer generation by grounding queries in domain knowledge. Experiments show improvements over standard RAG-QA baselines, though the gains are slight, which underlines the difficulty of real-world product QA. Key findings include the importance of a high-precision triple retriever and the nuanced effects of language model choice (e.g., GPT-3.5 vs. GPT-4o) on reformulation quality and retrieval. The work provides benchmarks for enterprise QA and shows how knowledge augmentation can narrow the gap between generic RAG-QA methods and industry-specific needs, while leaving room for improvement in robustness and in the evaluation of long-form outputs.
Abstract
Question-answering for domain-specific applications has recently attracted much interest due to the latest advancements in large language models (LLMs). However, accurately assessing the performance of these applications remains a challenge, mainly due to the lack of suitable benchmarks that effectively simulate real-world scenarios. To address this challenge, we introduce two product question-answering (QA) datasets focused on Adobe Acrobat and Photoshop products to help evaluate the performance of existing models on domain-specific product QA tasks. Additionally, we propose a novel knowledge-driven RAG-QA framework to enhance the performance of the models in the product QA task. Our experiments demonstrate that incorporating domain knowledge through query reformulation improves retrieval and generation performance compared to standard RAG-QA methods. This improvement, however, is slight, which illustrates the challenge posed by the introduced datasets.
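The general idea of triple-based query reformulation can be illustrated with a minimal sketch. This is not the authors' implementation: the toy triples, the toy documents, the function names, the subject-coverage score, and the template-based reformulation are all assumptions made for illustration. A simple string template stands in for the LLM reformulation step, and token overlap stands in for a learned retriever.

```python
import re

# Hypothetical toy knowledge base of (subject, relation, object) triples.
TRIPLES = [
    ("Crop tool", "located in", "Tools panel"),
    ("Crop tool", "used for", "trimming image edges"),
    ("Export PDF", "converts to", "Word document"),
]

# Hypothetical help documents standing in for a product help corpus.
DOCS = [
    "Select the Crop tool from the Tools panel and drag the handles to trim image edges.",
    "Use Export PDF to convert a file to a Word document.",
]


def tokens(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_triples(query, triples, threshold=0.5):
    """Keep triples whose subject is sufficiently covered by the query.

    The threshold is an assumed knob: raising it trades recall for precision.
    """
    q = tokens(query)
    kept = []
    for triple in triples:
        subject = tokens(triple[0])
        score = len(subject & q) / len(subject) if subject else 0.0
        if score >= threshold:
            kept.append(triple)
    return kept


def reformulate(query, triples):
    """Append retrieved facts to the query (a template stand-in for an LLM)."""
    if not triples:
        return query
    facts = "; ".join(" ".join(t) for t in triples)
    return f"{query} Related facts: {facts}."


def rank_docs(query, docs):
    """Rank documents by the number of tokens shared with the query."""
    q = tokens(query)
    return sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)


query = "How do I crop a picture?"
facts = retrieve_triples(query, TRIPLES)
expanded = reformulate(query, facts)
top_doc = rank_docs(expanded, DOCS)[0]
print(expanded)
print(top_doc)
```

In this toy setup the raw query shares exactly one token with each document, so token overlap alone cannot separate them; the appended facts add product vocabulary ("tool", "panel", "image", "edges") that breaks the tie in favor of the Crop tool document. An irrelevant triple would add misleading vocabulary in the same way, which is one intuition for why the precision of the triple retriever matters.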
