Comprehensive Evaluation for a Large Scale Knowledge Graph Question Answering Service

Saloni Potdar; Daniel Lee; Omar Attia; Varun Embar; De Meng; Ramesh Balaji; Chloe Seivwright; Eric Choi; Mina H. Farid; Yiwen Sun; Yunyao Li

TL;DR

Knowledge graph question answering (KGQA) systems must map natural language queries to structured queries over large knowledge graphs, but evaluating such end-to-end systems at industry scale is challenging due to component interactions and evolving data. This paper presents Chronos, a modular evaluation framework that combines automated data collection from logs and synthetic generation, human-in-the-loop annotation, a predictions scraper, and comprehensive metrics with loss bucketization and dashboards to support continuous monitoring. Chronos enables data-driven decisions, profiling end-to-end and component-level performance across diverse data slices and driving improvements in data quality and system reliability. The framework is demonstrated in a case study and designed to be adaptable to real-world enterprise KGQA deployments, with limitations and future directions discussed.

Abstract

Knowledge graph question answering (KGQA) systems answer factoid questions based on the data in a knowledge graph (KG). These systems are complex because they must understand the relations and entities in knowledge-seeking natural language queries and map them to structured queries against the KG. In this paper, we introduce Chronos, a comprehensive evaluation framework for KGQA at industry scale. It is designed to evaluate such a multi-component system comprehensively, focusing on (1) end-to-end and component-level metrics, (2) scalability to diverse datasets, and (3) a scalable approach to measuring system performance prior to release. We discuss the unique challenges of evaluating KGQA systems at industry scale, review the design of Chronos, and explain how it addresses these challenges. We demonstrate how it provides a basis for data-driven decisions and discuss the challenges of using it to measure and improve a real-world KGQA system.
