CReSt: A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents
Minsoo Khang, Sangjun Park, Teakgyu Hong, Dawoon Jung
TL;DR
CReSt presents a holistic benchmark for retrieval-augmented generation by simultaneously probing complex reasoning, refusal handling, precise citation, and understanding of structured (HTML) documents. It builds a bilingual English/Korean dataset from scratch, consisting of 2,245 QA instances and over 20,000 document chunks (roughly half HTML), with a multi-stage QA generation pipeline, negative chunk retrieval, and thorough human verification. The evaluation framework uses an LLM-as-a-judge, a unified scoring scheme, and diverse inference methods to reveal strengths and weaknesses across state-of-the-art models, showing substantial room for improvement in practical RAG deployment. The dataset and code release aim to accelerate research and development of robust, real-world RAG systems that can reason, ground, and responsibly handle uncertainty.
Abstract
Large Language Models (LLMs) have made substantial progress in recent years, yet evaluating their capabilities in practical Retrieval-Augmented Generation (RAG) scenarios remains challenging. In practical applications, LLMs must demonstrate complex reasoning, appropriately refuse to answer, provide precise citations, and effectively understand document layout. These capabilities are crucial for advanced task handling, uncertainty awareness, maintaining reliability, and structural understanding. While some prior work addresses these aspects individually, a unified framework that evaluates them collectively in practical RAG scenarios is still needed. To address this, we present CReSt (A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents), a benchmark designed to assess these key dimensions holistically. CReSt comprises 2,245 human-annotated examples in English and Korean, designed to capture practical RAG scenarios that require complex reasoning over structured documents. It also introduces a tailored evaluation methodology to comprehensively assess model performance in these critical areas. Our evaluation shows that even advanced LLMs struggle to perform consistently across these dimensions, underscoring key areas for improvement. We release CReSt to support further research and the development of more robust RAG systems. The dataset and code are available at: https://github.com/UpstageAI/CReSt.
