Query Generation Pipeline with Enhanced Answerability Assessment for Financial Information Retrieval

Hyunkyu Kim, Yeeun Yoo, Youngjun Kwak

TL;DR

This paper tackles the lack of domain-specific, multi-document information retrieval benchmarks for finance by introducing a systematic query-generation pipeline and an enhanced, reasoning-augmented answerability assessment. The pipeline produces KoBankIR, an 815-query Korean banking IR benchmark derived from 204 official documents, incorporating both single- and multi-document queries generated through topic-based merging, context deepening, and comparing/contrasting. The evaluation framework combines automatic reasoning-augmented scoring with human validation; it aligns more closely with human judgments than prior approaches and reveals that existing retrieval models struggle with complex multi-document banking queries. Overall, the work provides a practical methodology for constructing finance-focused IR benchmarks and highlights the need for more effective retrieval techniques in real-world banking scenarios.

Abstract

As financial applications of large language models (LLMs) gain attention, accurate Information Retrieval (IR) remains crucial for reliable AI services. However, existing benchmarks fail to capture the complex and domain-specific information needs of real-world banking scenarios. Building domain-specific IR benchmarks is costly and constrained by legal restrictions on using real customer data. To address these challenges, we propose a systematic methodology for constructing domain-specific IR benchmarks through LLM-based query generation. As a concrete implementation of this methodology, our pipeline combines single- and multi-document query generation with an enhanced, reasoning-augmented answerability assessment method, achieving stronger alignment with human judgments than prior approaches. Using this methodology, we construct KoBankIR, comprising 815 queries derived from 204 official banking documents. Our experiments show that existing retrieval models struggle with the complex multi-document queries in KoBankIR, demonstrating the value of our systematic approach for domain-specific benchmark construction and underscoring the need for improved retrieval techniques in financial domains.

Paper Structure

This paper contains 28 sections, 8 equations, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: The query generation pipeline. (a) Single-document query generation. (b) Multi-document query generation, comprising (1) Topic-based merging, (2) Context deepening, and (3) Comparing and contrasting. The generated queries undergo evaluation to ensure quality and answerability.
  • Figure 2: Hierarchical structure of the financial document dataset.
  • Figure 3: The process of the reasoning-augmented evaluation method.
  • Figure 4: Human evaluation interface.