FinStat2SQL: A Text2SQL Pipeline for Financial Statement Analysis

Quang Hung Nguyen, Phuong Anh Trinh, Phan Quoc Hung Mai, Tuan Phong Trinh

TL;DR

The paper presents FinStat2SQL, a lightweight, domain-specific text2sql pipeline tailored to Vietnamese financial standards (VAS) that combines a multi-agent NL-to-SQL flow with a domain-focused database and synthetic QA data. It demonstrates that smaller fine-tuned models can rival larger proprietary systems when paired with vector-based retrieval, self-correction, and decomposition strategies, achieving strong accuracy and sub-4-second latency for practical use in Vietnamese enterprises. The work also introduces a hybrid, finance-aware evaluation framework and a synthetic QA dataset. A fine-tuned 7B model performs competitively, while proprietary Gemini models achieve the highest accuracy (around 72%), which highlights the trade-off between cost and performance. The study notes limitations in scope and terminology and points to future work on broader coverage, cross-national standards, and predictive analytics.

Abstract

Despite the advancements of large language models, text2sql still faces many challenges, particularly with complex and domain-specific queries. In finance, database designs and financial reporting layouts vary widely between financial entities and countries, making text2sql even more challenging. We present FinStat2SQL, a lightweight text2sql pipeline enabling natural language queries over financial statements. Tailored to local standards like VAS, it combines large and small language models in a multi-agent setup for entity extraction, SQL generation, and self-correction. We build a domain-specific database and evaluate models on a synthetic QA dataset. A fine-tuned 7B model achieves 61.33% accuracy with sub-4-second response times on consumer hardware, outperforming GPT-4o-mini. FinStat2SQL offers a scalable, cost-efficient solution for financial analysis, making AI-powered querying accessible to Vietnamese enterprises.
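
The self-correction step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `income_statement` table, the ticker `AAA`, the revenue figure, and the functions `run_with_self_correction` and `toy_generator` are all hypothetical, and the generator is a hard-coded stand-in for a language model. The sketch also treats "query executes without a database error" as success, which is a weaker criterion than checking that the result is correct.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple

# Hypothetical generator signature: (question, previous_sql, error_message) -> SQL.
SqlGenerator = Callable[[str, Optional[str], Optional[str]], str]


def run_with_self_correction(
    conn: sqlite3.Connection,
    question: str,
    generate_sql: SqlGenerator,
    max_attempts: int = 3,
) -> Tuple[str, List[tuple]]:
    """Generate SQL, execute it, and feed any database error back to the generator."""
    sql: Optional[str] = None
    error: Optional[str] = None
    for _ in range(max_attempts):
        sql = generate_sql(question, sql, error)
        try:
            rows = conn.execute(sql).fetchall()
            return sql, rows
        except sqlite3.Error as exc:
            # Keep the error text so the next attempt can condition on it.
            error = str(exc)
    raise RuntimeError(f"no executable query after {max_attempts} attempts: {error}")


def toy_generator(question: str, previous_sql: Optional[str], error: Optional[str]) -> str:
    """Stand-in for a language model: misspells a column first, then repairs it."""
    if error is None:
        return "SELECT revenu FROM income_statement WHERE ticker = 'AAA' AND year = 2023"
    return "SELECT revenue FROM income_statement WHERE ticker = 'AAA' AND year = 2023"


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with one invented row of demo data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE income_statement (ticker TEXT, year INTEGER, revenue REAL)")
    conn.execute("INSERT INTO income_statement VALUES ('AAA', 2023, 1250.0)")
    return conn


if __name__ == "__main__":
    demo_conn = build_demo_db()
    final_sql, result = run_with_self_correction(
        demo_conn, "What was AAA's revenue in 2023?", toy_generator
    )
    print(final_sql)
    print(result)
```

In a real pipeline the generator would be a model call that receives the failed query and the error message in its prompt; capping the number of attempts keeps latency bounded when the model cannot repair the query.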

Paper Structure

This paper contains 24 sections, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Diagram of the FinStat2SQL Pipeline: The language model first parses and analyzes the user's prompt, retrieves relevant company database information, and then iteratively executes and debugs SQL queries until a correct result is obtained, which is returned as the final answer. More information can be found in the Approach section.
  • Figure 2: Diagram of Entity Extraction and Row Selection process
  • Figure 3: SQL generation process
  • Figure 4: Question Answering with FinStat2SQL