Agentar-DeepFinance-100K: A Large-Scale Financial Dataset via Systematic Chain-of-Thought Synthesis Optimization

Xiaoke Zhao, Zhaowen Zhou, Lin Chen, Lihong Wang, Zhiyi Huang, Kaiyuan Zheng, Yanjun Zheng, Xiyang Du, Longfei Liao, Jiawei Liu, Xiang Qi, Bo Zhang, Peng Zhang, Wei Wang, Zhe Li

TL;DR

Agentar-DeepFinance-100K targets robust financial reasoning by building a large-scale dataset of chain-of-thought (CoT) rationales through systematic synthesis. The pipeline combines Multi-perspective Knowledge Extraction (MKE) and Self-Corrective Rewriting (SCR) to produce deep, diverse CoT trajectories, while the CoT Cube investigation analyzes factors such as necessity, length, and synthesizer. Empirical results show that models trained on the dataset outperform baselines on financial reasoning benchmarks, with ablations highlighting the value of MKE and SCR. The dataset's rich metadata and expert-annotated in-domain data aim to bridge training and real-world financial reasoning tasks, offering a resource for advancing financially grounded LLMs.

Abstract

Recent advancements in large language models (LLMs) have demonstrated remarkable general reasoning capabilities, holding significant potential for applications in the financial domain, a field that requires robust and reliable reasoning. It has been demonstrated that distilling high-quality chain-of-thought (CoT) rationales from advanced general reasoning models offers a promising and efficient path to financial reasoning models. However, existing CoT synthesis methods suffer from shallow CoT sampling, leaving unexplored the question of how to construct a well-designed knowledge space for financial reasoning. In this paper, we present Agentar-DeepFinance-100K, a large-scale financial reasoning dataset characterized by its systematic CoT synthesis optimization. We first introduce a comprehensive CoT synthesis pipeline featuring Multi-perspective Knowledge Extraction (MKE) and Self-Corrective Rewriting (SCR) to generate exhaustive and deep financial reasoning trajectories. Furthermore, a systematic investigation, termed CoT Cube, is conducted to analyze critical factors that influence CoT effectiveness, such as necessity, length, and synthesizer, yielding valuable insights for high-quality financial CoT construction. Experiments demonstrate that models trained on Agentar-DeepFinance-100K achieve significant improvements on financial benchmarks. We publicly release Agentar-DeepFinance-100K, hoping to advance research on financial reasoning models.

Paper Structure

This paper contains 15 sections, 10 figures, and 6 tables.

Figures (10)

  • Figure 1: Overview of our proposed CoT synthesis pipeline (left) and systematic investigation of factors that impact CoT effectiveness (right).
  • Figure 2: Illustration of the format of Agentar-DeepFinance-100K, which comprises three components: (1) the question, (2) the solution, including the CoT and the final answer, and (3) metadata, which encompasses multi-dimensional annotations such as complexity, quality, and language.
  • Figure 3: Task composition.
  • Figure 4: Complexity distribution.
  • Figure 5: Illustration of the pipeline for constructing Agentar-DeepFinance-100K.
  • ...and 5 more figures
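
The three-part record format described in the Figure 2 caption (question, solution with CoT and final answer, metadata) can be sketched as a small data structure. This is only an illustrative sketch: the class names, field names, value types, and the sample record below are assumptions made here for illustration, not the released dataset's actual schema or contents.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Solution:
    cot: str     # chain-of-thought reasoning trajectory (hypothetical field name)
    answer: str  # final answer (hypothetical field name)


@dataclass
class Metadata:
    # The caption names these annotation dimensions; storing them as
    # strings is an assumption of this sketch.
    complexity: str
    quality: str
    language: str


@dataclass
class Record:
    question: str
    solution: Solution
    metadata: Metadata


def to_json(record: Record) -> str:
    """Serialize one record to a single JSON line."""
    return json.dumps(asdict(record), ensure_ascii=False)


def from_json(text: str) -> Record:
    """Rebuild a record from the JSON produced by to_json."""
    raw = json.loads(text)
    return Record(
        question=raw["question"],
        solution=Solution(**raw["solution"]),
        metadata=Metadata(**raw["metadata"]),
    )


# Invented sample record, not taken from the dataset.
example = Record(
    question="A bond pays a 5% annual coupon on a face value of 1,000. "
             "What is the annual coupon payment?",
    solution=Solution(
        cot="The coupon payment is the coupon rate times the face value: "
            "0.05 * 1000 = 50.",
        answer="50",
    ),
    metadata=Metadata(complexity="low", quality="high", language="en"),
)
```

Keeping the metadata in its own nested object, as the caption suggests, makes it straightforward to filter records by an annotation such as complexity or language without touching the question or solution text.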