Parameter-Efficient Instruction Tuning of Large Language Models For Extreme Financial Numeral Labelling

Subhendu Khatuya; Rajdeep Mukherjee; Akash Ghosh; Manjunath Hegde; Koustuv Dasgupta; Niloy Ganguly; Saptarshi Ghosh; Pawan Goyal

TL;DR

This work reframes extreme financial numeral labeling (XFNL) as a generative, instruction-tuned task and introduces FLAN-FinXC, a two-stage, parameter-efficient framework that first generates XBRL tag documentations for numerals and then maps them to final tags via a Tag Matcher. By leveraging label metadata and LoRA-based PEFT on FLAN-T5-Large, the approach delivers state-of-the-art Macro-F1 scores on FNXL ($66.23\%$) and FiNER, while exhibiting strong zero-shot capabilities ($58.89$ Macro-F1 on unseen labels) and robustness on rare labels. The study systematically evaluates model variants, ablations, and comparisons with ChatGPT, highlighting the key role of task-specific instruction prompts and metadata embeddings in extreme classification. The results demonstrate the practical value of instruction-tuned LLMs for scalable, finance-domain label mapping and point toward incorporating external financial knowledge and human-in-the-loop feedback for further improvements.
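Macro-F1 averages per-label F1 without weighting by label frequency, which is why it rewards robustness on rare labels. The following is a minimal sketch of the metric, not the paper's evaluation code. The tag names and predictions are illustrative, and averaging over every label seen in either the gold or predicted list is an assumed convention.

```python
from collections import Counter

def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Illustrative data: one frequent label and two rare ones.
gold = ["Others"] * 8 + ["Revenues", "DebtInstrumentFaceAmount"]
pred = ["Others"] * 8 + ["Revenues", "Others"]

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(accuracy)                        # 0.9
print(round(macro_f1(gold, pred), 4))  # 0.6471
```

A single miss on a rare label leaves accuracy at 0.9 but pulls Macro-F1 down to about 0.65, because that label contributes an F1 of zero to the average.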

Abstract

We study the problem of automatically annotating relevant numerals (GAAP metrics) occurring in financial documents with their corresponding XBRL tags. Unlike prior work, we investigate the feasibility of solving this extreme classification problem using a generative paradigm through instruction tuning of Large Language Models (LLMs). To this end, we leverage metric metadata information to frame our target outputs while proposing a parameter-efficient solution for the task using LoRA. We perform experiments on two recently released financial numeric labeling datasets. Our proposed model, FLAN-FinXC, achieves new state-of-the-art performance on both datasets, outperforming several strong baselines. We explain the better scores of our proposed model by demonstrating its capability on zero-shot tags as well as on the least frequently occurring tags. Also, even when we fail to predict the XBRL tags correctly, our generated output has substantial overlap with the ground truth in the majority of cases.
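For readers unfamiliar with LoRA, the standard low-rank update can be written as follows. This is the general formulation of the technique, not a description of the paper's specific configuration. The dimensions and rank in the worked count are illustrative assumptions.

```latex
% Frozen pretrained weight W_0, trainable low-rank factors B and A.
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)

% Trainable parameters per adapted matrix: r(d + k) instead of dk.
% Assumed example: d = k = 1024, r = 8
r(d + k) = 8 \cdot 2048 = 16384,
\qquad dk = 1048576,
\qquad \frac{16384}{1048576} \approx 1.56\%
```

Only $A$ and $B$ receive gradients, so under these assumed values each adapted matrix trains roughly 1.6% of the parameters a full update would.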

Paper Structure (20 sections, 4 figures, 11 tables)

Introduction
Related Works
Problem Formulation
Methodology
Baselines
Dataset & Evaluation Metrics
Experimental Setup
Main Results
Analysis
Performance on least frequent labels
Zero-Shot Capability
Ablation Study
Comparison with ChatGPT
Experimental comparison among models
Conclusion
...and 5 more sections

Figures (4)

Figure 1: Demonstrating the challenges in the Extreme Financial Numeral Labelling (XFNL) task. Within a financial statement, there are scenarios where every numeral is associated with a distinct XBRL tag, such as in Example 2 (6 distinct tags). Then, there are cases where a mixture of both relevant and irrelevant numerals (tagged 'Others') coexist in the same statement, often within a very limited context, such as in Examples 1 & 3.
Figure 2: FLAN-FinXC Architecture. FLAN-T5 takes as input a task-specific instruction, the financial statement, and a question with a designated target numeral. FLAN-T5-generated tag documentation subsequently flows into the Tag Matcher that predicts the final tag for the given numeral.
Figure 3: Relative improvement in performance achieved by FLAN-T5-Large with LoRA over the AttentionXML Pipeline for the least frequent labels, grouped into frequency buckets.
Figure 4: Comparing errors made by our best proposed model with those made by the closest baseline. Even when our model generates incorrect tags, most of them are semantically very similar to the ground-truth tags, whereas AttentionXML often generates completely unrelated tags.
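The two-stage flow described in the Figure 2 caption can be sketched roughly as below. This is a hedged illustration only: the prompt wording, the `TAG_DOCS` entries, and the function names are hypothetical, the generation step is replaced by a hard-coded string rather than a FLAN-T5 call, and the Tag Matcher is approximated with Jaccard token overlap, which may differ from the matcher the authors use.

```python
import re

# Hypothetical label metadata: tag -> documentation text (illustrative only).
TAG_DOCS = {
    "Revenues": "Amount of revenue recognized from goods sold and services rendered.",
    "DebtInstrumentFaceAmount": "Face (par) amount of the debt instrument at time of issuance.",
    "Others": "The numeral is not associated with any financial metric.",
}

def build_prompt(statement, numeral):
    """Stage 1 input: instruction + statement + question (wording is assumed)."""
    instruction = "Describe the financial metric that the given numeral refers to."
    question = f"What does the numeral {numeral} in the statement represent?"
    return f"{instruction}\nStatement: {statement}\nQuestion: {question}"

def tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def match_tag(generated_doc, tag_docs):
    """Stage 2 stand-in: return the tag whose documentation overlaps most."""
    gen = tokens(generated_doc)

    def jaccard(doc):
        ref = tokens(doc)
        union = gen | ref
        return len(gen & ref) / len(union) if union else 0.0

    return max(tag_docs, key=lambda tag: jaccard(tag_docs[tag]))

prompt = build_prompt("The company issued notes with a face amount of $500 million.", "500")
# Stand-in for the documentation text a fine-tuned model might generate.
generated = "Face amount of the debt instrument at issuance."
print(match_tag(generated, TAG_DOCS))  # DebtInstrumentFaceAmount
```

Matching generated text against tag documentation, rather than classifying directly over the label set, is what lets this kind of pipeline assign tags that never appeared during training, provided their documentation is available.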
