Fine-tuning Smaller Language Models for Question Answering over Financial Documents
Karmvir Singh Phogat, Sai Akhil Puranam, Sridhar Dasaratha, Chetan Harsha, Shashishekar Ramakrishna
TL;DR
This work investigates whether compact language models can match a large teacher model's performance on financial question answering that requires multi-hop numerical reasoning. The authors fine-tune small models (via LoRA) on teacher-produced data from FinQA, ConvFinQA, and TATQA so that they generate Python programs in the Program-of-Thought style, which are then executed externally to obtain the answer. They introduce a data-curation step that filters out flawed teacher code, and they evaluate concept understanding, entity extraction, and code generation, showing substantial improvements over zero-/few-shot baselines. A key finding is that relatively small training sets can yield near-teacher performance, enabling data-efficient deployment in finance-specific reasoning tasks.
Abstract
Recent research has shown that smaller language models can acquire substantial reasoning abilities when fine-tuned with reasoning exemplars crafted by a significantly larger teacher model. We explore this paradigm for the financial domain, focusing on the challenge of answering questions that require multi-hop numerical reasoning over financial texts. We assess the performance of several smaller models that have been fine-tuned to generate programs that encode the required financial reasoning and calculations. Our findings demonstrate that these fine-tuned smaller models approach the performance of the teacher model. To provide a granular analysis of model performance, we propose an approach to investigate the specific student model capabilities that are enhanced by fine-tuning. Our empirical analysis indicates that fine-tuning refines the student model's ability to express and apply the required financial concepts and adapts its entity extraction to the specific data format. In addition, we hypothesize and demonstrate that comparable financial reasoning capability can be induced using relatively small datasets.
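To make the program-generation setup concrete, the following is a minimal sketch of executing a generated program externally and of filtering flawed teacher outputs. It rests on assumptions not stated above: that each generated program stores its result in a variable named `ans`, and that a teacher program counts as flawed when it fails to run or its result disagrees with the gold answer. The helper names `run_program` and `curate`, the sample fields, and the tolerance are hypothetical, and the figures are made up for illustration.

```python
import math


def run_program(source: str):
    """Run a generated program and return its `ans` value, or None on failure.

    Note: exec() is not a sandbox; untrusted programs would need isolation.
    """
    namespace = {}
    try:
        exec(source, namespace)
    except Exception:
        return None
    return namespace.get("ans")


def curate(samples, rel_tol=1e-3):
    """Keep samples whose program runs and reproduces the gold answer (assumed criterion)."""
    kept = []
    for sample in samples:
        result = run_program(sample["program"])
        if isinstance(result, (int, float)) and math.isclose(
            result, sample["gold"], rel_tol=rel_tol
        ):
            kept.append(sample)
    return kept


# Illustrative teacher outputs for "What was the percentage change in revenue?"
samples = [
    {   # correct: divides by the earlier year's revenue
        "program": "rev_2019 = 1200.0\nrev_2018 = 1000.0\n"
                   "ans = (rev_2019 - rev_2018) / rev_2018 * 100",
        "gold": 20.0,
    },
    {   # flawed concept: divides by the later year's revenue
        "program": "rev_2019 = 1200.0\nrev_2018 = 1000.0\n"
                   "ans = (rev_2019 - rev_2018) / rev_2019 * 100",
        "gold": 20.0,
    },
    {   # flawed code: references an undefined variable
        "program": "ans = revenue_delta / 1000.0 * 100",
        "gold": 20.0,
    },
]

curated = curate(samples)
```

In this sketch only the first sample survives curation. The second applies the wrong financial concept and the third fails at execution time.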
