FiMI: A Domain-Specific Language Model for Indian Finance Ecosystem

Aboli Kathar; Aman Kumar; Anusha Kamath; Araveeti Srujan; Ashish Sharma; Chandra Bhushan; Dilip Asbe; Divya Sorate; Duddu Prasanth Kumar; Evan Acharya; Harsh Sharma; Hrithik Kadam; Kanishk Singla; Keyur Doshi; Kiran Praveen; Kolisetty Krishna SK; Krishanu Adhikary; Lokesh MPT; Mayurdeep Sonowal; Nadeem Shaikh; Navya Prakash; Nimit Kothari; Nitin Kukreja; Prashant Devadiga; Rakesh Paul; Ratanjeet Pratap Chauhan; Raunak Kalani; Raviraj Joshi; Shamanth MH; Shantanu Pandey; Shubham Soni; Siddharth Dixit; Smriti Jopat; Sunil Patel; Suraj Singh; Suvradip Paul; Tulasi Pilla; Utkarsh Vaidya; Vineeth Nambiar; Vishal Kanvaty; Yatharth Dedhia

TL;DR

FiMI introduces two domain-specific LLMs for India's financial ecosystem, FiMI Base and FiMI Instruct, both built on Mistral Small 24B. The authors use a multi-stage training pipeline (continuous pre-training on a large India-focused corpus, followed by instruction fine-tuning and domain-specific supervised fine-tuning with synthetic UPI Help data) so that the models internalize finance workflows, regulatory constraints, and multilingual interactions. They report a 20% improvement for FiMI Base over the Mistral Small 24B Base model on a finance reasoning benchmark and an 87% improvement for FiMI Instruct over the Mistral Small 24B Instruct model on domain-specific tool-calling, while performance on general benchmarks stays comparable to models of similar size. The work is applied to NPCI's UPI Help to provide reliable, compliant, and multilingual support, and it outlines a replicable blueprint for domain adaptation in regulated financial settings using synthetic data, tool usage, and safety-focused post-training.

Abstract

We present FiMI (Finance Model for India), a domain-specialized financial language model developed for Indian digital payment systems. We develop two model variants: FiMI Base and FiMI Instruct. FiMI adapts the Mistral Small 24B architecture through a multi-stage training pipeline, beginning with continuous pre-training on 68 billion tokens of curated financial, multilingual (English, Hindi, Hinglish), and synthetic data. This is followed by instruction fine-tuning and domain-specific supervised fine-tuning focused on multi-turn, tool-driven conversations that model real-world workflows, such as transaction disputes and mandate lifecycle management. Evaluations reveal that FiMI Base achieves a 20% improvement over the Mistral Small 24B Base model on a finance reasoning benchmark, while FiMI Instruct outperforms the Mistral Small 24B Instruct model by 87% on domain-specific tool-calling. Moreover, FiMI achieves these significant domain gains while maintaining comparable performance to models of similar size on general benchmarks.

Paper Structure (85 sections, 5 equations, 24 figures, 22 tables)

Introduction
Methodology
Why Context Engineering Fails for NPCI?
Why Mistral Small 24B?
Training Phases
Continuous Pre-Training (CPT)
Data Preparation
Data Sourcing
Dataset Profiling
Data Pre-processing
Data Composition
Generation of Question-Answer Pairs (Auxiliary Preparation)
Evaluation
Evaluation Datasets
Measuring Strategies
...and 70 more sections

Figures (24)

Figure 1: Analysis of token distribution across the selected corpora, illustrating the effectiveness of the filtering pipeline in isolating high-alignment segments from diverse sources like Dolma, NeMo, FineWeb, and CommonPile.
Figure 2: Reasoning Density and Structural Quality Across Math and Code Corpora. NeMo and ProofPile-2 exhibit high structured reasoning, while code datasets balance real-world diversity with curated synthetic structure.
Figure 3: Data Curation and CPT Pipeline. End-to-end workflow illustrating raw data ingestion through multi-stage pre-processing, quality filtering, anonymization, and domain-specific partitioning for the construction of general-purpose and specialized financial corpora.
Figure 4: Sunburst diagram representing the payments ecosystem taxonomy. This topic-modelling representation was created for all five main domains.
Figure 5: Automated Evaluation Pipeline. Workflow illustrating the two-stage assessment process for reasoning-based tasks, highlighting the generation phase using pretrained model APIs and the subsequent scoring phase using the DeepEval framework with a dedicated judge model.
...and 19 more figures
