PARAMANU-GANITA: Can Small Math Language Models Rival with Large Language Models on Mathematical Reasoning?
Mitodru Niyogi, Arnab Bhattacharya
TL;DR
Paramanu-Ganita shows that a domain-specific, decoder-only LM trained from scratch on a mathematics-focused corpus can rival larger LLMs in mathematical reasoning while drastically reducing training cost and environmental impact. By combining a purpose-built tokenizer, CoT instruction tuning, and careful data curation, the 208M-parameter model achieves strong results on GSM8K, MATH, and related benchmarks, outperforming many 7B–62B LLMs despite orders of magnitude fewer parameters. The work challenges the bigger-is-better paradigm by demonstrating that high-quality domain data and reasoning prompts can yield competitive performance with a fraction of the compute, around 170 A100 hours in total. Future work will broaden the corpus to include ArXiv math papers and apply reinforcement learning alignment to push performance further, with ongoing emphasis on efficiency and accessibility.
Abstract
In this paper, we study whether domain-specific pretraining of small generative language models (SLMs) from scratch, combined with a domain-specialised tokenizer and Chain-of-Thought (CoT) instruction fine-tuning, yields competitive performance on mathematical reasoning compared to LLMs. We also ask whether this approach is environmentally sustainable and highly cost-efficient. To address these research questions, we present Paramanu-Ganita, a novel 208 million-parameter decoder-only autoregressive SLM for mathematics. We pretrained it from scratch on 31.5 billion tokens for 170 A100 hours, using a context size of 4096, on a mixed mathematical corpus consisting of web pages, source code, textbooks, CoT-templatised StackOverflow QA pairs, and mathematical lecture notes in LaTeX curated by us. We also trained a BPE tokenizer specialised for math and code. We proposed and performed CoT instruction fine-tuning of Paramanu-Ganita on the MetaMathQA dataset. Despite being 34 times smaller than 7B LLMs, Paramanu-Ganita outperforms generalist LLMs by approximately 30 percentage points, and even math-specialised LLMs by 3-23 percentage points, in GSM8K test accuracy. On the MATH benchmark, Paramanu-Ganita outperformed the compared models by 6-8 percentage points. On LogiQA, MMLU (high school and college level), and the competitive-exam-level AGIEVAL benchmarks (AQuA-RAT, SAT-Math), Paramanu-Ganita outperformed others by 1-4%. Our model is available at https://huggingface.co/gyanai/paramanu-ganita-208M-hf .
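The scale figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the constants are taken from the abstract, while the function names and the derived quantities (tokens per parameter, tokens per GPU-second) are our own and are not reported by the authors. The throughput estimate additionally assumes that all 170 A100 hours were spent on the 31.5 billion pretraining tokens.

```python
# Figures quoted in the abstract; everything derived from them is illustrative.
SLM_PARAMS = 208e6        # Paramanu-Ganita parameter count
LLM_7B_PARAMS = 7e9       # size of the 7B LLMs used for comparison
PRETRAIN_TOKENS = 31.5e9  # pretraining tokens
A100_HOURS = 170          # reported pretraining budget in A100 hours


def size_ratio(large_params: float, small_params: float) -> float:
    """How many times larger one model is than another."""
    return large_params / small_params


def tokens_per_parameter(tokens: float, params: float) -> float:
    """Pretraining tokens seen per model parameter."""
    return tokens / params


def tokens_per_gpu_second(tokens: float, gpu_hours: float) -> float:
    """Average throughput, assuming the whole budget went to these tokens."""
    return tokens / (gpu_hours * 3600)


if __name__ == "__main__":
    ratio = size_ratio(LLM_7B_PARAMS, SLM_PARAMS)
    print(f"7B vs 208M size ratio: {ratio:.1f}x")
    tpp = tokens_per_parameter(PRETRAIN_TOKENS, SLM_PARAMS)
    print(f"Tokens per parameter: {tpp:.0f}")
    tps = tokens_per_gpu_second(PRETRAIN_TOKENS, A100_HOURS)
    print(f"Tokens per A100-second: {tps:.0f}")
```

The size ratio comes out slightly under 34, which matches the "34 times smaller" figure after rounding.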
