Technical Challenges in Maintaining Tax Prep Software with Large Language Models
Sina Gogani-Khiabani, Varsha Dewangan, Nina Olson, Ashutosh Trivedi, Saeid Tizpaz-Niari
TL;DR
This paper addresses automating maintenance of tax preparation software as US tax law evolves. It proposes an LLM-based workflow that translates IRS amendment language into executable code, followed by automated ranking and metamorphic validation within a TenForty framework. The authors introduce a weighted ranking scheme combining CodeBertScore and MajorityVoteScore (weights $0.6$ and $0.4$) and validate updates through metamorphic testing, achieving iterative refinement via a Feedback Prompt Generator. Experiments show that providing prior-year code context improves accuracy and that GPT-4 generally outperforms GPT-3.5 under error-tolerance thresholds expressed as $\delta$, suggesting a feasible path toward autonomy in tax software maintenance.
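The weighted ranking step mentioned above can be illustrated with a minimal sketch. It assumes the weights pair with the scores in the order listed ($0.6$ for CodeBertScore, $0.4$ for MajorityVoteScore) and that both scores are already computed and lie in $[0, 1]$. How each score is obtained is not described here, so the `Candidate` class and the function names are hypothetical and not the authors' implementation.

```python
from dataclasses import dataclass

# Weights from the summary; the pairing with each score is assumed
# from the order in which they are listed.
W_CODEBERT = 0.6
W_MAJORITY = 0.4


@dataclass
class Candidate:
    """One LLM-generated code update (hypothetical container)."""
    name: str
    codebert_score: float       # assumed to lie in [0, 1]
    majority_vote_score: float  # assumed to lie in [0, 1]


def combined_score(c: Candidate) -> float:
    """Weighted sum of the two scores."""
    return W_CODEBERT * c.codebert_score + W_MAJORITY * c.majority_vote_score


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Return candidates sorted from highest to lowest combined score."""
    return sorted(candidates, key=combined_score, reverse=True)


candidates = [
    Candidate("a", codebert_score=0.9, majority_vote_score=0.5),
    Candidate("b", codebert_score=0.8, majority_vote_score=1.0),
]
ranked = rank(candidates)
```

In this toy input, candidate `b` ranks first: its lower CodeBertScore is outweighed by full agreement in the majority vote.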
Abstract
As US tax law evolves to adapt to ever-changing politico-economic realities, tax preparation software plays a significant role in helping taxpayers navigate these complexities. The dynamic nature of tax regulations makes it challenging to maintain tax software artifacts accurately and on time. The state of the art in maintaining tax prep software is time-consuming and error-prone, as it involves manual code analysis combined with expert interpretation of tax law amendments. We posit that the rigor and formality of tax amendment language, as expressed in IRS publications, make it amenable to automatic translation into executable specifications (code). Our research efforts focus on identifying, understanding, and tackling the technical challenges in leveraging Large Language Models (LLMs), such as ChatGPT and Llama, to faithfully extract code differentials from IRS publications and automatically integrate them with the prior version of the code, thereby automating tax prep software maintenance.
