Technical Challenges in Maintaining Tax Prep Software with Large Language Models

Sina Gogani-Khiabani, Varsha Dewangan, Nina Olson, Ashutosh Trivedi, Saeid Tizpaz-Niari

TL;DR

This paper addresses automating maintenance of tax preparation software as US tax law evolves. It proposes an LLM-based workflow that translates IRS amendment language into executable code, followed by automated ranking and metamorphic validation within a TenForty framework. The authors introduce a weighted ranking scheme combining CodeBertScore and MajorityVoteScore (weights $0.6$ and $0.4$) and validate updates through metamorphic testing, achieving iterative refinement via a Feedback Prompt Generator. Experiments show that providing prior-year code context improves accuracy and that GPT-4 generally outperforms GPT-3.5 under error-tolerance thresholds expressed as $\delta$, suggesting a feasible path toward autonomy in tax software maintenance.
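The weighted ranking step can be illustrated with a minimal sketch. It assumes both scores are normalized to $[0, 1]$ and that the weight $0.6$ applies to CodeBertScore and $0.4$ to MajorityVoteScore, following the order in which they are listed above; the summary does not state the assignment explicitly. The `Candidate` class, the function names, and the example scores are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List

# Assumption: 0.6 weights CodeBertScore and 0.4 weights MajorityVoteScore.
W_CODEBERT = 0.6
W_MAJORITY = 0.4


@dataclass
class Candidate:
    """One LLM-generated code candidate with its two scores (hypothetical type)."""
    name: str
    codebert_score: float       # assumed to lie in [0, 1]
    majority_vote_score: float  # assumed to lie in [0, 1]


def combined_score(c: Candidate) -> float:
    """Weighted sum of the two scores."""
    return W_CODEBERT * c.codebert_score + W_MAJORITY * c.majority_vote_score


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Return candidates ordered from highest to lowest combined score."""
    return sorted(candidates, key=combined_score, reverse=True)


# Illustrative, made-up scores.
candidates = [
    Candidate("a", codebert_score=0.9, majority_vote_score=0.5),
    Candidate("b", codebert_score=0.7, majority_vote_score=1.0),
    Candidate("c", codebert_score=0.5, majority_vote_score=0.5),
]
ranked = rank(candidates)
```

In this sketch the top-ranked candidates would then be passed on to metamorphic validation; how many are kept, and how the two scores are actually computed, are details of the paper that the sketch does not model.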

Abstract

As US tax law evolves to adapt to ever-changing politico-economic realities, tax preparation software plays a significant role in helping taxpayers navigate these complexities. The dynamic nature of tax regulations makes it difficult to maintain tax software artifacts accurately and in a timely manner. The state of the art in maintaining tax prep software is time-consuming and error-prone, as it involves manual code analysis combined with expert interpretation of tax law amendments. We posit that the rigor and formality of tax amendment language, as expressed in IRS publications, make it amenable to automatic translation into executable specifications (code). Our research efforts focus on identifying, understanding, and tackling the technical challenges in leveraging Large Language Models (LLMs), such as ChatGPT and Llama, to faithfully extract code differentials from IRS publications and automatically integrate them with the prior version of the code to automate tax prep software maintenance.

Paper Structure

This paper contains 9 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: TenForty: General Framework using Disability and EITC benefits as examples. Our approach specifies the correctness requirements from relevant tax policies. Then, it generates random test cases and infers decision trees to localize circumstances under which the software fails to satisfy metamorphic requirements.
  • Figure 2: Updating tax brackets without prior software code. Prior code is listed only for clarity to understand CodeBertScore calculation logic; it does not impact the code generation process.
  • Figure 3: Updating Tax Brackets with Prior Software Code.
  • Figure 4: AI-assisted framework to update tax software following the updated tax policies.
  • Figure 5: Scenarios without prior code for 4 top ranked candidates per ChatGPT-3.5/4.0.
  • ...and 1 more figure