Language Models and Logic Programs for Trustworthy Tax Reasoning

William Jurayj, Nils Holzenberger, Benjamin Van Durme

TL;DR

The paper tackles the challenge of trustworthy statutory tax reasoning by pairing large language models with a symbolic solver to compute tax obligations. It reframes statutory reasoning as semantic parsing, translating statutes and cases into executable logic programs (Prolog) and evaluating on the SARA dataset. Key findings show that hybrid neuro-symbolic setups, especially with gold statutes and exemplars, can dramatically reduce error costs and bring break-even pricing below real-world filing costs (e.g., $\$15.78$), while maintaining auditability through symbolic execution. This approach promises more accessible, reliable tax guidance and highlights the practical viability of scalable, auditable tax-assistance systems.

Abstract

According to the United States Internal Revenue Service, ``the average American spends $\$270$ and 13 hours filing their taxes''. Even beyond the U.S., tax filing requires complex reasoning, combining application of overlapping rules with numerical calculations. Because errors can incur costly penalties, any automated system must deliver high accuracy and auditability, making modern large language models (LLMs) poorly suited for this task. We propose an approach that integrates LLMs with a symbolic solver to calculate tax obligations. We evaluate variants of this system on the challenging StAtutory Reasoning Assessment (SARA) dataset, and include a novel method for estimating the cost of deploying such a system based on real-world penalties for tax errors. We further show how up-front translation of plain-text rules into formal logic programs, combined with intelligently retrieved exemplars for formal case representations, can dramatically improve performance on this task and reduce costs to well below real-world averages. Our results demonstrate the effectiveness of applying semantic parsing methods to statutory reasoning, and show promising economic feasibility of neuro-symbolic architectures for increasing access to reliable tax assistance.

Paper Structure

This paper contains 16 sections, 1 equation, 5 figures, 3 tables.

Figures (5)

  • Figure 1: A taxpayer confronted with a tax question might choose between an inexpensive AI preparer and a costlier human professional. The decision considers trade-offs between cost, convenience, and confidence in the result.
  • Figure 2: Methods for solving. Top Left: Plain-text for statutes and a case is fed into a language model, along with the instruction to calculate a person's tax obligation. Top Right: Statutes and a case are fed into the model as before, but it is instructed to convert these into a logic program which calculates a person's tax obligation. If the SWI-Prolog engine fails to execute the program, the case is considered unanswered. Bottom: A language model parses a case's facts into Prolog, conditioned on gold parses of the most relevant cases and of the rules contained in the statutes. The symbolic solver imports the gold parses of the statutes before attempting to execute the generated parse of the case. Note that unlike the approaches above it, this requires gold symbolic representations of both the statutes and a representative selection of correctly-decided cases.
  • Figure 3: Number of correct and incorrect solutions produced by each solution method, for large chat- and reasoning-optimized models (served by DeepSeek and OpenAI).
  • Figure 4: Success and failure rates of method mixtures. The top right corner counts the average number of successes yielded by each method combination, and the bottom left corner counts the average number of failures, for models over 100 billion parameters optimized for reasoning (DeepSeek R1 and OpenAI o3) and chat (DeepSeek V3 and GPT-4.1).
  • Figure 5: Standalone parsing success rate improves with model size. While smaller reasoning models fail to solve nearly every problem in this setting, larger reasoning models show improvement with model size.
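
The Figure 2 caption describes a control flow in which a generated logic program is handed to a symbolic solver, and a case is counted as unanswered when the program fails to execute. A minimal Python sketch of that control flow follows. It is an illustration only, not the authors' implementation: the names `solve_with_abstention`, `ExecutionError`, `toy_parser`, and `toy_executor` are hypothetical, and the toy stand-ins replace the language model and the SWI-Prolog engine so the example runs on its own.

```python
from typing import Callable, Optional


class ExecutionError(Exception):
    """Hypothetical error raised when the solver cannot run a program."""


def solve_with_abstention(
    parse_to_program: Callable[[str, str], str],
    execute: Callable[[str], float],
    statutes: str,
    case: str,
) -> Optional[float]:
    """Return the computed tax obligation, or None if execution fails."""
    # Step 1: a parser (a language model in the paper) emits a logic program.
    program = parse_to_program(statutes, case)
    try:
        # Step 2: a symbolic solver (SWI-Prolog in the paper) runs the program.
        return execute(program)
    except ExecutionError:
        # Step 3: a program that fails to execute leaves the case unanswered.
        return None


# Toy stand-ins (assumptions) so the sketch runs without a model or Prolog.
def toy_parser(statutes: str, case: str) -> str:
    return f"tax({case})"


def toy_executor(program: str) -> float:
    if "malformed" in program:
        raise ExecutionError(program)
    return 1234.0


answered = solve_with_abstention(toy_parser, toy_executor, "statutes", "alice_2017")
unanswered = solve_with_abstention(toy_parser, toy_executor, "statutes", "malformed case")
```

Returning `None` instead of a guessed number keeps unanswered cases distinguishable from wrong answers, which is the distinction the caption draws between the direct-answer and program-generating methods.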