Can LLMs Identify Tax Abuse?

Andrew Blair-Stanek, Nils Holzenberger, Benjamin Van Durme

TL;DR

This work probes whether large language models can discover, verify, and generate U.S. tax-minimization strategies by applying tax-law authorities to real-world cases. It introduces Shelter Check, a domain-expert dataset of 36 strategies, each with authorities, background facts, goals, steps, analyses, and adversarial steps, and evaluates multiple LLMs on analysis verification, goal verification, adversarial robustness, step-cloze completion, and free-form strategy generation. State-of-the-art LLMs show mixed performance: they occasionally produce novel outputs, notably a novel strategy from o1-preview, but exhibit biases and limited reliability on the adversarial and from-scratch generation tasks. The results illustrate the potential of LLMs to help tax agencies identify and counter tax abuse, while highlighting the need for careful prompting, domain expertise, and safeguards to prevent misuse. The dataset and code are released with access to most strategies restricted, reflecting an emphasis on research that supports tax enforcement rather than evasion.

Abstract

We investigate whether large language models can discover and analyze U.S. tax-minimization strategies. This real-world domain challenges even seasoned human experts, and progress can reduce tax revenue lost from well-advised, wealthy taxpayers. We evaluate the most advanced LLMs on their ability to (1) interpret and verify tax strategies, (2) fill in gaps in partially specified strategies, and (3) generate complete, end-to-end strategies from scratch. This domain should be of particular interest to the LLM reasoning community: unlike synthetic challenge problems or scientific reasoning tasks, U.S. tax law involves navigating hundreds of thousands of pages of statutes, case law, and administrative guidance, all updated regularly. Notably, LLM-based reasoning identified an entirely novel tax strategy, highlighting these models' potential to revolutionize tax agencies' fight against tax abuse.

Paper Structure

This paper contains 21 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Diagram of a novel tax strategy generated by o1-preview during our experiments. Creating such strategies normally requires extensive effort from expensive, specialized domain experts. While we find no LLMs consistently generating workable strategies, to our knowledge this is the first time an LLM has invented a tax strategy.
  • Figure 2: Step-Cloze Grades with 0-, 1-, and 2-shot prompting. Grades could be 0 (worst), 1, 2, or 3 (best). Mean grades appear over each bar. The three reasoning models outperformed the baseline llama-3.3-70B.
  • Figure 3: Domain-expert grades of the strategies LLMs generated from scratch. Grades could be 0 (worst), 1, 2, or 3 (best). More of o1-preview's strategies received the highest grade than claude-3.5's did.
  • Figure 4: Confusion matrices showing how well models' grading of free-form strategies compares to the domain-expert grades.