BR-TaxQA-R: A Dataset for Question Answering with References for Brazilian Personal Income Tax Law, including case law

Juvenal Domingos Júnior, Augusto Faria, E. Seiti de Oliveira, Erick de Brito, Matheus Teotonio, Andre Assumpção, Diedre Carmo, Roberto Lotufo, Jayr Pereira

TL;DR

The paper addresses the scarcity of Brazilian Portuguese legal QA data and the need for source-grounded AI in high-stakes tax law. It introduces BR-TaxQA-R, a dataset that combines 715 official Q&As from the RFB (Receita Federal do Brasil, Brazil's Internal Revenue Service) with normative texts and CARF case law to enable end-to-end RAG evaluation. The authors build a domain-specific RAG pipeline with hierarchical segmentation and legal prompting, and benchmark it against commercial tools using RAGAS metrics. The custom pipeline scores higher on Response Relevancy, while the commercial tools score higher on Factual Correctness and fluency. The work highlights this trade-off between legally grounded generation and linguistic fluency, stresses that human expert evaluation is still needed to confirm legal validity, and outlines directions for extending the dataset with multi-year rulings and improved human-in-the-loop validation.

Abstract

This paper presents BR-TaxQA-R, a novel dataset designed to support question answering with references in the context of Brazilian personal income tax law. The dataset contains 715 questions from the 2024 official Q&A document published by Brazil's Internal Revenue Service, enriched with statutory norms and administrative rulings from the Conselho Administrativo de Recursos Fiscais (CARF). We implement a Retrieval-Augmented Generation (RAG) pipeline using OpenAI embeddings for searching and GPT-4o-mini for answer generation. We compare different text segmentation strategies and benchmark our system against commercial tools such as ChatGPT and Perplexity.ai using RAGAS-based metrics. Results show that our custom RAG pipeline outperforms commercial systems in Response Relevancy, indicating stronger alignment with user queries, while commercial models achieve higher scores in Factual Correctness and fluency. These findings highlight a trade-off between legally grounded generation and linguistic fluency. Crucially, we argue that human expert evaluation remains essential to ensure the legal validity of AI-generated answers in high-stakes domains such as taxation. BR-TaxQA-R is publicly available at https://huggingface.co/datasets/unicamp-dl/BR-TaxQA-R.
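
To make the retrieve-then-generate flow described in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy bag-of-words `embed` function in place of the OpenAI embeddings, and it stops at prompt assembly instead of calling GPT-4o-mini. The segment texts, the source labels (`norm-A`, `norm-B`, `ruling-C`), and all function names are hypothetical placeholders invented for illustration; none of them come from BR-TaxQA-R.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, segments, k=2):
    """Return the k segments most similar to the question."""
    q = embed(question)
    ranked = sorted(segments, key=lambda s: cosine(q, embed(s["text"])), reverse=True)
    return ranked[:k]

def build_prompt(question, retrieved):
    """Assemble a prompt that asks for an answer grounded in the cited sources."""
    context = "\n".join(f"[{s['source']}] {s['text']}" for s in retrieved)
    return (
        "Answer using only the sources below and cite them.\n"
        f"{context}\n"
        f"Question: {question}"
    )

# Hypothetical segments (invented placeholders, not dataset content).
SEGMENTS = [
    {"source": "norm-A", "text": "medical expenses may be deducted from taxable income"},
    {"source": "norm-B", "text": "rental income received from individuals is taxed monthly"},
    {"source": "ruling-C", "text": "deduction of medical expenses requires proof of payment"},
]

if __name__ == "__main__":
    question = "are medical expenses deductible"
    top = retrieve(question, SEGMENTS, k=2)
    print(build_prompt(question, top))
```

In a real pipeline the prompt would be sent to the generation model, and the source labels carried through retrieval are what allow the answer to cite the norms and rulings it relied on.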

Paper Structure

This paper contains 18 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Illustration of the trade-off between contextual precision and linguistic fluency.