CAPITU: A Benchmark for Evaluating Instruction-Following in Brazilian Portuguese with Literary Context

Giovana Kerche Bonás; Roseval Malaquias Junior; Marcos Piau; Thiago Laitz; Thales Sales Almeida; Hugo Abonizio; Celio Larcher; Ramon Pires; Rodrigo Nogueira

Abstract

We introduce CAPITU, a benchmark for evaluating instruction-following capabilities of Large Language Models (LLMs) in Brazilian Portuguese. Unlike existing benchmarks that focus on English or use generic prompts, CAPITU contextualizes all tasks within eight canonical works of Brazilian literature, combining verifiable instruction constraints with culturally grounded content. The benchmark comprises 59 instruction types organized into seven categories, all designed to be automatically verifiable without requiring LLM judges or human evaluation. Instruction types include Portuguese-specific linguistic constraints (word termination patterns like -ando/-endo/-indo, -inho/-inha, -mente) and structural requirements. We evaluate 18 state-of-the-art models across single-turn and multi-turn settings. Our results show that frontier reasoning models achieve strong performance (GPT-5.2 with reasoning: 98.5% strict accuracy), while Portuguese-specialized models offer competitive cost-efficiency (Sabiazinho-4: 87.0% at $0.13 vs. Claude-Haiku-4.5: 73.5% at $1.12). Multi-turn evaluation reveals significant variation in constraint persistence, with conversation-level accuracy ranging from 60% to 96% across models. We identify specific challenges in morphological constraints, exact counting, and constraint persistence degradation across turns. We release the complete benchmark, evaluation code, and baseline results to facilitate research on instruction-following in Portuguese.
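To make "automatically verifiable" concrete, the sketch below shows one way a word-termination constraint could be checked with plain string matching and no LLM judge. It is an illustrative assumption, not the released CAPITU evaluation code: the function names, the tokenization rule, and the "at least N words" form of the constraint are all hypothetical.

```python
import re

# Hypothetical tokenizer: runs of Unicode letters (digits and underscores excluded).
# This is an assumption for illustration, not the benchmark's tokenization.
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words_with_suffix(text, suffixes):
    """Count words in `text` that end with any of the given suffixes."""
    words = WORD_RE.findall(text.lower())
    return sum(1 for word in words if word.endswith(tuple(suffixes)))


def satisfies_min_suffix_count(text, suffixes, minimum):
    """Return True if at least `minimum` words end with one of `suffixes`.

    Pure suffix matching is deterministic but over-counts: "quando" ends in
    "-ando" without being a gerund. A real checker would need to decide
    whether such words count.
    """
    return count_words_with_suffix(text, suffixes) >= minimum


GERUND_SUFFIXES = ("ando", "endo", "indo")
response = "Capitu estava olhando o mar, sorrindo e pensando calmamente."

gerund_count = count_words_with_suffix(response, GERUND_SUFFIXES)
adverb_count = count_words_with_suffix(response, ("mente",))
```

For the sample sentence, the gerund check matches "olhando", "sorrindo", and "pensando", and the "-mente" check matches "calmamente". Because the check is a pure function of the response text, the same response always receives the same verdict.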

Paper Structure (68 sections, 1 equation, 4 figures, 6 tables)

Introduction
Contributions
Related Work
Instruction-Following Evaluation
Native and Culturally-Grounded Benchmarks
Instruction-Tuning and Evaluation in Portuguese
Positioning of CAPITU
Methodology
Design Principles
Instruction Taxonomy
Portuguese-Specific Instructions
Structural Instructions
Literary Corpus
Prompt Generation Pipeline
Architecture Overview
...and 53 more sections

Figures (4)

Figure 1: Overview of CAPITU's main contributions.
Figure 2: Overview of the CAPITU methodology.
Figure 3: Example of an assembled prompt showing task template (top) and instruction constraints (bottom).
Figure 4: Complete coherence evaluation prompt structure. The system message defines the judge's role; the user message provides the original prompt, the model's response, and literary metadata as context; the judge returns a JSON score with justification.
