The Format Tax

Ivan Yee Lee, Loris D'Antoni, Taylor Berg-Kirkpatrick

Abstract

Asking a large language model to respond in JSON should be a formatting choice, not a capability tax. Yet we find that structured output requirements -- JSON, XML, LaTeX, Markdown -- substantially degrade reasoning and writing performance across open-weight models. The research response has focused on constrained decoding, but sampling bias accounts for only a fraction of the degradation. The dominant cost enters at the prompt: format-requesting instructions alone cause most of the accuracy loss, before any decoder constraint is applied. This diagnosis points to a simple principle: decouple reasoning from formatting. Whether by generating freeform first and reformatting in a second pass, or by enabling extended thinking within a single generation, separating the two concerns substantially recovers lost accuracy. Across six open-weight models, four API models, four formats, and tasks spanning math, science, logic, and writing, decoupling recovers most lost accuracy. Notably, most recent closed-weight models show little to no format tax, suggesting the problem is not inherent to structured generation but a gap that current open-weight models have yet to close. Code is available at https://github.com/ivnle/the-format-tax.
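The two-pass decoupling described above (freeform answer first, reformat second) can be sketched as a small pipeline. This is a minimal illustration of the control flow, not the paper's implementation: `ask_model` is a hypothetical stand-in for any prompt-to-text LLM call, and the prompt wording is an assumption.

```python
import json

def answer_then_format(ask_model, question):
    """Two-pass decoupling: reason in freeform, then reformat.

    `ask_model` is any callable mapping a prompt string to a response
    string (hypothetical; stands in for an LLM API call). Pass 1 carries
    no format instruction, so reasoning is unconstrained; pass 2 asks
    only for transcription into JSON, not for re-solving the problem.
    """
    freeform = ask_model(
        "Answer the question in plain text.\n\n" + question
    )
    structured = ask_model(
        'Copy the answer below into JSON of the form {"answer": ...}. '
        "Do not re-solve the problem.\n\n" + freeform
    )
    return json.loads(structured)

# Toy stand-in model so the control flow is runnable (not a real LLM):
def toy_model(prompt):
    if "plain text" in prompt:
        return "The answer is 42."
    return '{"answer": "The answer is 42."}'
```

The key design point is that the format requirement never appears in the same prompt as the reasoning task, so any prompt-induced tax falls only on the (easier) transcription pass.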

Paper Structure

This paper contains 33 sections, 3 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Source attribution of the format tax. (a) Each point is a (model, task, format) combination; the x-axis shows accuracy drop from the format-requesting prompt alone (GET·), while the y-axis adds grammar-constrained decoding (GETC). Points near the diagonal indicate GCD adds little beyond the prompt effect. (b) Parallel McNemar tests ($p < 0.05$) on 72 cells from open-weight models classify each by whether degradation is already present under the format-requesting prompt alone, added by GCD, or both.
  • Figure 2: Per-question flip rate (freeform-correct $\to$ structured-incorrect) by token delta decile. The U-shaped pattern is nearly identical with and without GCD.
  • Figure 3: Per-question thinking tokens under freeform vs. structured output. Each point is one question; pairs where either side hits the 8,192-token ceiling are excluded. The strong correlation ($r = 0.72$, $n = 28{,}000$+) indicates that thinking effort is driven by question difficulty, not format constraints.
  • Figure 4: Per-model flip rate by token delta decile. The U-shaped pattern from Figure 2 holds across all six models: questions where structured output most compresses or expands token count relative to freeform show the highest flip rates.
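The per-cell McNemar tests in Figure 1(b) compare paired correctness of the same questions under two conditions, so the test statistic depends only on the discordant pairs. A minimal stdlib sketch of the exact two-sided version (the variable names and the `0.05` threshold are illustrative; the paper does not specify its exact test implementation):

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar test on paired binary outcomes.

    b: count of questions correct in condition A but wrong in condition B
       (e.g. freeform-correct, structured-incorrect)
    c: count of the reverse discordant case
    Concordant pairs do not enter the statistic. Under the null, the
    b+c discordant pairs split Binomial(b+c, 0.5); the p-value is the
    doubled lower tail at min(b, c), capped at 1.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With, say, 10 freeform-correct/structured-incorrect flips and 0 in the other direction, the p-value is 2/1024, well under 0.05, so such a cell would be classified as showing prompt-induced degradation.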