ReFF: Reinforcing Format Faithfulness in Language Models across Varied Tasks

Jiashu Yao, Heyan Huang, Zeming Liu, Haoyu Wen, Wei Su, Boao Qian, Yuhang Guo

TL;DR

This work addresses the challenge of ensuring format-faithful outputs from LLMs across varied tasks by introducing FormatBench, a broad benchmark in which every task has a decidable format check. It proposes Reinforcing Format Faithfulness (ReFF), a reinforcement-learning–style adaptation that uses per-task format checkers as the reward signal, augmented with a KL penalty to control drift from the original model. Empirical results show that FormatBench is challenging for current models, yet ReFF substantially improves format faithfulness (e.g., from $21.6\%$ to $95.0\%$ on caption segmentation) while maintaining or even improving general quality, especially when combined with finetuning (e.g., F1 rising from $47.3$ to $61.6$). An interpretability analysis further explains how the method achieves the observed gains, including trade-offs between format adherence and semantic accuracy, and highlights cases like XDL where outputs can be well-formatted yet semantically misaligned. These findings underscore the practical importance of decidability-guided adaptation for robust, formatted outputs in real-world LLM deployments.
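As a rough illustration of the reward described above, the following sketch combines a checker verdict with a KL-style penalty. It is not the paper's implementation: the `+1`/`-1` reward values, the coefficient `beta`, the function name `kl_penalized_reward`, and the toy checker `ends_with_period` are all assumptions made for the example.

```python
from typing import Callable, Sequence


def kl_penalized_reward(
    response: str,
    checker: Callable[[str], bool],
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Format reward minus a KL-style penalty (illustrative values only)."""
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Assumed reward scheme: +1 if the checker accepts the response, -1 otherwise.
    format_reward = 1.0 if checker(response) else -1.0
    # Sample-based KL estimate: sum over response tokens of
    # log pi(y_t) - log pi_ref(y_t), using the adapted and original models.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return format_reward - beta * kl_estimate


def ends_with_period(text: str) -> bool:
    """Toy checker (hypothetical): the response must end with a period."""
    return text.strip().endswith(".")


if __name__ == "__main__":
    reward = kl_penalized_reward(
        "Done.", ends_with_period, [-1.0, -2.0], [-1.5, -2.5], beta=0.1
    )
    print(reward)
```

A larger `beta` keeps the adapted model closer to the original at the cost of a weaker push toward the format; the value used here is arbitrary.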

Abstract

Following formatting instructions to generate well-structured content is a fundamental yet often unmet capability for large language models (LLMs). To study this capability, which we refer to as format faithfulness, we present FormatBench, a comprehensive format-related benchmark. Compared to previous format-related benchmarks, FormatBench covers a greater variety of tasks in terms of application scenes (traditional NLP tasks, creative works, autonomous agency tasks), human-LLM interaction styles (single-turn instruction, multi-turn chat), and format types (inclusion, wrapping, length, coding). Moreover, each task in FormatBench comes with a format checker program. Extensive experiments on the benchmark reveal that state-of-the-art open- and closed-source LLMs still suffer from severe deficiencies in format faithfulness. By virtue of the decidable nature of formats, we propose to Reinforce Format Faithfulness (ReFF) to help LLMs generate formatted output as instructed without compromising general quality. Without any annotated data, ReFF can substantially improve the format faithfulness rate (e.g., from 21.6% in the original LLaMA3 to 95.0% on the caption segmentation task), while keeping the general quality comparable (e.g., from 47.3 to 46.4 in F1 score). Combined with labeled training data, ReFF can simultaneously improve both format faithfulness (e.g., from 21.6% in the original LLaMA3 to 75.5%) and general quality (e.g., from 47.3 to 61.6 in F1 score). We further offer an interpretability analysis to explain how ReFF improves both format faithfulness and general quality.
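To make the notions of a format checker program and a format faithfulness rate concrete, here is a minimal sketch. The segmentation rule it checks (one segment per line, a per-line length limit, segments that rejoin to the source text) is a hypothetical stand-in and not the actual CapSeg checker; the names `check_segments`, `faithfulness_rate`, and `max_chars` are likewise assumptions.

```python
from typing import Iterable


def check_segments(response: str, source: str, max_chars: int = 25) -> bool:
    """Hypothetical segmentation check: one segment per line, each at most
    max_chars long, and the segments must rejoin to the source text."""
    segments = response.split("\n")
    # Reject empty lines and lines that exceed the assumed length limit.
    if any(not seg.strip() or len(seg) > max_chars for seg in segments):
        return False
    # The segmentation may only insert line breaks, not change the words.
    return " ".join(seg.strip() for seg in segments) == source.strip()


def faithfulness_rate(results: Iterable[bool]) -> float:
    """Percentage of responses accepted by the checker."""
    verdicts = list(results)
    if not verdicts:
        return 0.0
    return 100.0 * sum(verdicts) / len(verdicts)


if __name__ == "__main__":
    source = "the quick brown fox jumps over the lazy dog"
    responses = [
        "the quick brown fox\njumps over the lazy dog",  # passes
        source,                                           # one line, too long
        "the quick brown fox\njumps over",               # drops words
    ]
    verdicts = [check_segments(r, source) for r in responses]
    print(verdicts, faithfulness_rate(verdicts))
```

Because such a check is a deterministic program, its verdict can be computed for any response without annotated data, which is what allows it to serve as a training signal.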

Paper Structure

This paper contains 78 sections, 3 equations, 7 figures, 10 tables, 1 algorithm.

Figures (7)

  • Figure 1: The overall framework of this work. The queries in FormatBench are forwarded to an LLM to generate corresponding responses, whose format correctness is labelled by a format checker. The queries, generated responses, and format labels are used in the ReFF process to iteratively obtain an adapted LLM with higher format faithfulness.
  • Figure 2: Tasks included in FormatBench with their corresponding groups and data sizes.
  • Figure 3: Conceptual contour map of format faithfulness and general quality. Inner circles indicate higher scores for both metrics. Solely improving format faithfulness (A $\rightarrow$ B) may result in an LLM with high format faithfulness but low general quality. ReFF can get the best of both worlds by combining finetuning (A $\rightarrow$ C) and reinforcement (C $\rightarrow$ D).
  • Figure 4: An instance of the XDL task (top), the corresponding response of ReFF-tst (left), and that of ReFF-tst-XDL, which obtains a higher format faithfulness rate on the XDL task (right). ReFF-tst-XDL generates syntactically correct but irrelevant code.
  • Figure 5: The format checker for the CapSeg task.
  • ...and 2 more figures