FMBench: Adaptive Large Language Model Output Formatting

Yaoting Wang, Yun Zhou, Henghui Ding

TL;DR

This work tackles the challenge of reliable Markdown formatting in LLM outputs, a critical aspect for real-world, tool-enabled workflows. It introduces FMBench, a Markdown-focused benchmark, and a lightweight alignment pipeline that combines supervised fine-tuning (SFT) with reinforcement learning fine-tuning (RLFT) using verifiable rewards to jointly optimize semantic fidelity and structural compliance, formalized as $r = \lambda_1 r_{sem} + \lambda_2 r_{struct}$ with $\lambda_1 = \lambda_2 = 1$. Across OpenPangu and Qwen families, SFT consistently improves semantic alignment, while RLFT provides additional gains in robustness to challenging Markdown instructions, with larger models achieving the best joint performance. The dataset comprises 1,100 samples (800 train, 300 test), and the work highlights a fundamental trade-off between content fidelity and formatting regularity, emphasizing reward design and validation as key to dependable formatted generation.
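As a rough illustration of the composite reward above, the following Python sketch combines two already-computed component scores with the stated weights $\lambda_1 = \lambda_2 = 1$. The function name, the parameter names, and the example score values are hypothetical; how FMBench actually computes $r_{sem}$ and $r_{struct}$ is not described in this summary.

```python
def composite_reward(r_sem: float, r_struct: float,
                     lambda_sem: float = 1.0,
                     lambda_struct: float = 1.0) -> float:
    """Weighted sum r = lambda_1 * r_sem + lambda_2 * r_struct.

    r_sem and r_struct are assumed to be precomputed scores; the
    defaults mirror lambda_1 = lambda_2 = 1 from the text.
    """
    return lambda_sem * r_sem + lambda_struct * r_struct


# Two hypothetical candidate outputs for the same prompt (made-up scores):
# one is semantically closer, the other is structurally cleaner.
faithful_but_messy = composite_reward(r_sem=0.75, r_struct=0.25)
tidy_but_lossy = composite_reward(r_sem=0.5, r_struct=1.0)

print(faithful_but_messy, tidy_but_lossy)
```

With equal weights, neither objective dominates: a response can raise its total reward by improving either component, which is one way the trade-off between content fidelity and formatting regularity can surface during optimization.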

Abstract

Producing outputs that satisfy both semantic intent and format constraints is essential for deploying large language models in user-facing and system-integrated workflows. In this work, we focus on Markdown formatting, which is ubiquitous in assistants, documentation, and tool-augmented pipelines but still prone to subtle, hard-to-detect errors (e.g., broken lists, malformed tables, inconsistent headings, and invalid code blocks) that can significantly degrade downstream usability. We present FMBench, a benchmark for adaptive Markdown output formatting that evaluates models under a wide range of instruction-following scenarios with diverse structural requirements. FMBench emphasizes real-world formatting behaviors such as multi-level organization, mixed content (natural language interleaved with lists/tables/code), and strict adherence to user-specified layout constraints. To improve Markdown compliance without relying on hard decoding constraints, we propose a lightweight alignment pipeline that combines supervised fine-tuning (SFT) with reinforcement learning fine-tuning. Starting from a base model, we first perform SFT on instruction-response pairs, and then optimize a composite objective that balances semantic fidelity with structural correctness. Experiments on two model families (OpenPangu and Qwen) show that SFT consistently improves semantic alignment, while reinforcement learning provides additional gains in robustness to challenging Markdown instructions when initialized from a strong SFT policy. Our results also reveal an inherent trade-off between semantic and structural objectives, highlighting the importance of carefully designed rewards for reliable formatted generation. Code is available at: https://github.com/FudanCVL/FMBench.
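Two of the error types named above, invalid code blocks and malformed tables, can be detected with simple line-based checks. The sketch below is a simplified illustration and not the FMBench validator: the function names are hypothetical, the fence rule is a reduced version of the usual Markdown convention, and the table check ignores escaped pipes and tables nested inside code blocks.

```python
import re

# A fence line: up to three spaces of indent, then 3+ backticks or tildes.
FENCE = re.compile(r"^\s{0,3}(`{3,}|~{3,})")


def has_unclosed_fence(markdown: str) -> bool:
    """Return True if a fenced code block is opened but never closed."""
    open_marker = None
    for line in markdown.splitlines():
        match = FENCE.match(line)
        if not match:
            continue
        marker = match.group(1)
        if open_marker is None:
            open_marker = marker  # opening fence (may carry an info string)
        elif (marker[0] == open_marker[0]
              and len(marker) >= len(open_marker)
              and line.strip() == marker):
            open_marker = None  # matching closing fence
    return open_marker is not None


def count_malformed_tables(markdown: str) -> int:
    """Count pipe-table blocks whose rows disagree on the number of cells."""
    bad = 0
    widths = []
    # The trailing empty string flushes a table that ends the document.
    for line in markdown.splitlines() + [""]:
        stripped = line.strip()
        if stripped.startswith("|"):
            widths.append(len(stripped.strip("|").split("|")))
        else:
            if widths and len(set(widths)) > 1:
                bad += 1
            widths = []
    return bad


fence = "`" * 3  # built programmatically to keep this example self-contained
broken_code = fence + "python\nprint(1)\n"
broken_table = "| a | b |\n|---|---|\n| 1 | 2 | 3 |"
print(has_unclosed_fence(broken_code), count_malformed_tables(broken_table))
```

Checks of this kind are deterministic, which makes them candidates for a verifiable structural signal, although a real validator would more likely rely on a full Markdown parser.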

Paper Structure

This paper contains 26 sections, 1 equation, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Overview of a four-stage pipeline for converting raw documents into structured data. (1) Document Crawling: Documents are collected from multiple domains, including academic, official, technical, legal, business, and educational sources. (2) Document Cleaning: Documents are cleaned and standardized across eight data categories to ensure quality and consistency. (3) LLM Formatting: LLMs apply three formatting rules to produce structured data at varying difficulty levels. (4) Human Refining: Human experts review and refine the outputs to correct grammar, improve style, and ensure structural compliance.
  • Figure 2: Distribution of key structural parameters in the FMBench dataset. The figure shows the marginal distributions of four core structural variables: section count, nested list depth, number of list items, and blockquote count, computed over the combined training and test sets. The distributions reveal a highly controlled design with symmetric or unimodal patterns, where structural complexity is primarily modulated by list-related constraints rather than by section or blockquote counts.
  • Figure 3: Training pipeline.
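The four structural variables named in the Figure 2 caption (section count, nested list depth, number of list items, and blockquote count) can be approximated from raw Markdown text. The Python sketch below is one possible way to do so, under stated assumptions: the function name, the two-space indent per nesting level, and the counting rules (one section per ATX heading, one blockquote per contiguous run of `>` lines) are hypothetical and may differ from how the FMBench authors computed these statistics. It also does not skip fenced code blocks.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+\S")                       # ATX headings
LIST_ITEM = re.compile(r"^(\s*)(?:[-*+]|\d+[.)])\s+\S")    # bullets, numbers
BLOCKQUOTE = re.compile(r"^\s{0,3}>")


def structural_stats(markdown: str, indent_width: int = 2) -> dict:
    """Approximate structural counts for a Markdown document."""
    sections = list_items = max_depth = blockquotes = 0
    in_quote = False
    for line in markdown.splitlines():
        if HEADING.match(line):
            sections += 1
        item = LIST_ITEM.match(line)
        if item:
            list_items += 1
            # Depth is inferred from leading indentation (assumed convention).
            indent = len(item.group(1).expandtabs(4))
            max_depth = max(max_depth, indent // indent_width + 1)
        is_quote = bool(BLOCKQUOTE.match(line))
        if is_quote and not in_quote:
            blockquotes += 1  # a new contiguous run of quoted lines
        in_quote = is_quote
    return {
        "sections": sections,
        "list_items": list_items,
        "max_list_depth": max_depth,
        "blockquotes": blockquotes,
    }


sample = "\n".join([
    "# Title",
    "## Part A",
    "- one",
    "  - two",
    "    - three",
    "> quote line 1",
    "> quote line 2",
    "",
    "> second quote",
    "1. numbered",
])
print(structural_stats(sample))
```

For the sample document, the sketch reports 2 sections, 4 list items, a maximum list depth of 3, and 2 blockquotes.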