Open-domain Implicit Format Control for Large Language Model Generation

Yiqun Yao, Wenjia Ma, Xuezhi Fang, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Jing Li, Aixin Sun, Yequan Wang

TL;DR

This work tackles open-domain format control in LLM generation by introducing Open-domain Implicit Format Control (OIFC), a decoding framework that enforces implicit output formats demonstrated by one-shot examples. It formalizes the approach with the equation $y = f(x; \{p, q_{\text{one-shot}}, r_{\text{one-shot}}\})$ and builds a data collection pipeline (OIFC-SFT) using diverse Chinese instruction sources to train models via supervised fine-tuning. An evaluation protocol measuring helpfulness and format correctness on in- and out-of-distribution sets shows that OIFC-SFT substantially improves format adherence with minimal impact on helpfulness across AF-7B and FLM-2-52B. The work provides publicly available datasets and code, advancing scalable, user-driven control of open-domain formats in LLM generation and offering a practical path to deployment-ready formatting controls.
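
As a rough illustration of the formalization above, the sketch below assembles the conditioning text that a model would receive alongside the user query x. It is not the authors' implementation: the function name `build_oifc_input`, the `OneShotExample` type, the section labels, and the reading of p as an instruction prompt are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OneShotExample:
    """Hypothetical container for the user-provided one-shot QA pair."""
    question: str  # plays the role of q_one-shot
    response: str  # plays the role of r_one-shot; its format is the implicit constraint


def build_oifc_input(x: str, p: str, example: OneShotExample) -> str:
    """Assemble the text conditioning y = f(x; {p, q_one-shot, r_one-shot}).

    The layout (labels, ordering, blank-line separators) is an assumption,
    not a template taken from the paper.
    """
    parts = [
        p.strip(),                                            # p: assumed instruction prompt
        "Example question:\n" + example.question.strip(),     # q_one-shot
        "Example answer:\n" + example.response.strip(),       # r_one-shot
        "Question:\n" + x.strip(),                            # x: the actual user query
        "Answer:",                                            # the model continues from here with y
    ]
    return "\n\n".join(parts)
```

The format requirement is never stated explicitly in this input; it is carried only by the example answer, which is what makes the control implicit.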

Abstract

Controlling the format of outputs generated by large language models (LLMs) is a critical functionality in various applications. Current methods typically employ constrained decoding with rule-based automata or fine-tuning with manually crafted format instructions, both of which struggle with open-domain format requirements. To address this limitation, we introduce a novel framework for controlled generation in LLMs, leveraging user-provided, one-shot QA pairs. This study investigates LLMs' capabilities to follow open-domain, one-shot constraints and replicate the format of the example answers. We observe that this is a non-trivial problem for current LLMs. We also develop a dataset collection methodology for supervised fine-tuning that enhances the open-domain format control of LLMs without degrading output quality, as well as a benchmark on which we evaluate both the helpfulness and format correctness of LLM outputs. The resulting datasets, named OIFC-SFT, along with the related code, will be made publicly available at https://github.com/cofe-ai/OIFC.

Paper Structure

This paper contains 11 sections, 3 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Data collection pipeline of OIFC-SFT.
  • Figure 2: Statistics for OIFC-SFT.