Benchmarking Large Language Models on Controllable Generation under Diversified Instructions

Yihan Chen, Benfeng Xu, Quan Wang, Yi Liu, Zhendong Mao

TL;DR

CoDI-Eval introduces a diversified-instruction benchmark for controllable text generation (CTG) that systematically evaluates LLMs' ability to follow constraints expressed in natural language. The approach combines instruction expansion and diversification to maximize coverage and generalization across five CTG tasks, including a multi-aspect variant, with automated, task-specific evaluation. Experiments across a wide range of open-source and commercial LLMs show that while commercial models achieve higher accuracy, substantial gaps remain, especially for multi-aspect and length-constrained generation, and that diversity-enhanced instructions improve evaluation robustness. The work highlights the importance of instruction diversity, automated evaluation reliability, and the persistent gap between open-source and closed-source models, offering a foundation for future advancements in controllable generation and LLM alignment.

Abstract

While large language models (LLMs) have exhibited impressive instruction-following capabilities, it is still unclear whether and to what extent they can respond to explicit constraints that might be entailed in various instructions. Because this is a significant aspect of LLM alignment, it is important to formulate such a specialized set of instructions and to investigate the resulting behavior of LLMs. To fill this gap, we propose a new benchmark, CoDI-Eval, to systematically and comprehensively evaluate LLMs' responses to instructions with various constraints. We construct a large collection of constraint-attributed instructions as a test suite focused on both generalization and coverage. Specifically, we advocate an instruction diversification process to synthesize diverse forms of constraint expression, and we carefully design the candidate task taxonomy with finer-grained sub-categories. Finally, we automate the entire evaluation process to facilitate further developments. Different from existing studies on controllable text generation, CoDI-Eval extends the scope to the prevalent instruction-following paradigm for the first time. We provide extensive evaluations of representative LLMs (e.g., ChatGPT, Vicuna) on CoDI-Eval, revealing their limitations in following instructions with specific constraints and showing that a significant gap remains between open-source and commercial closed-source LLMs. We believe this benchmark will facilitate research into improving the controllability of LLMs' responses to instructions. Our data and code are available at https://github.com/Xt-cyh/CoDI-Eval.
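
To make the idea of automated, constraint-specific evaluation concrete, the following is a minimal sketch of how a length constraint might be checked and aggregated into an accuracy score. It is not the CoDI-Eval evaluator: the `LengthConstraint` class, the `satisfies` and `accuracy` helpers, and the choice to count words by whitespace splitting are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class LengthConstraint:
    """Hypothetical word-count constraint; either bound may be left open."""
    min_words: Optional[int] = None
    max_words: Optional[int] = None


def satisfies(response: str, constraint: LengthConstraint) -> bool:
    """Return True if the response's word count lies within the bounds.

    Assumption: words are whitespace-separated tokens.
    """
    n_words = len(response.split())
    if constraint.min_words is not None and n_words < constraint.min_words:
        return False
    if constraint.max_words is not None and n_words > constraint.max_words:
        return False
    return True


def accuracy(responses: Sequence[str],
             constraints: Sequence[LengthConstraint]) -> float:
    """Fraction of responses that satisfy their paired constraint."""
    if len(responses) != len(constraints):
        raise ValueError("responses and constraints must have equal length")
    if not responses:
        return 0.0
    hits = sum(satisfies(r, c) for r, c in zip(responses, constraints))
    return hits / len(responses)
```

For example, an instruction such as "answer in at most three words" could be represented as `LengthConstraint(max_words=3)`, and an "exactly N words" instruction as equal lower and upper bounds. Other constraint types would need their own task-specific checkers.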

Paper Structure

This paper contains 57 sections, 3 equations, 19 figures, 14 tables.

Figures (19)

  • Figure 1: An illustration of our proposed benchmark, which includes diverse CTG instructions and can be used to evaluate whether large language models properly respond to the control constraints specified in the instructions.
  • Figure 2: Performance of typical LLMs on CoDI-Eval.
  • Figure 3: The base CTG tasks we select and their corresponding control attributes. Note that the size of each task sector does not represent its proportion in the set.
  • Figure 4: The framework for constructing the evaluation instruction sets. It consists of two steps: expansion and diversification.
  • Figure 5: An example of the zero-shot prompt. The black part is the prompt, while the green part is the output of the LLM.
  • ...and 14 more figures