StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs

Hailin Chen, Fangkai Jiao, Mathieu Ravaut, Nawshad Farruque, Xuan Phi Nguyen, Chengwei Qin, Manan Dey, Bosheng Ding, Caiming Xiong, Shafiq Joty, Yingbo Zhou

TL;DR

StructTest is a programmatically verifiable benchmark that evaluates how well LLMs follow compositional instructions requiring structured outputs, addressing the biases and data contamination common in existing benchmarks. By separating the Domain Task from the Format Rules and scoring with a deterministic rule-based evaluator, it enables scalable, contamination-resistant assessment across Summarization, Code, HTML, and Math, and is evaluated on 17 models. The results reveal persistent gaps even among top models, especially on Hard tasks, and show strong correlations with established benchmarks such as ChatBot Arena and MMLU, supporting StructTest as a practical proxy for general reasoning. Its design supports on-the-fly updates and extensibility, making it a useful complementary tool for robust, objective model evaluation in rapidly advancing LLM research.

Abstract

The rapid advancement of large language models (LLMs) demands robust, unbiased, and scalable evaluation methods. However, human annotations are costly to scale, model-based evaluations are susceptible to stylistic biases, and target-answer-based benchmarks are vulnerable to data contamination and cheating. To address these limitations, we propose StructTest, a novel benchmark that evaluates LLMs on their ability to follow compositional instructions and generate structured outputs, providing an unbiased, cost-effective, and difficult-to-cheat evaluation framework. Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets. By testing structured outputs across diverse domains including Summarization, Code, HTML, and Math, and evaluating 17 popular LLMs, we demonstrate that StructTest remains challenging even for top-performing models like DeepSeek-V3/R1 and GPT-4o, establishing it as a robust proxy for measuring reasoning capabilities. We believe StructTest offers a critical and complementary approach to achieving objective and comprehensive model evaluation.
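
To make the idea of a deterministic rule-based evaluator concrete, the following is a minimal sketch of what a format check for a bullet-point summarization task might look like. It is not the paper's evaluator: the function name `check_bullet_summary`, its parameters, and the specific rule (an exact number of bullets, each within a word limit) are assumptions chosen for illustration.

```python
import re


def check_bullet_summary(text: str, num_points: int, max_words_per_point: int) -> bool:
    """Hypothetical format rule: the output must consist of exactly
    `num_points` bullet lines, each with at most `max_words_per_point` words."""
    # Ignore blank lines so spacing between bullets does not affect the verdict.
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if len(lines) != num_points:
        return False
    for ln in lines:
        # A bullet line starts with '-', '*', or '•' followed by whitespace.
        match = re.fullmatch(r"[-*•]\s+(.+)", ln)
        if match is None:
            return False
        if len(match.group(1).split()) > max_words_per_point:
            return False
    return True


summary = "- LLM evaluation is hard.\n- Rule-based checks are deterministic."
print(check_bullet_summary(summary, num_points=2, max_words_per_point=6))
```

Because the check depends only on the output's structure and not on a reference answer, the same rule can be applied to any source document, which is the property that makes this style of evaluation hard to contaminate.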

Paper Structure

This paper contains 32 sections, 4 equations, 17 figures, 5 tables.

Figures (17)

  • Figure 1: Error rate of GPT-4o across different features of the Summarization Bullet Points+Length (Hard) task. As the number of required key points increases or specific organizational requirements are added, the error rate rises significantly.
  • Figure 2: Tag counts for correct vs. incorrect HTML generations (left) and error rate by binned total tag count (right) for GPT-4o on the Hard task.
  • Figure 3: Error rates of GPT-4o in GSM8K math reasoning across 20 Hard formats.
  • Figure 4: Comparison of StructTest average accuracy with ChatBot Arena scores and MMLU accuracy. ChatBot Arena results are current as of March 13th, 2025.
  • Figure 5: Test example for length task in Summarization.
  • ...and 12 more figures