StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs

Hailin Chen; Fangkai Jiao; Mathieu Ravaut; Nawshad Farruque; Xuan Phi Nguyen; Chengwei Qin; Manan Dey; Bosheng Ding; Caiming Xiong; Shafiq Joty; Yingbo Zhou

TL;DR

StructTest is a programmatically verifiable benchmark that evaluates how well LLMs follow compositional instructions to produce structured outputs, addressing the biases and data contamination common in existing benchmarks. By separating the Domain Task from the Format Rules and scoring with a deterministic rule-based evaluator, it enables scalable, contamination-resistant assessment across Summarization, Code, HTML, and Math, and is evaluated on 17 models. The results reveal persistent gaps even among top models, especially on Hard tasks, and show strong correlations with established benchmarks such as Chatbot Arena and MMLU, supporting StructTest as a practical proxy for general reasoning. Its design supports on-the-fly updates and extensibility, making it a useful complement to existing evaluations in fast-moving LLM research.

Abstract

The rapid advancement of large language models (LLMs) demands robust, unbiased, and scalable evaluation methods. However, human annotations are costly to scale, model-based evaluations are susceptible to stylistic biases, and target-answer-based benchmarks are vulnerable to data contamination and cheating. To address these limitations, we propose StructTest, a novel benchmark that evaluates LLMs on their ability to follow compositional instructions and generate structured outputs, providing an unbiased, cost-effective, and difficult-to-cheat evaluation framework. Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets. By testing structured outputs across diverse domains including Summarization, Code, HTML, and Math, and evaluating 17 popular LLMs, we demonstrate that StructTest remains challenging even for top-performing models like DeepSeek-V3/R1 and GPT-4o, establishing it as a robust proxy for measuring reasoning capabilities. We believe StructTest offers a critical and complementary approach to achieving objective and comprehensive model evaluation.
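
To make the idea of a deterministic, rule-based check concrete, here is a minimal sketch in Python. It is not the StructTest evaluator: the format rule (a fixed number of "- " bullets, each under a word limit) and the function name `check_bullet_summary` are hypothetical, chosen only to illustrate how a format rule can be verified without a reference answer or a judge model.

```python
import re


def check_bullet_summary(output: str, num_bullets: int, max_words: int) -> bool:
    """Hypothetical format rule: exactly `num_bullets` non-empty lines,
    each starting with "- " and containing at most `max_words` words."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    if len(lines) != num_bullets:
        return False
    for ln in lines:
        # Each line must be "- " followed by non-empty text.
        match = re.fullmatch(r"- (\S.*)", ln)
        if match is None:
            return False
        if len(match.group(1).split()) > max_words:
            return False
    return True


# Illustrative model outputs (made up for this sketch).
good = "- StructTest checks output format.\n- Scoring is rule-based."
bad = "StructTest checks output format. Scoring is rule-based."

print(check_bullet_summary(good, num_bullets=2, max_words=8))
print(check_bullet_summary(bad, num_bullets=2, max_words=8))
```

Because the check depends only on the output's structure, the same rule can be applied to any new source document, which is one way a benchmark of this kind could be refreshed without collecting new target answers.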
