Complex Logical Instruction Generation

Mian Zhang, Shujian Liu, Sixun Dong, Ming Yin, Yebowen Hu, Xun Wang, Steven Ma, Song Wang, Sathish Reddy Indurthi, Haoyun Deng, Zhiyu Zoey Chen, Kaiqiang Song

TL;DR

This work tackles the evaluation gap in instruction-following when tasks require intricate logic by introducing LogicIFGen, a scalable framework that converts code functions into verifiable, logic-rich natural-language instructions, and LogicIFEval, a 426-task benchmark built from challenging simulation problems. The approach anonymizes functions, augments them with state trackers, and uses multi-turn generation and verification to ensure instructions precisely implement the full underlying logic, with complexity quantified via AST-based metrics. Experimental results reveal a substantial performance gap among both frontier and open-source LLMs, with best models around 85% accuracy while many lag below 60%, and a clear degradation as logic complexity increases. The findings suggest explicit thinking can improve instruction-following for large models and point to future work where LogicIFGen could support training and evaluation to build more robust, logic-aware agents and tools.
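To make "complexity quantified via AST-based metrics" concrete, here is a minimal sketch using Python's standard `ast` module. This summary does not list the exact metrics the authors use, so the three counts and the nesting-depth measure below, along with the names `logic_counts`, `max_nesting_depth`, and `EXAMPLE`, are illustrative assumptions rather than the paper's definitions.

```python
import ast

# Hypothetical seed function, used only to exercise the metrics below.
EXAMPLE = '''
def f(xs):
    total = 0
    for x in xs:
        if x % 2 == 0:
            total += abs(x)
    return total
'''

def logic_counts(source: str) -> dict:
    """Count branches, loops, and calls in a piece of Python source."""
    counts = {"branches": 0, "loops": 0, "calls": 0}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.If):
            counts["branches"] += 1
        elif isinstance(node, (ast.For, ast.While)):
            counts["loops"] += 1
        elif isinstance(node, ast.Call):
            counts["calls"] += 1
    return counts

def max_nesting_depth(node: ast.AST, depth: int = 0) -> int:
    """Return the deepest nesting of if/for/while blocks under `node`."""
    nested = (ast.If, ast.For, ast.While)
    best = depth
    for child in ast.iter_child_nodes(node):
        child_depth = depth + 1 if isinstance(child, nested) else depth
        best = max(best, max_nesting_depth(child, child_depth))
    return best

print(logic_counts(EXAMPLE))
print(max_nesting_depth(ast.parse(EXAMPLE)))
```

Scores of this kind could be used to bucket tasks by difficulty, which is the sort of grouping needed to observe accuracy dropping as logic complexity rises.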

Abstract

Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tasks grow more challenging, the logic structures embedded in natural language instructions become increasingly intricate. However, how well LLMs perform on such logic-rich instructions remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a scalable, automated framework for generating verifiable instructions from code functions, which can naturally express rich logic such as conditions, loops, and function calls. We further curate a collection of complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark comprising 426 verifiable logic-rich instructions. Our experiments demonstrate that current state-of-the-art LLMs still struggle to correctly follow the instructions in LogicIFEval. Most LLMs can only follow fewer than 60% of the instructions, revealing significant deficiencies in their instruction-following ability. Code and Benchmark: https://github.com/mianzhang/LogicIF

Paper Structure

This paper contains 32 sections, 22 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: (Left) Instruction Following Test: LLMs are required to follow only the natural language instruction to simulate every logic of a code function via generating text. (Right) Overall instruction-following performance of evaluated models on LogicIFEval; NT denotes NoThinking. DS denotes DeepSeek.
  • Figure 2: Pipeline of LogicIFGen. Given a seed function and its corresponding test cases, LogicIFGen generates natural language instructions along with gold labels, which include both the function outputs and the values of state trackers. 1) The input function is first anonymized and augmented with state trackers. 2) The anonymized function is then translated into a natural language description, producing an instruction that precisely describes its logic and expected behavior, with test cases verified to have no execution errors. 3) Finally, the test cases are executed on the anonymized function to obtain the gold labels.
  • Figure 3: Error Type Distribution in Test Cases
  • Figure 4: Error distribution across complexity intervals. Blue points represent average error counts across 8 models at each complexity interval, with red lines indicating linear trends. The top-left panel shows all error types combined, while the other panels show individual failure modes.
  • Figure 5: Error Cases: On the left are the excerpts from function codes where the model makes errors. On the right are excerpts from the LLMs’ responses, highlighting their failures across different modes. The explanations for model failures are indicated in red, and the corresponding code lines are highlighted. Please note that the model only has access to the natural language instruction, which could correctly describe the logic, when solving the tasks; the code is provided here solely to facilitate understanding of the errors.
  • ...and 5 more figures
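
The pipeline described in the Figure 2 caption (anonymize, add state trackers, execute test cases to obtain gold labels) can be illustrated with a small sketch. Everything here is hypothetical: the function `f`, the tracker names `iterations` and `even_hits`, and the helpers `gold_labels` and `is_followed` are assumptions made for illustration and are not taken from the LogicIF repository.

```python
def f(a):
    """Hypothetical anonymized function augmented with state trackers."""
    x = 0  # function output: sum of the even values
    y = 0  # state tracker: number of loop iterations
    z = 0  # state tracker: number of times the even branch is taken
    for v in a:
        y += 1
        if v % 2 == 0:
            x += v
            z += 1
    return {"output": x, "stats": {"iterations": y, "even_hits": z}}

def gold_labels(func, test_cases):
    """Execute each test case on the function to obtain its gold label."""
    return [func(*case) for case in test_cases]

def is_followed(response, gold):
    """Assumed check: both the output and every tracker value must match."""
    return response == gold

cases = [([1, 2, 3, 4],), ([],)]
labels = gold_labels(f, cases)

# A response with the right output but a wrong tracker value is rejected.
wrong = {"output": 6, "stats": {"iterations": 4, "even_hits": 3}}
print(is_followed(labels[0], labels[0]), is_followed(wrong, labels[0]))
```

Checking tracker values in addition to the final output means a model cannot pass by guessing the result; it has to follow the intermediate logic that the instruction describes.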