Beyond Instruction Following: Evaluating Inferential Rule Following of Large Language Models

Wangtao Sun, Chenxiang Zhang, XueYou Zhang, Xuanqing Yu, Ziyang Huang, Pei Chen, Haotian Xu, Shizhu He, Jun Zhao, Kang Liu

TL;DR

This work defines inferential rule-following as a distinct capability from instruction-following and introduces RuleBench, a multi-domain benchmark to evaluate LLMs' ability to trigger and apply abstract rules. It analyzes diverse open- and closed-source LLMs across rule settings, revealing that current models struggle with inferential rules, especially under noise and counterfactual scenarios, and that CoT alone is insufficient for reliable rule application. The authors propose Inferential Rule-Following Tuning (IRFT) using synthetic data (StringGame) to teach models to identify and execute the correct rule, achieving broad improvements without sacrificing standard instruction-following. Together, RuleBench and IRFT establish a framework for measuring and enhancing a crucial cognitive skill for safer, more capable AI agents.

Abstract

Although Large Language Models (LLMs) have demonstrated strong ability, they are further expected to be controlled and guided by rules in real-world scenarios to be safe, accurate, and intelligent. This requires LLMs to possess inferential rule-following capability. However, no prior work has clearly evaluated this capability: previous studies that attempt to do so fail to distinguish inferential rule-following scenarios from instruction-following scenarios. Therefore, this paper first clarifies the concept of inferential rule-following and proposes a comprehensive benchmark, RuleBench, to evaluate a diverse range of inferential rule-following abilities. Our experimental results on a variety of LLMs show that they are still limited in following rules. Our analysis of the evaluation results offers insights into how LLMs can be improved toward better inferential rule-following intelligent agents. We further propose Inferential Rule-Following Tuning (IRFT). The experimental results show that through IRFT, LLMs can learn abstract rule-following abilities from purely synthetic data and then generalize to RuleBench. The data and code can be found at: https://anonymous.4open.science/r/llm-rule-following-B3E3/
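
To make the distinction between triggering a rule and applying it more concrete, here is a minimal Python sketch. It is not the authors' code and not the StringGame data: the `Rule` class, the `trigger` and `follow` helpers, and the two toy string rules are all hypothetical names invented for illustration. An instruction-following task would hand the model a single command to carry out. Here the system receives several candidate rules, must first find the one whose premise matches the input, and only then applies it.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    """A toy inferential rule: if `premise` holds for the input, `conclusion` yields the answer."""
    name: str
    premise: Callable[[str], bool]
    conclusion: Callable[[str], str]


def trigger(rules: list[Rule], question: str) -> Optional[Rule]:
    """Triggering step: return the first rule whose premise matches, or None."""
    for rule in rules:
        if rule.premise(question):
            return rule
    return None


def follow(rules: list[Rule], question: str) -> Optional[str]:
    """Applying step: run the triggered rule's conclusion on the question."""
    rule = trigger(rules, question)
    return rule.conclusion(question) if rule else None


# Hypothetical rule set; for any given input, at most one rule is relevant
# and the other acts as noise.
RULES = [
    Rule("reverse_if_starts_with_a", lambda s: s.startswith("a"), lambda s: s[::-1]),
    Rule("upper_if_ends_with_z", lambda s: s.endswith("z"), lambda s: s.upper()),
]

print(follow(RULES, "abc"))    # cba
print(follow(RULES, "xyz"))    # XYZ
print(follow(RULES, "hello"))  # None (no rule is triggered)
```

In this toy setting the premises are exact predicates, so triggering is trivial. The difficulty the benchmark targets arises when a language model has to decide from natural-language rules which one, if any, is relevant.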

Paper Structure

This paper contains 28 sections, 1 equation, 20 figures, and 5 tables.

Figures (20)

  • Figure 1: Beyond instruction-following, the task of inferential rule-following requires the language model to trigger the relevant rule based on the current question and apply that rule to the question for reasoning.
  • Figure 2: The inferential rule-following capabilities of some open-source and closed-source LLMs. These capabilities are categorized into five dimensions: Triggering Rules, Applying Rules, Executing Rules, Following Formal Rules, and Following Counterfactual Rules.
  • Figure 3: The different settings evaluated in RuleBench, including rule quantities, rule forms, Chain-of-Thought (CoT) in applying rules, counterfactual rules, and behavior analysis.
  • Figure 4: The inferential rule-following performance of LLMs under different rule quantities.
  • Figure 5: The inferential rule-following performance of LLMs when applying rules with or without using Chain-of-Thought. The dashed box indicates the improvement (positive or negative) of w/ CoT over w/o CoT.
  • ...and 15 more figures