LexInstructEval: Lexical Instruction Following Evaluation for Large Language Models
Huimin Ren, Yan Liang, Baiqiao Su, Chaobo Sun, Hengtong Lu, Kaike Zhang, Chen Wei
TL;DR
LexInstructEval tackles the challenge of evaluating fine-grained lexical instruction following in LLMs by introducing a formal grammar that encodes instructions as the $\langle \texttt{Procedure, Relation, Value} \rangle$ triplet and a transparent automated verification engine. It provides a bilingual English-Chinese benchmark assembled via a four-stage data-construction pipeline with human validation, along with open-source tooling for objective adjudication based on rigid, universal-quantification checks. Experimental results across diverse open- and closed-source models show that instruction depth, not just quantity, governs performance. The results also reveal noticeable cross-lingual gaps, and the automated evaluation shows high agreement with expert judgments. The framework offers a scalable, low-cost means to benchmark and guide improvements in the controllability and reliability of LLMs, including precise instruction adherence and multilingual evaluation.
Abstract
The ability of Large Language Models (LLMs) to precisely follow complex and fine-grained lexical instructions is a cornerstone of their utility and controllability. However, evaluating this capability remains a significant challenge. Current methods either rely on subjective and costly human evaluation or on automated LLM-as-a-judge systems, which suffer from inherent biases and unreliability. Existing programmatic benchmarks, while objective, often lack the expressiveness to test intricate, compositional constraints at a granular level. To address these limitations, we introduce LexInstructEval, a new benchmark and evaluation framework for fine-grained lexical instruction following. Our framework is built upon a formal, rule-based grammar that deconstructs complex instructions into a canonical <Procedure, Relation, Value> triplet. This grammar enables the systematic generation of a diverse dataset through a multi-stage, human-in-the-loop pipeline and facilitates objective verification via a transparent, programmatic engine. We release our dataset and open-source evaluation tools to facilitate further research into the controllability and reliability of LLMs.
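To make the triplet idea concrete, the following is a minimal sketch of how a <Procedure, Relation, Value> constraint might be checked programmatically. It is not the released LexInstructEval engine: the names `Constraint`, `PROCEDURES`, `RELATIONS`, `split_sentences`, and `verify`, as well as the two example procedures, are hypothetical and chosen only for illustration. The sketch assumes a procedure extracts one or more measurements from a response and that a constraint passes only if every measurement satisfies the relation against the value (a universal-quantification check).

```python
import operator
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter on ., ! or ? followed by whitespace (illustrative only)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Hypothetical procedures: each maps a response to a list of integer measurements.
PROCEDURES: Dict[str, Callable[[str], List[int]]] = {
    "words_per_sentence": lambda text: [len(s.split()) for s in split_sentences(text)],
    "total_words": lambda text: [len(text.split())],
}

# Hypothetical relations: comparison operators applied as relation(measurement, value).
RELATIONS: Dict[str, Callable[[int, int], bool]] = {
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
}


@dataclass(frozen=True)
class Constraint:
    """One lexical instruction encoded as a (procedure, relation, value) triplet."""
    procedure: str
    relation: str
    value: int


def verify(response: str, constraints: List[Constraint]) -> bool:
    """Return True only if every measurement of every constraint satisfies its relation."""
    for c in constraints:
        measurements = PROCEDURES[c.procedure](response)
        compare = RELATIONS[c.relation]
        # Universal quantification: a single violating measurement fails the constraint.
        if not all(compare(m, c.value) for m in measurements):
            return False
    return True


response = "The cat sat. It purred loudly today."
passes = verify(response, [Constraint("words_per_sentence", "<=", 4),
                           Constraint("total_words", "==", 7)])
fails = verify(response, [Constraint("words_per_sentence", "<=", 3)])
```

In this sketch, `passes` is true because both sentences contain at most four words and the response has seven words in total, while `fails` is false because the second sentence has four words. Keeping procedures and relations in lookup tables is one way to keep the verdict transparent: each decision can be traced to a named measurement and a single comparison.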
