LexInstructEval: Lexical Instruction Following Evaluation for Large Language Models

Huimin Ren; Yan Liang; Baiqiao Su; Chaobo Sun; Hengtong Lu; Kaike Zhang; Chen Wei

TL;DR

LexInstructEval tackles the challenge of evaluating fine-grained lexical instruction following in LLMs by introducing a formal grammar that encodes each instruction as a $\langle \texttt{Procedure, Relation, Value} \rangle$ triplet, together with a transparent automated verification engine. It provides a bilingual English-Chinese benchmark assembled via a four-stage data-construction pipeline with human validation, along with open-source tooling for objective adjudication based on rigid, universal-quantification checks. Experiments across diverse open- and closed-source models show that instruction depth, not just quantity, governs performance, with noticeable cross-lingual gaps and high agreement with expert judgments. The framework offers a scalable, low-cost means to benchmark and guide improvements in the controllability and reliability of LLMs, and to inform future work on precise instruction adherence and multilingual evaluation.

Abstract

The ability of Large Language Models (LLMs) to precisely follow complex and fine-grained lexical instructions is a cornerstone of their utility and controllability. However, evaluating this capability remains a significant challenge. Current methods either rely on subjective and costly human evaluation or on automated LLM-as-a-judge systems, which suffer from inherent biases and unreliability. Existing programmatic benchmarks, while objective, often lack the expressiveness to test intricate, compositional constraints at a granular level. To address these limitations, we introduce LexInstructEval, a new benchmark and evaluation framework for fine-grained lexical instruction following. Our framework is built upon a formal, rule-based grammar that deconstructs complex instructions into a canonical <Procedure, Relation, Value> triplet. This grammar enables the systematic generation of a diverse dataset through a multi-stage, human-in-the-loop pipeline and facilitates objective verification via a transparent, programmatic engine. We release our dataset and open-source evaluation tools to facilitate further research into the controllability and reliability of LLMs.
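To make the triplet idea concrete, the following is a minimal sketch of how a <Procedure, Relation, Value> constraint might be checked programmatically. It is not the released LexInstructEval tooling: the procedure names (`words_per_sentence`, `total_words`), the `Constraint` class, and the `verify` functions are hypothetical, and the sentence splitter is deliberately naive. The sketch assumes a procedure extracts a list of measurements from a response, and that a constraint passes only if every measurement satisfies the relation against the value, which is one reading of a universal-quantification check.

```python
import operator
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


def words_per_sentence(text: str) -> List[int]:
    """Hypothetical procedure: word count of each sentence (naive splitter)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]


def total_words(text: str) -> List[int]:
    """Hypothetical procedure: a single measurement for the whole response."""
    return [len(text.split())]


# Illustrative registries; the real grammar's vocabulary is not shown here.
PROCEDURES: Dict[str, Callable[[str], List[int]]] = {
    "words_per_sentence": words_per_sentence,
    "total_words": total_words,
}
RELATIONS: Dict[str, Callable[[int, int], bool]] = {
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
}


@dataclass(frozen=True)
class Constraint:
    """One instruction in <Procedure, Relation, Value> form."""
    procedure: str
    relation: str
    value: int


def verify(constraint: Constraint, response: str) -> bool:
    """Pass only if every extracted measurement satisfies the relation."""
    measured = PROCEDURES[constraint.procedure](response)
    compare = RELATIONS[constraint.relation]
    # An empty extraction is treated as a failure rather than vacuously true.
    return bool(measured) and all(compare(m, constraint.value) for m in measured)


def verify_all(constraints: List[Constraint], response: str) -> bool:
    """A response follows a compound instruction only if all constraints pass."""
    return all(verify(c, response) for c in constraints)


response = "The cat sat. It purred loudly today."
constraints = [
    Constraint("words_per_sentence", "<=", 4),
    Constraint("total_words", "==", 7),
]
result = verify_all(constraints, response)
```

Under these assumptions, a single sentence that exceeds the limit fails the whole constraint, which is what makes the check rigid rather than a graded score.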
