Do LLMs "know" internally when they follow instructions?

Juyeon Heo, Christina Heinze-Deml, Oussama Elachqar, Kwan Ho Ryan Chan, Shirley Ren, Udhay Nallasamy, Andy Miller, Jaya Narain

TL;DR

The paper investigates whether LLM internal representations encode a measurable signal for instruction-following success. By applying linear probes to IFEval-simple data, it identifies an instruction-following dimension in the input embedding space that generalizes across unseen tasks but not unseen instruction types. The authors validate this dimension through representation engineering, moving failure cases toward success by adjusting representations along this direction without degrading output quality, and find the dimension is closely tied to prompt phrasing rather than task difficulty. These findings illuminate the internal mechanics of instruction adherence and suggest practical avenues for building more reliable LLM agents through targeted prompting and representation-level interventions.

Abstract

Instruction-following is crucial for building AI agents with large language models (LLMs), as these models must adhere strictly to user-provided constraints and guidelines. However, LLMs often fail to follow even simple and clear instructions. To improve instruction-following behavior and prevent undesirable outputs, a deeper understanding of how LLMs' internal states relate to these outcomes is required. In this work, we investigate whether LLMs encode information in their representations that correlates with instruction-following success, a property we term "knowing internally". Our analysis identifies a direction in the input embedding space, termed the instruction-following dimension, that predicts whether a response will comply with a given instruction. We find that this dimension generalizes well across unseen tasks but not across unseen instruction types. We demonstrate that modifying representations along this dimension improves instruction-following success rates compared to random changes, without compromising response quality. Further investigation reveals that this dimension is more closely related to the phrasing of prompts than to the inherent difficulty of the task or instructions. This work provides insight into the internal workings of LLMs' instruction-following, paving the way for reliable LLM agents.
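
The two steps named in the abstract, fitting a linear probe on representations and then moving a representation along the learned direction, can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the vectors are synthetic stand-ins for real model representations, the probe is a plain logistic regression trained by gradient descent, the update is a simple additive shift, and the names `train_linear_probe`, `shift_along`, and `alpha` are hypothetical rather than taken from the paper.

```python
import math
import random


def train_linear_probe(X, y, lr=0.5, epochs=300):
    """Fit a logistic-regression probe by batch gradient descent; return (w, b)."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, t in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - t  # predicted prob minus label
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_score(w, b, x):
    """Signed probe output; a positive value predicts 'success'."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def unit(v):
    """Normalize a vector to unit length."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [vi / norm for vi in v]


def shift_along(x, direction, alpha):
    """Assumed additive update: move x by alpha along a unit direction."""
    return [xi + alpha * di for xi, di in zip(x, direction)]


# Synthetic stand-in data: 'success' and 'failure' vectors differ along axis 0.
random.seed(0)
DIM = 4


def sample(center):
    return [random.gauss(center if j == 0 else 0.0, 0.3) for j in range(DIM)]


X_train = [sample(1.0) for _ in range(50)] + [sample(-1.0) for _ in range(50)]
y_train = [1] * 50 + [0] * 50
X_test = [sample(1.0) for _ in range(20)] + [sample(-1.0) for _ in range(20)]
y_test = [1] * 20 + [0] * 20

w, b = train_linear_probe(X_train, y_train)
accuracy = sum(
    (probe_score(w, b, x) > 0) == bool(t) for x, t in zip(X_test, y_test)
) / len(X_test)

# Treat the normalized probe weights as the candidate direction and shift one
# synthetic 'failure' vector toward the 'success' side.
direction = unit(w)
failure = X_test[-1]
original_score = probe_score(w, b, failure)
shifted_score = probe_score(w, b, shift_along(failure, direction, alpha=2.0))
```

In this toy setting the shift raises the probe score by construction, since the direction is the probe's own weight vector. Whether a shifted representation actually yields a compliant response in a real model is an empirical question that the sketch does not address.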

Paper Structure

This paper contains 23 sections, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Overview of our paper. Left: Success and failure cases in a personalized AI fitness planner. The task is to generate a warm-up plan while avoiding knee-required positions. The success case follows the instruction, while the failure case violates it. Middle: Linear probing is applied to analyze internal representations from success and failure cases, identifying the instruction-following dimension. The probe is tested on unseen tasks (e.g., writing a CV) and instruction types (e.g., include/exclude keywords). Right: Representation engineering is used to shift failure cases into success by adjusting the representations along the instruction-following dimension, improving adherence without compromising task quality.
  • Figure 2: PCA plot of first-token representations from early layers across four LLMs. PCA is fitted on the training split and visualized on the test split (unseen tasks). The PCA shows separability, suggesting the consistent capture of the instruction-following dimension across tasks. The analysis includes three instruction types from the keyword category in IFEval-simple. Additional PCA results for all five instruction types across different categories are provided in Appendix Figure \ref{fig:pca_appendix}.
  • Figure 3: Transition metrics for Representation Engineering on the last layer of four models. Success rate (SR) is reported only on responses with high task-execution quality (scored above 7 by GPT-4 on a scale from 0 to 9). The success conversion ratio (SCR) is the proportion of originally failed responses that became successful after modification, while the success preservation ratio (SPR) is the proportion of originally successful responses that remained successful.
  • Figure 4: Cosine similarity alignment for modified data in the 'forbidden keyword' instruction type across two models (Llama-2-7b-chat (Left) and Llama-2-13b-chat (Right)). The figure shows the cosine similarity between the instruction-following dimension and the difference vector (computed as the difference between the original prompt's representation and the average representation of five modified prompts) across 20 sampled prompts. Modifications include changes in task familiarity, instruction difficulty, and phrasing. The results indicate that phrasing modifications align more closely with the instruction-following dimension, suggesting that how prompts are phrased plays a crucial role in determining instruction adherence.
  • Figure 5: RE example. An illustrative example of modified responses. In this case, the task was to write a resume with the instruction to include three specific keywords. The original response only included one keyword, whereas the modified response, guided by the instruction-following direction, successfully incorporated all three keywords, demonstrating the effectiveness of RE in enhancing instruction adherence.
  • ...and 1 more figure
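
The quantities defined in the captions for Figures 3 and 4 can be written out as a short sketch. It assumes per-response boolean success labels before and after modification, and representations given as plain lists of floats. The function names are hypothetical, and the sign convention of the difference vector (original minus the mean of the modified prompts) simply follows the order of words in the Figure 4 caption.

```python
import math


def success_conversion_ratio(before, after):
    """SCR: share of originally failed responses that succeed after modification."""
    outcomes = [a for b, a in zip(before, after) if not b]
    return sum(outcomes) / len(outcomes) if outcomes else float("nan")


def success_preservation_ratio(before, after):
    """SPR: share of originally successful responses that remain successful."""
    outcomes = [a for b, a in zip(before, after) if b]
    return sum(outcomes) / len(outcomes) if outcomes else float("nan")


def difference_vector(original, modified):
    """Original representation minus the mean of the modified representations."""
    mean = [sum(column) / len(modified) for column in zip(*modified)]
    return [o - m for o, m in zip(original, mean)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return dot / (norm_u * norm_v)


# Made-up labels for five responses, purely to exercise the functions.
before = [True, True, False, False, False]
after = [True, False, True, True, False]
scr = success_conversion_ratio(before, after)    # 2 of 3 failures converted
spr = success_preservation_ratio(before, after)  # 1 of 2 successes preserved

# Made-up 2-D representations: one original prompt and two modified versions.
diff = difference_vector([2.0, 0.0], [[1.0, 0.0], [1.0, 0.0]])
alignment = cosine_similarity(diff, [1.0, 0.0])
```

A high SCR together with a low SPR would mean the modification fixes failures while breaking responses that already complied, so the two ratios are best read together.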