Pay Attention to What Matters

Pedro Luiz Silva; Antonio de Domenico; Ali Maatouk; Fadhel Ayed

TL;DR

This work introduces GUIDE, a simple and effective method that mechanistically increases attention scores on instruction tokens, and presents Influence, a novel metric that highlights how the user's instructions propagate through the transformer layers and impact the LLM output.

Abstract

Despite the remarkable success of Large Language Models (LLMs), they still exhibit a limited capability to align their outputs with user instructions. In this work, we introduce a simple and effective method, which we name GUIDE, that mechanistically increases attention scores on instruction tokens. To support this operation, we present Influence, a novel metric that highlights how the user's instructions propagate through the transformer layers and impact the LLM output. Our results show that GUIDE improves the accuracy of following instructions 29.4% to 60.4%, outperforming natural prompting alternatives and Supervised Fine-Tuning up to 1M tokens.
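The abstract describes GUIDE only as increasing attention scores on instruction tokens. As a minimal sketch of what that could look like, the snippet below assumes an additive bias (called `delta` here, echoing the $\Delta$ mentioned in the Figure 3 caption) applied to the pre-softmax scores of the instruction positions for a single query. The function names, the toy scores, and the choice of `delta = 2.0` are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(logits, instruction_positions, delta):
    """Add `delta` to the pre-softmax scores of the instruction tokens,
    then normalize. Assumed form of the bias, for illustration only."""
    boosted = [
        score + delta if i in instruction_positions else score
        for i, score in enumerate(logits)
    ]
    return softmax(boosted)


# Toy example: one query attending over five context tokens.
# Positions 0 and 1 are (hypothetically) the instruction tokens.
logits = [0.2, 0.1, 1.5, 0.7, 1.1]
instruction = {0, 1}

baseline = biased_attention(logits, instruction, delta=0.0)
guided = biased_attention(logits, instruction, delta=2.0)

mass_before = sum(baseline[i] for i in instruction)
mass_after = sum(guided[i] for i in instruction)
print(f"attention mass on instruction tokens: {mass_before:.3f} -> {mass_after:.3f}")
```

Because the softmax renormalizes, any attention mass gained by the instruction tokens is taken from the remaining context tokens, which is why the size of the bias would need calibrating rather than being set arbitrarily large.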

Paper Structure (27 sections, 13 equations, 7 figures, 5 tables)

Introduction
Related work
GUIDE: (Guided Understanding with Instruction-Driven Enhancements)
Description of the method
Calibrating GUIDE
Influence
Initialization:
Propagation Rules:
Experiments
Description
Summarization in French
JSON Generation
Influence
Results
Summarization in French
...and 12 more sections

Figures (7)

Figure 1: Schema of PayAttentionPipeline.
Figure 2: (a) Distribution of the ratio between the norms of token embeddings before and after attention; (b) Attention rollout ($\text{R}_{\mathcal{U}}(E_k^{(\ell)})$) and Influence ($\text{I}_{\mathcal{U}}(E_k^{(\ell)})$) trends in log scale over context length ($k$) in intermediate and final layers ($\ell = 16$ and $\ell = 32$). The instruction tokens $\mathcal{U}$ were placed at the beginning of the prompt.
Figure 3: Log of the influence across different layers. This illustrates that with an appropriately chosen $\Delta$, GUIDE can effectively replicate, and even further amplify, semantically intuitive instructions, like using uppercase text.
Figure 4: Summarization results: (a) GUIDE outperforms prompt engineering techniques like using uppercase text, and (b) GUIDE demonstrates greater accuracy than SFT up to 1 million training tokens.
Figure 5: Heatmap of scores in a needle in haystack test.
...and 2 more figures
