Addressing Order Sensitivity of In-Context Demonstration Examples in Causal Language Models

Yanzheng Xiang, Hanqi Yan, Lin Gui, Yulan He

TL;DR

The paper shows that the autoregressive attention mask in causal language models (CausalLMs) yields position-dependent representations of in-context examples, which makes performance depend heavily on the order of the demonstrations. It introduces InfoAC, an unsupervised LoRA-based fine-tuning framework with two losses: Information Augmentation, a contrastive objective that aligns the representation of an in-context example at any position with its representation when placed at the end of the demonstrations, and Consistency Enhancement, a loss that enforces consistent predictions across permutations. Evaluations on five benchmarks (SST-5, SST-2, QQP, Sequence Next Term, Round Number) and multiple backbones indicate that InfoAC reduces order sensitivity and generalizes across different candidate pools and varying numbers of demonstrations. The method targets more reliable, permutation-robust in-context learning with CausalLMs.

Abstract

In-context learning has become a popular paradigm in natural language processing. However, its performance can be significantly influenced by the order of in-context demonstration examples. In this paper, we find that causal language models (CausalLMs) are more sensitive to this order than prefix language models (PrefixLMs). We attribute this phenomenon to the auto-regressive attention masks within CausalLMs, which restrict each token from accessing information from subsequent tokens. This results in different receptive fields for samples at different positions, thereby leading to representation disparities across positions. To tackle this challenge, we introduce an unsupervised fine-tuning method, termed the Information-Augmented and Consistency-Enhanced approach. This approach utilizes contrastive learning to align representations of in-context examples across different positions and introduces a consistency loss to ensure similar representations for inputs with different permutations. This enhances the model's predictive consistency across permutations. Experimental results on five benchmarks suggest that our proposed method can reduce the sensitivity of CausalLMs to the order of in-context examples and exhibits robust generalizability, particularly when demonstrations are sourced from a candidate pool different from that used in the training phase, or when the number of in-context examples differs from what is used during training.
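
The receptive-field argument in the abstract can be illustrated with a small sketch. It is not taken from the paper: the helper names (`causal_mask`, `prefix_mask`, `visible_demos`) and the toy layout of three demonstrations with three tokens each are assumptions made for illustration. The sketch builds both kinds of attention mask and reports which demonstrations the last token of each demonstration can fully attend to.

```python
def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def prefix_mask(n, prefix_len):
    """Bidirectional attention inside the prefix, causal attention after it."""
    return [
        [j <= i or (i < prefix_len and j < prefix_len) for j in range(n)]
        for i in range(n)
    ]


def visible_demos(mask, spans):
    """For each demonstration, list the demonstrations its last token fully sees."""
    result = []
    for _, end in spans:
        last = end - 1  # index of the demonstration's final token
        seen = [
            k
            for k, (s, e) in enumerate(spans)
            if all(mask[last][j] for j in range(s, e))
        ]
        result.append(seen)
    return result


# Hypothetical layout: three demonstrations of three tokens each.
spans = [(0, 3), (3, 6), (6, 9)]
n = 9

causal_view = visible_demos(causal_mask(n), spans)
# Here the whole demonstration block is treated as the bidirectional prefix.
prefix_view = visible_demos(prefix_mask(n, prefix_len=n), spans)

print("causal:", causal_view)
print("prefix:", prefix_view)
```

Under the causal mask, the first demonstration sees only itself while the last one sees all three, so the same sample receives a different context depending on where it is placed. Under the prefix mask, every demonstration sees the full set regardless of position.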

Paper Structure

This paper contains 30 sections, 16 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: The heatmap visualizes the similarities in representations of a specific token within a sample from the last layer outputs across different positions for Llama2-chat-7B.
  • Figure 2: The heatmap visualizes the similarities in representations of a specific token within a sample from the last encoder layer outputs across different positions for Flan-T5-XL.
  • Figure 3: Overview of our proposed InfoAC. We adopt contrastive learning (left) to align the representation of a sample $S_1$, as derived from model $M$, with the representation of $S_1$ derived from a reference model $M_r$ when $S_1$ is positioned at the end of the sequence. We also ensure that the hidden representations preceding the classification head remain similar when the sample is positioned at various locations, resulting in consistent outputs (right).
  • Figure A1: The heatmap visualizes the similarities in representations of a specific token within a sample from the last layer outputs across different positions for Llama2-chat-7B after fine-tuning with InfoAC.
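
The two objectives described for Figure 3 can be sketched in a few lines. This is a hedged illustration only: this page does not give the paper's exact loss formulas, so the sketch substitutes a generic InfoNCE-style contrastive loss with cosine similarity and a mean pairwise squared-distance consistency term. The function names, the temperature value, and the toy vectors are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors (zero vectors are not handled)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss (assumed form, not the paper's exact objective).

    anchor:    representation of a sample at some position, from model M
    positive:  representation of the same sample placed last, from reference M_r
    negatives: representations of other samples
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


def consistency(hidden_by_permutation):
    """Mean pairwise squared distance between final hidden states (assumed form)."""
    total, pairs = 0.0, 0
    for i in range(len(hidden_by_permutation)):
        for j in range(i + 1, len(hidden_by_permutation)):
            u, v = hidden_by_permutation[i], hidden_by_permutation[j]
            total += sum((a - b) ** 2 for a, b in zip(u, v))
            pairs += 1
    return total / pairs


# Toy vectors: an aligned anchor/positive pair gives a lower contrastive loss.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])

# Identical hidden states across permutations give zero consistency loss.
same = consistency([[1.0, 2.0], [1.0, 2.0]])
different = consistency([[0.0, 0.0], [1.0, 1.0]])
```

In this toy setup, minimizing the contrastive term pulls a sample's representation toward the one it would have at the end of the sequence, and minimizing the consistency term pulls the pre-classification hidden states of different permutations together.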