Order Independence With Finetuning

Katrina Brown, Reid McIlroy

TL;DR

The paper tackles the problem of order dependence in large language models during multiple-choice QA by integrating Set-Based Prompting (SBP) into finetuning. It introduces a margin-based contrastive objective to align SBP-formatted prompts with the model’s training distribution, mitigating performance drops seen when SBP is applied at inference time alone. Across in-distribution MMLU and out-of-distribution CSQA and ARC Challenge, SBP finetuning significantly boosts order-invariant accuracy while preserving overall language modeling capabilities, with margin-based training outperforming standard cross-entropy. The work demonstrates the promise of order-invariant modeling for fairer, more reliable LLMs and suggests directions for extending SBP to other tasks and configurations.
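To make the contrast between a margin-based objective and cross-entropy concrete, here is a minimal sketch using a generic hinge-style margin loss over per-choice scores. This is an illustration under assumptions, not the paper's objective: the function names, the `margin` value, and the example scores are all hypothetical, and the paper's exact formulation may differ.

```python
import math

def margin_loss(scores, correct_idx, margin=1.0):
    """Generic hinge-style margin loss (illustrative, not the paper's exact loss).

    The loss is zero once the correct choice's score exceeds the best
    incorrect choice's score by at least `margin`.
    """
    correct = scores[correct_idx]
    best_wrong = max(s for i, s in enumerate(scores) if i != correct_idx)
    return max(0.0, margin - (correct - best_wrong))

def cross_entropy_loss(scores, correct_idx):
    """Standard softmax cross-entropy over the same per-choice scores."""
    log_norm = math.log(sum(math.exp(s) for s in scores))
    return log_norm - scores[correct_idx]

# Hypothetical scores for four answer choices; index 0 is the correct one.
scores = [2.0, 0.5, 0.0, -1.0]
print(margin_loss(scores, 0))         # gap of 1.5 already clears the margin
print(cross_entropy_loss(scores, 0))  # still strictly positive
```

One practical difference this shows: the margin loss stops penalizing an example once the correct choice is separated by the margin, whereas cross-entropy keeps pushing the correct score up.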

Abstract

Large language models (LLMs) demonstrate remarkable performance on many NLP tasks, yet often exhibit order dependence: simply reordering semantically identical tokens (e.g., answer choices in multiple-choice questions) can lead to inconsistent predictions. Recent work proposes Set-Based Prompting (SBP) as a way to remove order information from designated token subsets, thereby mitigating positional biases. However, applying SBP on base models induces an out-of-distribution input format, which can degrade in-distribution performance. We introduce a fine-tuning strategy that integrates SBP into the training process, "pulling" these set-formatted prompts closer to the model's training manifold. We show that SBP can be incorporated into a model via fine-tuning. Our experiments on in-distribution (MMLU) and out-of-distribution (CSQA, ARC Challenge) multiple-choice tasks show that SBP fine-tuning significantly improves accuracy and robustness to answer-order permutations, all while preserving broader language modeling capabilities. We discuss the broader implications of order-invariant modeling and outline future directions for building fairer, more consistent LLMs.
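The order dependence described above can be illustrated with a toy evaluation over every ordering of four answer choices. The sketch below is a stand-in only: `toy_predict`, the `KNOWLEDGE` scores, and `FIRST_SLOT_BONUS` are made-up assumptions that mimic a positional bias, and the position-free variant simply ignores position rather than implementing SBP itself.

```python
from itertools import permutations

# Hypothetical per-choice plausibility scores for one question (made up).
KNOWLEDGE = {"Paris": 1.0, "Lyon": 0.8, "Nice": 0.2, "Lille": 0.1}
FIRST_SLOT_BONUS = 0.3  # assumed bias toward whichever choice is listed first

def toy_predict(choices, use_position=True):
    """Return the highest-scoring choice under the toy model."""
    def score(indexed_choice):
        idx, choice = indexed_choice
        bonus = FIRST_SLOT_BONUS if (use_position and idx == 0) else 0.0
        return KNOWLEDGE[choice] + bonus
    return max(enumerate(choices), key=score)[1]

def accuracy_over_orderings(choices, answer, use_position=True):
    """Fraction of all orderings of `choices` on which the toy model is correct."""
    orderings = list(permutations(choices))  # 4! = 24 orderings for 4 choices
    correct = sum(toy_predict(o, use_position) == answer for o in orderings)
    return correct / len(orderings)

choices = list(KNOWLEDGE)
print(accuracy_over_orderings(choices, "Paris", use_position=True))
print(accuracy_over_orderings(choices, "Paris", use_position=False))
```

In this toy setup the biased predictor fails exactly on the orderings that list "Lyon" first, while the position-free predictor gives the same answer on every ordering, which is the property order-invariant prompting aims for.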

Paper Structure

This paper contains 28 sections, 6 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Visualization of order dependency in Llama 2, 7B, when asked to choose the best among three resumes. In variant (a) the default ordering leads to a correct answer. Variant (b) reverses the answer choices and results in an incorrect response, while variant (c) applies Set-Based Prompting to neutralize ordering effects, restoring the correct answer.
  • Figure 2: Question answering accuracy for Llama-2-7b-hf under 4! = 24 reorderings, for Standard Prompting on the base model and for Set-Based Prompting on the base and finetuned models. In the legend, contrastive vs. cross-entropy indicates the loss function used during finetuning, while treatment vs. control indicates whether the model was finetuned on Set-Based Prompting data (treatment) or on standard order-dependent data (control).
  • Figure 3: On CSQA questions (unseen during finetuning), Set-Based Prompting accuracy post-fine-tuning significantly exceeds pre-fine-tuning best-of-2 accuracy. On the x-axis, the chat suffix indicates testing on Llama-2-7b-chat; its absence indicates testing on Llama-2-7b.
  • Figure 4: Question answering accuracy under 4! reorderings for standard prompting and SBP, pre- and post-fine-tuning on Llama-2-7b-chat-hf.
  • Figure 5: Finetuning on Set-Based Prompting data yields similar accuracy gains for both MMLU and ARC as for CSQA.
  • ...and 1 more figure