Improving the Robustness of Large Language Models via Consistency Alignment

Yukun Zhao, Lingyong Yan, Weiwei Sun, Guoliang Xing, Shuaiqiang Wang, Chong Meng, Zhicong Cheng, Zhaochun Ren, Dawei Yin

TL;DR

This paper quantitatively defines the inconsistency problem and proposes a two-stage training framework: instruction-augmented supervised fine-tuning, which helps a model generalize on following instructions via similar instruction augmentations, followed by consistency alignment training.

Abstract

Large language models (LLMs) have shown tremendous success in following user instructions and generating helpful responses. Nevertheless, their robustness is still far from optimal, as they may generate significantly inconsistent responses due to minor changes in the verbalized instructions. Recent literature has explored this inconsistency issue, highlighting the importance of continued improvement in the robustness of response generation. However, systematic analysis and solutions are still lacking. In this paper, we quantitatively define the inconsistency problem and propose a two-stage training framework consisting of instruction-augmented supervised fine-tuning and consistency alignment training. The first stage helps a model generalize on following instructions via similar instruction augmentations. In the second stage, we improve the diversity and help the model understand which responses are more aligned with human expectations by differentiating subtle differences in similar responses. The training process is accomplished by self-rewards inferred from the trained model at the first stage without referring to external human preference resources. We conduct extensive experiments on recent publicly available LLMs on instruction-following tasks and demonstrate the effectiveness of our training framework.

Paper Structure (33 sections, 7 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: GPT-4 generates inconsistent responses for the identical task.
  • Figure 2: The consistency metrics of recent LLMs.
  • Figure 3: Our consistency alignment training framework.
  • Figure 4: The performance of our training method for different values of $\lambda$ on the test set I.
  • Figure 5: The performance of SFT (IA) across varying numbers of instructions for each input on the test set I.