MAIN: Mutual Alignment Is Necessary for instruction tuning

Fanyi Yang, Jianfeng Liu, Xin Zhang, Haoyu Liu, Xixin Cao, Yuefeng Zhan, Hao Sun, Weiwei Deng, Feng Sun, Qi Zhang

TL;DR

This work identifies instruction-response alignment as a critical driver of instruction-tuning quality and introduces the Mutual Alignment Framework (MAIN), which jointly optimizes instruction and response generation in a bidirectional, EM-inspired manner. Starting from seed data plus a large pool of unlabeled responses, MAIN uses forward and reverse models to guide data synthesis, applies dynamic weighting to balance synthetic and seed inputs, and uses mutual filtering to curate high-quality pairs. Across LLaMA-2-7B, Mistral, and Qwen, MAIN achieves state-of-the-art results on benchmarks for output preference, instruction following, and reasoning, and it remains robust across architectures and in multilingual settings. The results highlight alignment as a key lever for generalizable instruction tuning and provide a scalable pipeline for generating high-quality instruction-response data.
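
Mutual filtering can be illustrated with a minimal sketch. It assumes that each candidate pair has already been scored by a forward model (log-probability of the response given the instruction) and a reverse model (log-probability of the instruction given the response). The Candidate type, the per-token length normalization, and the threshold values are hypothetical choices for illustration, not the paper's exact criterion.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    instruction: str
    response: str
    fwd_logprob: float    # log p(response | instruction) from a forward model, summed over tokens (assumed precomputed)
    rev_logprob: float    # log p(instruction | response) from a reverse model, summed over tokens (assumed precomputed)
    response_len: int     # number of response tokens
    instruction_len: int  # number of instruction tokens


def mutual_filter(cands: List[Candidate], fwd_threshold: float, rev_threshold: float) -> List[Candidate]:
    """Keep a pair only if BOTH directions find it plausible (hypothetical criterion).

    Scores are length-normalized (mean log-prob per token) so that long
    responses or instructions are not rejected merely for being long.
    """
    kept = []
    for c in cands:
        fwd = c.fwd_logprob / max(c.response_len, 1)
        rev = c.rev_logprob / max(c.instruction_len, 1)
        if fwd >= fwd_threshold and rev >= rev_threshold:
            kept.append(c)
    return kept


pool = [
    Candidate("Summarize the text.", "A short summary.", -10.0, -4.0, 10, 4),  # both directions plausible
    Candidate("Summarize the text.", "Unrelated rant.", -30.0, -4.0, 10, 4),   # response does not fit the instruction
    Candidate("Write a poem.", "A short summary.", -5.0, -12.0, 10, 4),       # instruction not recoverable from the response
]
kept = mutual_filter(pool, fwd_threshold=-1.5, rev_threshold=-1.5)
print([c.response for c in kept])  # ['A short summary.']
```

Requiring both directions to pass is what distinguishes this from filtering on response quality alone: the third candidate has a highly probable response, but it is discarded because the reverse model cannot recover its instruction.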

Abstract

Instruction tuning has enabled large language models (LLMs) to achieve remarkable performance, yet its success depends heavily on the availability of large-scale, high-quality instruction-response pairs. To meet this demand, various methods have been developed to synthesize data at scale. However, current methods for scaling up data generation often overlook a crucial aspect: the alignment between instructions and responses. We hypothesize that the quality of an instruction-response pair is determined not by the individual quality of each component, but by the degree of their mutual alignment. To address this, we propose a Mutual Alignment Framework (MAIN) that enforces coherence between instructions and responses through mutual constraints. We demonstrate that MAIN generalizes well across model architectures and sizes, achieving state-of-the-art performance on LLaMA, Mistral, and Qwen models across diverse benchmarks. This work underscores the critical role of instruction-response alignment in enabling generalizable, high-quality instruction tuning for LLMs. All code is available from our repository.
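
The EM-inspired alternation can be sketched as a loop that refits both directions on the current data pool and then relabels unlabeled responses. This is a minimal skeleton under stated assumptions: train_forward, train_reverse, and keep_pair are hypothetical callables standing in for real model training and scoring, and the toy string-rewriting "models" in the usage section exist only to make the control flow runnable.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]        # (instruction, response)
Model = Callable[[str], str]  # a "model" here is just a text-to-text function


def mutual_alignment_loop(
    seed: List[Pair],
    unlabeled_responses: List[str],
    train_forward: Callable[[List[Pair]], Model],  # hypothetical: fits instruction -> response
    train_reverse: Callable[[List[Pair]], Model],  # hypothetical: fits response -> instruction
    keep_pair: Callable[[Pair, Model, Model], bool],
    rounds: int = 2,
) -> List[Pair]:
    """Alternate between refitting both directions and relabeling unlabeled responses."""
    synthetic: List[Pair] = []
    for _ in range(rounds):
        data = seed + synthetic
        # M-step analogue: refit both models on the current seed + synthetic pool.
        forward = train_forward(data)
        reverse = train_reverse(data)
        # E-step analogue: the reverse model proposes an instruction for each unlabeled response.
        proposals = [(reverse(r), r) for r in unlabeled_responses]
        # Mutual constraint: keep only pairs that pass a check involving both models.
        synthetic = [p for p in proposals if keep_pair(p, forward, reverse)]
    return synthetic


# --- Toy usage: string rules stand in for trained models (illustration only). ---
def toy_forward(data: List[Pair]) -> Model:
    return lambda instr: instr.split(" ", 1)[-1]  # "say hello" -> "hello"


def toy_reverse(data: List[Pair]) -> Model:
    return lambda resp: "say " + resp             # "hello" -> "say hello"


def toy_keep(pair: Pair, fwd: Model, rev: Model) -> bool:
    # Accept only if the forward model reproduces the response; also drop 1-character responses.
    return fwd(pair[0]) == pair[1] and len(pair[1]) > 1


seed = [("say hi", "hi")]
unlabeled = ["hello", "x"]
result = mutual_alignment_loop(seed, unlabeled, toy_forward, toy_reverse, toy_keep)
print(result)
```

In this sketch the synthetic pool is regenerated from scratch each round rather than accumulated, so that pairs produced by an earlier, weaker reverse model do not persist; accumulating them instead is an equally plausible design.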

Paper Structure

This paper contains 54 sections, 8 equations, 5 figures, 12 tables, 1 algorithm.

Figures (5)

  • Figure 1: This figure illustrates a common interaction where a person and a dog adjust their behaviors to align instruction with response, evolving through repeated interactions to achieve mutual understanding.
  • Figure 2: An overview of the data synthesis process, including mutual alignment, data augmentation, and data curation, aimed at creating high-quality, well-aligned instruction-response pairs from both seed and unlabeled data.
  • Figure 3: An overview of our method for iteratively aligning instructions and responses through mutual optimization.
  • Figure 4: Method comparison for instruction generation: a case study on the effectiveness of reverse-model approaches in aligning instructions with responses.
  • Figure 5: Evaluation of dynamic weighting strategies on LLaMA-2-7B training, comparing fixed and adaptive $\alpha$ values using the Falcon-RefinedWeb dataset, with performance assessed on AlpacaEval and IFEval.
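
Figure 5 contrasts fixed and adaptive values of α for balancing seed and synthetic data. The exact adaptive rule is not given in this summary, so the sketch below assumes that α is a convex weight on the synthetic-data loss and uses a simple linear schedule purely for illustration; alpha_schedule, mixed_loss, and the default endpoints 0.2 and 0.8 are hypothetical.

```python
def alpha_schedule(step: int, total_steps: int, alpha_start: float = 0.2, alpha_end: float = 0.8) -> float:
    """Hypothetical adaptive schedule: linearly shift weight from seed toward synthetic data."""
    if total_steps <= 1:
        return alpha_end
    t = min(max(step / (total_steps - 1), 0.0), 1.0)  # training progress, clamped to [0, 1]
    return alpha_start + t * (alpha_end - alpha_start)


def mixed_loss(seed_loss: float, synthetic_loss: float, alpha: float) -> float:
    """Convex combination; alpha weights the synthetic term (assumed convention)."""
    return alpha * synthetic_loss + (1.0 - alpha) * seed_loss


FIXED_ALPHA = 0.5  # fixed-weight baseline for comparison
for step in (0, 5, 10):
    a = alpha_schedule(step, total_steps=11)
    adaptive = mixed_loss(seed_loss=2.0, synthetic_loss=1.0, alpha=a)
    fixed = mixed_loss(seed_loss=2.0, synthetic_loss=1.0, alpha=FIXED_ALPHA)
    print(step, round(a, 2), round(adaptive, 2), fixed)
```

Starting with a small α keeps early training anchored to the trusted seed pairs; raising it later lets the larger synthetic pool contribute more once both models have improved. A fixed α applies the same trade-off at every step.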