TAIA: Large Language Models are Out-of-Distribution Data Learners

Shuyang Jiang, Yusheng Liao, Ya Zhang, Yanfeng Wang, Yu Wang

TL;DR

This work re-evaluates the Transformer architecture and finds that not all parameter updates during fine-tuning contribute positively to downstream performance. It proposes an inference-time intervention method: Training All parameters but Inferring with only Attention (TAIA).

Abstract

Fine-tuning on task-specific question-answer pairs is a predominant method for enhancing the performance of instruction-tuned large language models (LLMs) on downstream tasks. However, in certain specialized domains, such as healthcare or harmless content generation, it is nearly impossible to obtain a large volume of high-quality data that matches the downstream distribution. To improve the performance of LLMs in data-scarce domains with domain-mismatched data, we re-evaluated the Transformer architecture and discovered that not all parameter updates during fine-tuning contribute positively to downstream performance. Our analysis reveals that within the self-attention and feed-forward networks, only the fine-tuned attention parameters are particularly beneficial when the training set's distribution does not fully align with the test set. Based on this insight, we propose an effective inference-time intervention method: Training All parameters but Inferring with only Attention (TAIA). We empirically validate TAIA using two general instruction-tuning datasets and evaluate it on seven downstream tasks involving math, reasoning, and knowledge understanding across LLMs of different parameter sizes and fine-tuning techniques. Our comprehensive experiments demonstrate that TAIA achieves superior improvements compared to both the fully fine-tuned model and the base model in most scenarios, with significant performance gains. The high tolerance of TAIA to data mismatches makes it resistant to jailbreaking tuning and enhances specialized tasks using general data. Code is available at https://github.com/pixas/TAIA_LLM.
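
The core idea in the abstract (fine-tune all parameters, then keep only the attention updates at inference) can be illustrated with a minimal sketch. This is not the authors' implementation; see the linked repository for that. The function names `is_attention_param` and `taia_state_dict`, the `"self_attn."` name marker (a LLaMA-style naming convention), and the choice to revert every non-attention parameter to its base value are all assumptions made for illustration. Parameter values are plain Python lists here so the example runs without any deep learning library.

```python
from typing import Dict, Sequence, TypeVar

T = TypeVar("T")

# Assumption: self-attention parameters contain this substring in their names
# (LLaMA-style naming). Adapt the markers to the model actually used.
ATTENTION_MARKERS = ("self_attn.",)


def is_attention_param(name: str, markers: Sequence[str] = ATTENTION_MARKERS) -> bool:
    """Return True if the parameter name looks like a self-attention weight."""
    return any(marker in name for marker in markers)


def taia_state_dict(
    base: Dict[str, T],
    finetuned: Dict[str, T],
    markers: Sequence[str] = ATTENTION_MARKERS,
) -> Dict[str, T]:
    """Build inference weights that keep only the fine-tuned attention updates.

    Attention parameters are taken from the fine-tuned model; every other
    parameter (feed-forward and, in this sketch, anything else) is taken
    from the base model. Neither input dictionary is modified.
    """
    if base.keys() != finetuned.keys():
        raise ValueError("base and fine-tuned models must share parameter names")
    return {
        name: finetuned[name] if is_attention_param(name, markers) else base_value
        for name, base_value in base.items()
    }


# Toy example with made-up parameter values.
base = {
    "layers.0.self_attn.q_proj.weight": [1.0, 1.0],
    "layers.0.mlp.up_proj.weight": [2.0, 2.0],
}
finetuned = {
    "layers.0.self_attn.q_proj.weight": [1.5, 0.5],
    "layers.0.mlp.up_proj.weight": [2.5, 1.5],
}
merged = taia_state_dict(base, finetuned)
print(merged)
```

In the toy example, the merged weights take the attention projection from the fine-tuned model and the feed-forward projection from the base model. With a real framework, the same selection would be applied to the model's parameter dictionary before loading it for inference.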

Paper Structure (48 sections, 24 equations, 10 figures, 15 tables)

This paper contains 48 sections, 24 equations, 10 figures, 15 tables.

Figures (10)

  • Figure 1: Performance comparison of various fine-tuning methods under three OOD data mixing scenarios. The target domain is medical knowledge, using the Chinese subset of MMedBench [qiu2024towards] as the in-domain training dataset. (a) The dataset is mixed with medical OOD data from CMExam [liu2023benchmarking], maintaining a total dataset size of 20k; (b) the dataset is mixed with general OOD data from CoT-Collection [kim-etal-2023-cot], also keeping the total dataset size at 20k; (c) the dataset includes general OOD data from CoT-Collection, while the size of the in-domain training dataset remains at 20k. As the proportion of OOD data increases, the performance of vanilla fine-tuning declines significantly, whereas TAIA sustains robust performance in the target domain (details in the appendix on data mixing experiments).
  • Figure 2: Comparison of different fine-tuning and inference methods. Parameters colored green and yellow represent models fine-tuned on in-domain and out-of-distribution data, respectively. "ID" and "OOD" denote in-distribution and out-of-distribution, respectively. The second row (fine-tuning) shows training on in-domain data (green) and out-of-domain data (yellow), followed by evaluation on in-domain and out-of-domain test sets. Vanilla fine-tuning performs well only when trained on ID data and evaluated on ID test sets. Compared with vanilla tuning, TAIA performs well on both types of test sets when given OOD data. TOA (Train-Only-Attention), a similar approach that trains only the attention parameters, performs poorly on both types of evaluation sets because it does not sufficiently explore the optimal parameter groups.
  • Figure 3: Performance of TOA and TAIA with the layer-wise FFN LoRA. All models are equipped with attention LoRA at each layer and fine-tuned on a corpus mixture with 50% OOD data.
  • Figure 4: (a) Average performance with different sizes of fine-tuning datasets; (b) the few-shot performance on MATH; (c) the layer-wise residual rank of the hidden states on MATH.
  • Figure 5: (a) The performance of LLMs fine-tuned with three specific downstream datasets on C-Eval and (b) the cosine similarity distribution of the hidden layer. The cosine similarity is calculated as the average distance between the output hidden states of the three fine-tuned models. TAIA achieves the best performance on C-Eval and has the most consistent hidden states among the three cases.
  • ...and 5 more figures