Closing the Data-Efficiency Gap Between Autoregressive and Masked Diffusion LLMs

Xu Pan, Ely Hahami, Jingxuan Fan, Ziqian Xie, Haim Sompolinsky

TL;DR

The paper addresses the challenge of updating pre-trained LLMs with new factual knowledge, highlighting the reversal curse that affects autoregressive (AR) fine-tuning and the data-efficiency advantage of masked diffusion LLMs (dLLMs). It shows that AR models rely on paraphrase augmentation to generalize to QA and struggle with backward questions, while dLLMs perform well without paraphrases. Inspired by this, it introduces a masked fine-tuning objective for AR LLMs that substantially improves data efficiency and closes the gap to dLLMs, and it extends the idea to masked SFT for math tasks. The approach yields strong forward and backward QA performance without paraphrases, converges rapidly, and reduces the cost of continual knowledge updates. Overall, the work presents a practical route to post-training knowledge injection based on a diffusion-inspired masking objective, supporting lifelong learning across LLM families.

Abstract

Although autoregressive large language models (arLLMs) are the currently dominant paradigm in language modeling, effectively updating these models to incorporate new factual knowledge remains difficult. They resist knowledge injection via fine-tuning due to inherent shortcomings such as the "reversal curse": the difficulty of answering questions that reverse the original information order of the training sample. Masked diffusion large language models (dLLMs) are rapidly emerging as a powerful alternative to the arLLM paradigm, with evidence of better data efficiency and freedom from the "reversal curse" in pre-training. However, it is unknown whether these advantages extend to the post-training phase, i.e., whether pre-trained dLLMs can easily acquire new knowledge through fine-tuning. On three diverse datasets, we fine-tune arLLMs and dLLMs and evaluate them with forward- and backward-style question answering (QA) to probe knowledge generalization and the reversal curse. Our results confirm that arLLMs critically rely on extensive data augmentation via paraphrases for QA generalization, and that paraphrases are only effective when their information order matches the QA style. Conversely, dLLMs achieve high accuracy on both forward and backward QA without paraphrases; adding paraphrases yields only marginal gains. Inspired by the dLLMs' performance, we introduce a novel masked fine-tuning paradigm for knowledge injection into pre-trained arLLMs. The proposed method drastically improves the data efficiency of arLLM fine-tuning, effectively closing its performance gap with dLLMs. We further show that the masked fine-tuning paradigm for arLLMs can be extended to the supervised fine-tuning (SFT) of mathematical capability. Across two models and two datasets, our masked SFT outperforms regular SFT.

Paper Structure

This paper contains 22 sections, 2 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: A schematic summary of the results. First row: an autoregressive LLM requires paraphrases to generalize knowledge from the fine-tuning text to QA tasks, and suffers from the reversal curse (i.e., it fails to answer backward questions). Second row: a masked diffusion LLM easily generalizes the fine-tuning text to QA tasks in both forward and backward styles. Third row: inspired by the masked diffusion LLM, we propose a masked fine-tuning paradigm that closes the fine-tuning gap between autoregressive LLMs and masked diffusion LLMs.
  • Figure 2: Training dynamics of arLLM (Llama 8B), dLLM (LLaDA), and masked arLLM (Llama 8B). For the NameDescription dataset, forward and backward accuracy are the averages of the N2D and D2N types. The paraphrases used in the Wiki dataset are the same-order paraphrase set. Because mask sampling is random, we average across 4 random seeds for the dLLM and masked arLLM on the NameDescription and Biography datasets. Curves for each seed are shown in the Appendix (figures labeled fig:seed_dllm through fig:seed_ar).
  • Figure 3: An example of a masked fine-tuning prompt. A random selection of text tokens is replaced by a [MASK] token. Highlighted tokens are used to compute the autoregressive loss.
  • Figure 4: Accuracy when using a fixed mask ratio ($t$) in dLLM fine-tuning and arLLM masked fine-tuning on the NameDescription dataset.
  • Figure 5: Learning rate sweep for Llama-3.1-8B-Instruct. We swept the learning rate on the NameDescription dataset with paraphrases and picked the value that gave fast convergence with no overfitting and minimal fluctuation: 5e-6 for arLLM; 1e-5 for dLLM; 3e-6 for masked arLLM.
  • ...and 11 more figures
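
The masked fine-tuning prompt described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation. It assumes a layout in which the masked copy of the text is followed by a separator and then the original text, with the autoregressive loss computed only on the original-text tokens. The separator token `<sep>`, the function names, and the choice to sample the mask ratio uniformly when none is given are all hypothetical; only the [MASK] replacement and the fixed mask ratio $t$ come from the captions.

```python
import random

MASK = "[MASK]"   # mask token named in the Figure 3 caption
SEP = "<sep>"     # hypothetical separator between masked copy and target


def mask_tokens(tokens, ratio, rng):
    """Replace each token with MASK independently with probability `ratio`."""
    return [MASK if rng.random() < ratio else tok for tok in tokens]


def build_masked_example(tokens, rng, ratio=None):
    """Build one training sample under the assumed layout.

    Returns (inputs, loss_mask): `inputs` is the masked copy, the separator,
    then the clean text; `loss_mask` is 1 only on the clean-text positions,
    i.e. the tokens that would contribute to the autoregressive loss.
    """
    # Assumption: sample the mask ratio uniformly when no fixed t is given.
    t = rng.random() if ratio is None else ratio
    corrupted = mask_tokens(tokens, t, rng)
    inputs = corrupted + [SEP] + list(tokens)
    loss_mask = [0] * (len(corrupted) + 1) + [1] * len(tokens)
    return inputs, loss_mask


if __name__ == "__main__":
    rng = random.Random(0)
    text = "Daphne Barrington is the director of A Journey Through Time".split()
    inputs, loss_mask = build_masked_example(text, rng, ratio=0.5)
    for tok, flag in zip(inputs, loss_mask):
        print(f"{flag} {tok}")
```

Because the mask is resampled on every call, the same sentence yields a different corrupted prefix each time it is seen during training, which is consistent with the seed-averaging noted in the Figure 2 caption.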