MAGNET: Augmenting Generative Decoders with Representation Learning and Infilling Capabilities

Savya Khosla, Aditi Tiwari, Kushal Kafle, Simon Jenni, Handong Zhao, John Collomosse, Jing Shi

TL;DR

MAGNET narrows the gap between representation learning and text generation in decoder-only LLMs. It introduces a hybrid attention mechanism that combines bidirectional and causal attention, together with three self-supervised objectives: masked next token prediction (MNTP), self-supervised contrastive learning (SSCL), and missing span generation (MSG). The paper shows that a single LLM (Llama-2-7B) can be fine-tuned to produce strong token- and sentence-level representations, perform context-aware infilling, and retain pretraining knowledge, all while preserving autoregressive generation. Across word- and sentence-level benchmarks, infilling tasks, repetition analyses, and knowledge/reasoning evaluations, MAGNET performs better than or comparably to specialized encoders and prior adaptation methods. The result is a single model that handles both understanding and generation.

Abstract

While originally designed for unidirectional generative modeling, decoder-only large language models (LLMs) are increasingly being adapted for bidirectional modeling. However, unidirectional and bidirectional models are typically trained separately with distinct objectives (generation and representation learning). This separation overlooks the opportunity for developing a more versatile language model and for these objectives to complement each other. In this work, we propose MAGNET, a method for adapting decoder-only LLMs to generate robust representations and infill missing text spans. MAGNET employs three self-supervised training objectives and introduces an attention mechanism that combines bidirectional and causal attention, enabling unified training across all objectives. Our results demonstrate that LLMs adapted with MAGNET (1) surpass strong text encoders on token-level and sentence-level representation learning tasks, (2) generate contextually appropriate text infills by leveraging past and future contexts, (3) perform open-ended text generation without excessive repetition of words or phrases, and (4) preserve the knowledge and reasoning capability gained by the LLM during pretraining.

Paper Structure (23 sections, 5 equations, 6 figures, 12 tables)

Figures (6)

  • Figure 1: Traditionally, LLMs are trained for text generation using unidirectional attention between the input $x$ and output $y$ (depicted by black lines), whereas text encoders are trained for representation learning using bidirectional attention (depicted by gray lines). MAGNET adapts the attention mechanism of LLMs to combine both unidirectional and bidirectional attention, enhancing them with representation learning and infilling capabilities while retaining their core generative functions.
  • Figure 2: Attention masks for different types of attention mechanisms. The rows of the matrices correspond to the query tokens and the columns correspond to the key tokens. Light gray cells indicate 0, dark gray cells represent $-\infty$, green marks span token positions, and blue marks context token positions. Each context token attends to every other context token, and each span token attends to all context tokens and the preceding span tokens in the same span.
  • Figure 3: MAGNET training objectives include: (a) Masked next token prediction, which is applied to the output corresponding to the token preceding the masked context token. (b) Self-supervised contrastive learning, which is applied to the model's representation corresponding to the last token. (c) Missing span generation, which is applied to the output corresponding to the span tokens. In this illustration, the red lines denote bidirectional attention and the black lines denote causal attention. For (a) and (c), the output token $y_i$ is trained to predict the input token $x_{i+1}$, as denoted by "$y_i \sim x_{i+1}$".
  • Figure 4: MAGNET processes three views of the input using different attention mechanisms within the same LLM. The model is trained (or fine-tuned) using three self-supervised learning objectives simultaneously to augment it with the ability to generate token-level and sentence-level representations and perform text infilling tasks, while maintaining its original left-to-right text generation capability.
  • Figure 5: LLM2Vec increases text repetition with more training, while no such trend is observed for MAGNET.
  • ...and 1 more figures
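
The attention rule stated in the Figure 2 caption can be made concrete with a small sketch. The code below is an illustration built only from that caption, not the authors' implementation. The function name `hybrid_attention_mask`, the `span_ids` encoding (`None` for a context token, an integer span index for a span token), and the choice to let a span token attend to itself are all assumptions made for this example.

```python
NEG_INF = float("-inf")


def hybrid_attention_mask(span_ids):
    """Build an additive attention mask (rows = queries, columns = keys).

    span_ids[i] is None for a context token, or an integer span index for a
    span token (hypothetical encoding chosen for this sketch).
    Entries are 0.0 where attention is allowed and -inf where it is blocked.
    """
    n = len(span_ids)
    mask = [[NEG_INF] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            # Every query, context or span, may attend to every context key.
            key_is_context = span_ids[k] is None
            # A span query may also attend to span keys of its own span at or
            # before its position (self-attention included by assumption).
            same_span_past = (
                span_ids[q] is not None
                and span_ids[k] == span_ids[q]
                and k <= q
            )
            if key_is_context or same_span_past:
                mask[q][k] = 0.0
    return mask


if __name__ == "__main__":
    # Two context tokens, a two-token span, one context token, a one-token span.
    example = [None, None, 0, 0, None, 1]
    for row in hybrid_attention_mask(example):
        print(" ".join("0" if v == 0.0 else "x" for v in row))
```

In this sketch, context rows are fully bidirectional over context columns and never see span columns. Span rows see all context, including context to their right, which is what lets an infill use future tokens, and they see their own span causally.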