MAGNET: Augmenting Generative Decoders with Representation Learning and Infilling Capabilities
Savya Khosla, Aditi Tiwari, Kushal Kafle, Simon Jenni, Handong Zhao, John Collomosse, Jing Shi
TL;DR
MAGNET addresses the gap between representation learning and text generation in decoder-only LLMs. It introduces a hybrid attention mechanism that combines bidirectional and causal attention, together with three self-supervised objectives: masked next token prediction (MNTP), self-supervised contrastive learning (SSCL), and missing span generation (MSG). With these, a single fine-tuned LLM (Llama-2-7B) produces strong token- and sentence-level representations, performs context-aware infilling, and retains its pretraining knowledge, all while preserving autoregressive generation. Across word- and sentence-level benchmarks, infilling tasks, repetition analyses, and knowledge/reasoning evaluations, MAGNET matches or outperforms specialized encoders and prior adaptation methods. The result is one model that handles both understanding and generation, rather than separate models trained for each.
Abstract
While originally designed for unidirectional generative modeling, decoder-only large language models (LLMs) are increasingly being adapted for bidirectional modeling. However, unidirectional and bidirectional models are typically trained separately with distinct objectives (generation and representation learning). This separation overlooks the opportunity for developing a more versatile language model and for these objectives to complement each other. In this work, we propose MAGNET, a method for adapting decoder-only LLMs to generate robust representations and infill missing text spans. MAGNET employs three self-supervised training objectives and introduces an attention mechanism that combines bidirectional and causal attention, enabling unified training across all objectives. Our results demonstrate that LLMs adapted with MAGNET (1) surpass strong text encoders on token-level and sentence-level representation learning tasks, (2) generate contextually appropriate text infills by leveraging past and future contexts, (3) perform open-ended text generation without excessive repetition of words or phrases, and (4) preserve the knowledge and reasoning capability gained by the LLM during pretraining.
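To make the idea of combining bidirectional and causal attention concrete, the following is a minimal sketch of one way such a mask could be built. The abstract does not specify the masking rule, so the rule used here is an assumption for illustration only: tokens flagged as context are visible to every position (bidirectional), while tokens inside a span to be generated are visible only to span positions at or after them (causal). The function name `hybrid_attention_mask` and the `is_span` flags are hypothetical and do not come from the paper.

```python
from typing import List


def hybrid_attention_mask(is_span: List[bool]) -> List[List[bool]]:
    """Return mask[i][j] == True when query token i may attend to key token j.

    Illustrative rule (an assumption, not taken from the paper):
      * context keys (is_span[j] is False) are visible to every query,
        whether they lie before or after it (bidirectional);
      * span keys (is_span[j] is True) are visible only to span queries
        at the same or a later position (causal).
    """
    n = len(is_span)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if not is_span[j]:
                # Context key: always visible, past or future.
                mask[i][j] = True
            else:
                # Span key: visible only causally, and only to span queries.
                mask[i][j] = is_span[i] and j <= i
    return mask


if __name__ == "__main__":
    # Hypothetical 4-token sequence: context, span, span, context.
    flags = [False, True, True, False]
    for row in hybrid_attention_mask(flags):
        print("".join("x" if allowed else "." for allowed in row))
```

In this toy sequence, the span token at position 1 can see the context token at position 3 (future context) but not the span token at position 2, which is the combination of past and future context with left-to-right generation that the abstract describes. The actual mask used by MAGNET may differ from this sketch.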
