Multiple Choice Learning of Low-Rank Adapters for Language Modeling

Victor Letzelter, Hugo Malard, Mathieu Fontaine, Gaël Richard, Slim Essid, Andrei Bursuc, Patrick Pérez

TL;DR

LoRA-MCL is a training scheme that extends next-token prediction in language models so that diverse, plausible sentence continuations can be decoded at inference time. It combines Multiple Choice Learning with the Winner-Takes-All loss to handle ambiguity efficiently through Low-Rank Adaptation (LoRA).

Abstract

We propose LoRA-MCL, a training scheme that extends next-token prediction in language models with a method designed to decode diverse, plausible sentence continuations at inference time. Traditional language modeling is an intrinsically ill-posed problem: given a context, multiple "futures" may be equally plausible. Our approach leverages Multiple Choice Learning (MCL) and the Winner-Takes-All loss to efficiently handle ambiguity through Low-Rank Adaptation. We provide a theoretical interpretation of applying MCL to language modeling, assuming the data is generated from a mixture of distributions. We illustrate the proposed approach using mixtures of Markov chains. We then demonstrate with experiments on visual and audio captioning, as well as machine translation, that our method achieves high diversity and relevance in generated outputs. The accompanying code and a general-purpose package for applying LoRA-MCL to a wide range of language models are made available.
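As a rough illustration of the Winner-Takes-All idea mentioned in the abstract, the Python sketch below scores one target sequence under K hypotheses and keeps the loss of the best one. It is a minimal sketch and not the authors' implementation: the function names (`sequence_nll`, `wta_loss`), the toy probabilities, and the `epsilon` relaxation (weight `1 - epsilon` on the winner, `epsilon / (K - 1)` shared by the others) are assumptions made for illustration.

```python
import math


def sequence_nll(token_probs):
    """Negative log-likelihood of one sequence, given the probability
    that a hypothesis assigns to each of its target tokens."""
    return -sum(math.log(p) for p in token_probs)


def wta_loss(per_hypothesis_token_probs, epsilon=0.0):
    """Winner-Takes-All loss over K hypotheses (illustrative sketch).

    per_hypothesis_token_probs: list of K lists; entry k holds the
        probabilities hypothesis k assigns to the target tokens.
    epsilon: assumed relaxation weight; 0.0 gives the hard WTA loss.

    Returns (loss, winner_index, weights).
    """
    nlls = [sequence_nll(probs) for probs in per_hypothesis_token_probs]
    k = len(nlls)
    # The winner is the hypothesis with the lowest sequence-level loss.
    winner = min(range(k), key=nlls.__getitem__)
    if k == 1:
        weights = [1.0]
    else:
        weights = [
            1.0 - epsilon if i == winner else epsilon / (k - 1)
            for i in range(k)
        ]
    loss = sum(w * nll for w, nll in zip(weights, nlls))
    return loss, winner, weights


# Toy example with K = 2 hypotheses and a two-token target sequence.
toy_probs = [[0.5, 0.5], [0.9, 0.8]]
hard_loss, hard_winner, _ = wta_loss(toy_probs)
soft_loss, soft_winner, soft_weights = wta_loss(toy_probs, epsilon=0.1)
print(hard_winner, hard_loss, soft_loss)
```

With `epsilon=0.0` only the winning hypothesis contributes to the loss, so only its parameters would receive gradient in an autodiff framework. A small positive `epsilon` lets the other hypotheses contribute weakly as well.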

Paper Structure

This paper contains 51 sections, 5 theorems, 45 equations, 11 figures, 12 tables.

Key Result

Proposition 1

Assume a data-generating process $p(x\,|\,c) = \sum_{k = 1}^{K} p(z_k\,|\,c) \; p(x\,|\,z_k, c)$ (Asm. \ref{asm:mixture}), perfect model expressiveness (Asm. \ref{asm:expressiveness}), and a large enough batch size to approximate the true risk (Asm. \ref{asm:true_risk}). Then:

Figures (11)

  • Figure 1: LoRA-MCL. Components of a linear layer $\ell$ where LoRA is enabled, with context $c$ omitted. Frozen base weights $W_{\ell}$ are in blue; trainable LoRA adapters in light red. The forward pass (in gray) is computed independently for each hypothesis, where $h^{(1)}, \dots, h^{(K)}$ denote the hidden states as in \ref{eq:group_lora}. Gradients (purple arrows) are stronger for the winning hypothesis ($k^{\star}$).
  • Figure 2: Comparison of MCL with MLE. (Left) Validation loss over training steps (averaged across three seeds) for LoRA-MLE (blue) and LoRA-MCL (red). The theoretical optimal MLE loss is the entropy $\mathcal{H}(x)$. The gray shaded region represents the bounds of the theoretical optimal MCL loss, as given by $(a)$ and $(b)$ in \ref{eq:inequality1}. (Right) Learned transition matrices (top) versus references (bottom). MLE converges approximately toward the weighted average $\bar{P}$ (right-hand side of \ref{eq:stationary}), whereas LoRA-MCL recovers the two modes.
  • Figure 2: SPIDEr $(\uparrow)$ & mBLEU-4 $(\downarrow)$ on different parts of synthetic test set.
  • Figure 3: Quality–diversity trade-off on audio captioning (5 candidates). SPIDEr $(\uparrow)$ for quality, mBLEU-4 $(\downarrow)$ for diversity. Marker shape stands for the method, color for the decoding method, and size is proportional to forward passes per example at inference. LoRA-MLE uses $r\in\{8,8K\}$ for parameter parity. LoRA-MCL uses circle markers: Relaxed (black edge) and Annealed (red edge).
  • Figure 3: Quality and Diversity Evaluation on TextCaps (3 candidates). Best in bold; second-best underlined. Higher is better $(\uparrow)$ except mBLEU-4 $(\downarrow)$.
  • ...and 6 more figures
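The per-hypothesis forward pass described in the Figure 1 caption can be sketched in a few lines of Python. This is a hedged illustration under the standard LoRA assumption that each hypothesis $k$ adds a low-rank update $B_k A_k$ to the shared frozen weight, giving $h^{(k)} = W x + s\, B_k A_k x$. The names `matvec` and `lora_mcl_forward`, the scaling factor `scale`, and the toy matrices are hypothetical and not taken from the paper or its package.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_mcl_forward(w, adapters, x, scale=1.0):
    """Compute one hidden state per hypothesis for a single linear layer.

    w:        frozen base weight, shape (d_out, d_in), shared by all hypotheses.
    adapters: list of K pairs (a_k, b_k) with a_k of shape (r, d_in)
              and b_k of shape (d_out, r); these are the trainable parts.
    x:        input vector of length d_in.
    scale:    assumed LoRA scaling factor applied to the low-rank update.
    """
    base = matvec(w, x)  # computed once, reused by every hypothesis
    hidden_states = []
    for a_k, b_k in adapters:
        delta = matvec(b_k, matvec(a_k, x))  # low-rank update B_k (A_k x)
        hidden_states.append([h + scale * d for h, d in zip(base, delta)])
    return hidden_states


# Toy layer: identity base weight, K = 2 hypotheses with rank r = 1.
w = [[1, 0], [0, 1]]
adapters = [
    ([[1, 1]], [[1], [0]]),  # hypothesis 1
    ([[0, 0]], [[1], [1]]),  # hypothesis 2 (zero update)
]
states = lora_mcl_forward(w, adapters, [1, 2])
print(states)
```

Sharing `base` across hypotheses reflects the fact that only the adapters differ between them. A batched implementation would typically stack the K adapters and compute all hidden states in one call.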

Theorems & Definitions (7)

  • Proposition 1: Proof in Apx. \ref{sec:proof_prop:em}
  • Corollary 1: Proof in Apx. \ref{apx:proof_mc}
  • Proposition 2: mackay2003information
  • proof
  • Remark 1
  • Proposition 3
  • Corollary 2