Optimization-Inspired Few-Shot Adaptation for Large Language Models

Boyan Gao, Xin Wang, Yibo Yang, David Clifton

TL;DR

This work tackles the challenge of few-shot adaptation for large language models by reframing the forward pass as a sequence of preconditioned gradient-descent steps and learning per-layer preconditioners via LayerNorm, avoiding extra trainable parameters and additional inference cost. It introduces two differentiable objectives—a step-ratio convergence penalty $\mathcal{J}(P)$ and a flat-region sharpness penalty based on the trace of the preconditioned Hessian $\mathrm{tr}(P_t\nabla^2 \mathcal{L}(Z_t) P_t^T)$—with Hutchinson’s estimator used to approximate the latter. The method, OFA, demonstrates strong improvements over ICL and PEFT baselines across diverse datasets and models, while reducing parameter overhead relative to methods like LoRA. This optimization-inspired approach offers a practical, data-efficient pathway for robust few-shot adaptation with real-world impact for varied NLP tasks, and lays groundwork for further theoretical and empirical refinements using internal optimization dynamics.
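As a rough illustration of the sharpness term, the sketch below estimates $\mathrm{tr}(P H P^T)$ with Hutchinson's estimator by averaging $v^T (P H P^T) v$ over random Rademacher probes $v$. It is a minimal pure-Python sketch, not the authors' implementation: the helper names (`matvec`, `transpose`, `hutchinson_trace`), the explicit matrices `P` and `H`, and the probe count are assumptions made for illustration. In a real model the Hessian would typically not be formed explicitly, and the product with $v$ would presumably come from automatic differentiation.

```python
import random


def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def transpose(A):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]


def hutchinson_trace(P, H, num_probes=64, seed=0):
    """Estimate tr(P H P^T) using only matrix-vector products.

    P and H are explicit d x d matrices here (an assumption for illustration).
    """
    rng = random.Random(seed)
    Pt = transpose(P)
    d = len(P)
    total = 0.0
    for _ in range(num_probes):
        # Rademacher probe: each entry is +1 or -1 with equal probability.
        v = [rng.choice((-1.0, 1.0)) for _ in range(d)]
        # Evaluate (P H P^T) v right to left: P^T v, then H(.), then P(.).
        w = matvec(P, matvec(H, matvec(Pt, v)))
        total += sum(vi * wi for vi, wi in zip(v, w))
    return total / num_probes


# Toy check: P H P^T = diag(1, 4), so the true trace is 5.
P = [[0.5, 0.0], [0.0, 2.0]]
H = [[4.0, 0.0], [0.0, 1.0]]
estimate = hutchinson_trace(P, H)
```

For a diagonal product such as this toy case every probe returns the exact trace; with off-diagonal entries the estimate is only unbiased, and its variance shrinks as `num_probes` grows.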

Abstract

Large Language Models (LLMs) have demonstrated remarkable performance in real-world applications. However, adapting LLMs to novel tasks via fine-tuning often requires substantial training data and computational resources that are impractical in few-shot scenarios. Existing approaches, such as in-context learning and Parameter-Efficient Fine-Tuning (PEFT), face key limitations: in-context learning introduces additional inference computational overhead with limited performance gains, while PEFT models are prone to overfitting on the few demonstration examples. In this work, we reinterpret the forward pass of LLMs as an optimization process, a sequence of preconditioned gradient descent steps refining internal representations. Based on this connection, we propose Optimization-Inspired Few-Shot Adaptation (OFA), integrating a parameterization that learns preconditioners without introducing additional trainable parameters, and an objective that improves optimization efficiency by learning preconditioners based on a convergence bound, while simultaneously steering the optimization path toward the flat local minimum. Our method overcomes both issues of ICL-based and PEFT-based methods, and demonstrates superior performance over the existing methods on a variety of few-shot adaptation tasks in experiments.

Paper Structure

This paper contains 17 sections, 4 theorems, 30 equations, 3 figures, 7 tables, 1 algorithm.

Key Result

Theorem 3.1

Let $f: \mathbb{R}^d \rightarrow \mathbb{R}$ be a twice continuously differentiable function with locally Lipschitz gradients. Suppose the update rule is a preconditioned gradient-descent step in which each $P_t \in \mathbb{R}^{d \times d}$ is a learnable preconditioning matrix, and define the step-ratio objective as in Eq. \ref{eq:step_ratio_loss}. Under the assumption that $f$ admits a local second-order Taylor expansion approxima
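To make the setting of the theorem more concrete, here is a small pure-Python sketch of preconditioned gradient descent on a toy quadratic. Several things are assumptions rather than content taken from the paper: the update is written in the standard form $z_{t+1} = z_t - P_t \nabla f(z_t)$, the "step ratio" is taken to be $\|z_{t+1} - z_t\| / \|z_t - z_{t-1}\|$ (the exact objective in Eq. eq:step_ratio_loss is not reproduced here), and the function names, the matrix `A`, and the two preconditioners are hypothetical.

```python
def quadratic_value(A, z):
    """f(z) = 0.5 * z^T A z for a symmetric matrix A (toy objective)."""
    Az = [sum(a * x for a, x in zip(row, z)) for row in A]
    return 0.5 * sum(zi * azi for zi, azi in zip(z, Az))


def quadratic_grad(A, z):
    """Gradient of f(z) = 0.5 * z^T A z, which is A z."""
    return [sum(a * x for a, x in zip(row, z)) for row in A]


def preconditioned_gd(A, P_list, z0):
    """Run z_{t+1} = z_t - P_t * grad f(z_t) and return all iterates."""
    zs = [list(z0)]
    for P in P_list:
        g = quadratic_grad(A, zs[-1])
        step = [sum(p * x for p, x in zip(row, g)) for row in P]
        zs.append([zi - si for zi, si in zip(zs[-1], step)])
    return zs


def step_ratios(zs):
    """Ratios of consecutive step norms (illustrative definition only)."""
    norms = [
        sum((b - a) ** 2 for a, b in zip(z_prev, z_next)) ** 0.5
        for z_prev, z_next in zip(zs, zs[1:])
    ]
    return [n1 / n0 for n0, n1 in zip(norms, norms[1:]) if n0 > 0]


# Ill-conditioned toy problem: curvature 1 in one direction, 10 in the other.
A = [[1.0, 0.0], [0.0, 10.0]]
z0 = [1.0, 1.0]

# Plain gradient descent: P_t = 0.05 * I at every step.
plain = preconditioned_gd(A, [[[0.05, 0.0], [0.0, 0.05]]] * 5, z0)

# Curvature-aware preconditioner: P_t = 0.5 * A^{-1} at every step.
precond = preconditioned_gd(A, [[[0.5, 0.0], [0.0, 0.05]]] * 5, z0)

plain_ratios = step_ratios(plain)
precond_ratios = step_ratios(precond)
```

In this toy case the curvature-aware preconditioner contracts both coordinates at the same rate, so its step ratios stay near 0.5 and it reaches a lower objective value after five steps than plain gradient descent, whose step ratio drifts upward as the slow direction dominates.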

Figures (3)

  • Figure 1: Probe analysis on EMO, SST, and TREC. Layer-wise prediction accuracy (%) and loss on the test set are compared across four variants: CE, CE + Step ratio, CE + Sharpness, and Ours. CE denotes the Llama2-7B model adapted to the target set with the cross-entropy loss by updating the LayerNorm parameters; CE + Step ratio follows the same adaptation protocol as CE but adds the step-ratio penalty in Eq. \ref{eq:step_ratio_loss}; CE + Sharpness instead adds the sharpness penalty in Eq. \ref{hessian_approx}; Ours uses the OFA objective in Eq. \ref{eq:main_objective}.
  • Figure 2: Sharpness comparison on MR, Subj, and TREC. Average sharpness over the test samples at each layer for three models: Base model denotes the few-shot (ICL) setting, CE the model trained with the cross-entropy loss on the demonstration samples, and Ours the model trained with OFA under the same adaptation protocol as CE.
  • Figure 3: Step-ratio comparison on the test sets of AGNews, Subj, and TREC at each layer of models based on Llama-7B. We compare the base model with demonstration examples (Base model), the model fine-tuned with the cross-entropy loss (CE), and the model tuned with OFA (Ours).

Theorems & Definitions (6)

  • Theorem 3.1
  • Theorem 3.2
  • Theorem B.1
  • proof
  • Theorem C.1
  • proof