Optimization-Inspired Few-Shot Adaptation for Large Language Models
Boyan Gao, Xin Wang, Yibo Yang, David Clifton
TL;DR
This work addresses few-shot adaptation of large language models by reframing the forward pass as a sequence of preconditioned gradient-descent steps and learning per-layer preconditioners via LayerNorm, avoiding extra trainable parameters and additional inference cost. It introduces two differentiable objectives: a step-ratio convergence penalty $\mathcal{J}(P)$, and a sharpness penalty that favors flat regions, based on the trace of the preconditioned Hessian $\mathrm{tr}(P_t \nabla^2 \mathcal{L}(Z_t) P_t^\top)$ and approximated with Hutchinson’s estimator. The method, OFA, shows strong improvements over ICL and PEFT baselines across diverse datasets and models, while reducing parameter overhead relative to methods like LoRA. The approach offers a practical, data-efficient route to robust few-shot adaptation across NLP tasks and lays groundwork for further theoretical and empirical work on internal optimization dynamics.
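The trace term above is the kind of quantity Hutchinson's estimator targets: it needs only products of the matrix with random probe vectors, never the full matrix. The sketch below is a toy illustration rather than the authors' implementation. The small explicit matrix `H`, the diagonal preconditioner `p`, and the helper names are all assumptions made for the example; in a real model the product with $\nabla^2 \mathcal{L}(Z_t)$ would come from an automatic-differentiation Hessian-vector product.

```python
import random

def matvec(A, v):
    """Dense matrix-vector product for a list-of-lists matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def hutchinson_trace(apply_op, dim, num_samples=1000, seed=0):
    """Estimate tr(A) from products v -> A v using Rademacher probes.

    For probes with E[v v^T] = I, each v^T A v is an unbiased sample of tr(A).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        Av = apply_op(v)
        total += sum(x * y for x, y in zip(v, Av))
    return total / num_samples

# Toy stand-ins (assumptions): a symmetric "Hessian" H and a diagonal P.
H = [[2.0, 0.5, 0.0],
     [0.5, 1.0, 0.3],
     [0.0, 0.3, 3.0]]
p = [0.5, 1.0, 2.0]  # diagonal entries of the preconditioner P

def preconditioned_hessian(v):
    """Apply P H P^T to v without forming the product matrix."""
    pv = [pi * vi for pi, vi in zip(p, v)]      # P^T v (P is diagonal)
    hpv = matvec(H, pv)                         # H (P^T v)
    return [pi * x for pi, x in zip(p, hpv)]    # P (H P^T v)

# Exact trace for comparison: sum_i p_i^2 * H_ii.
exact = sum(p[i] * H[i][i] * p[i] for i in range(3))
estimate = hutchinson_trace(preconditioned_hessian, dim=3, num_samples=2000)
```

The estimate is stochastic: its variance depends on the off-diagonal entries of the matrix, so more probes are needed when those entries are large.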
Abstract
Large Language Models (LLMs) have demonstrated remarkable performance in real-world applications. However, adapting LLMs to novel tasks via fine-tuning often requires substantial training data and computational resources that are impractical in few-shot scenarios. Existing approaches, such as in-context learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT), face key limitations: ICL adds inference-time computational overhead for limited performance gains, while PEFT models are prone to overfitting on the few demonstration examples. In this work, we reinterpret the forward pass of LLMs as an optimization process: a sequence of preconditioned gradient descent steps that refine internal representations. Based on this connection, we propose Optimization-Inspired Few-Shot Adaptation (OFA), which combines a parameterization that learns preconditioners without introducing additional trainable parameters with an objective that improves optimization efficiency by learning preconditioners based on a convergence bound, while simultaneously steering the optimization path toward flat local minima. Our method addresses the limitations of both ICL-based and PEFT-based methods and, in experiments, outperforms existing methods on a variety of few-shot adaptation tasks.
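To make the notion of preconditioned gradient descent concrete, the sketch below applies the generic update $Z_{t+1} = Z_t - P \nabla \mathcal{L}(Z_t)$ to a toy ill-conditioned quadratic. It is not the update used inside an LLM layer, which this section does not spell out. The loss, the diagonal preconditioner, and every function name are assumptions chosen for illustration.

```python
def loss(z, a):
    """Toy quadratic L(z) = 0.5 * sum_i a_i * z_i^2 (an assumption)."""
    return 0.5 * sum(ai * zi * zi for ai, zi in zip(a, z))

def grad(z, a):
    """Gradient of the toy quadratic."""
    return [ai * zi for ai, zi in zip(a, z)]

def preconditioned_gd(z0, a, p, steps):
    """Run z <- z - P grad(z) with a diagonal preconditioner P = diag(p)."""
    z = list(z0)
    for _ in range(steps):
        g = grad(z, a)
        z = [zi - pi * gi for zi, pi, gi in zip(z, p, g)]
    return z

a = [1.0, 100.0]   # curvatures differ by 100x, so the problem is ill-conditioned
z0 = [1.0, 1.0]

# Scalar step size 1/100 (P = 0.01 * I): stable, but slow along the flat direction.
plain = preconditioned_gd(z0, a, p=[0.01, 0.01], steps=20)

# Diagonal preconditioner matched to the curvature: every coordinate
# contracts by the same factor of 0.1 per step.
matched = preconditioned_gd(z0, a, p=[0.9 / ai for ai in a], steps=20)

plain_loss = loss(plain, a)
matched_loss = loss(matched, a)
```

With the same number of steps, the curvature-matched preconditioner reaches a far lower loss than the scalar step size, which is the motivation for learning preconditioners in the first place.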
