ORAL: Prompting Your Large-Scale LoRAs via Conditional Recurrent Diffusion
Rana Muhammad Shahroz Khan, Dongwen Tang, Pingzhi Li, Kai Wang, Tianlong Chen
TL;DR
ORAL introduces a scalable, conditional recurrent diffusion framework that generates LoRA updates for large-scale, evolving foundation models by conditioning on both the base-model architecture and a textual task prompt. By tokenizing LoRA updates and processing them with a recurrent diffusion backbone, ORAL generates high-capacity parameter sets (up to hundreds of millions of parameters) while retaining task-specific controllability and transferring across model updates without retraining. Extensive experiments on vision, multimodal, and NLP tasks show that ORAL matches or surpasses traditional fine-tuning baselines and generalizes well to unseen evolving models. This approach enables efficient, flexible adaptation in rapidly changing LLM ecosystems, reducing retraining costs and making large-scale deployment practical.
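The TL;DR mentions tokenizing LoRA updates but does not specify the scheme. The following is a minimal sketch under stated assumptions and should not be read as ORAL's implementation. It uses the standard LoRA update ΔW = scale · B A, where B is d×r, A is r×k, and scale is typically α/r. For tokenization it assumes a simple flatten, concatenate, zero-pad, and chunk approach into fixed-size tokens. The function names `lora_delta`, `tokenize_lora`, and `detokenize_lora` are hypothetical.

```python
# Hypothetical sketch: NOT the ORAL authors' code.
# Assumes LoRA matrices are plain nested lists and tokens are fixed-size chunks.

Matrix = list[list[float]]


def lora_delta(B: Matrix, A: Matrix, scale: float) -> Matrix:
    """Standard LoRA update Delta W = scale * (B @ A); B is d x r, A is r x k."""
    r, k = len(A), len(A[0])
    return [
        [scale * sum(B[i][j] * A[j][c] for j in range(r)) for c in range(k)]
        for i in range(len(B))
    ]


def tokenize_lora(mats: list[Matrix], token_size: int) -> list[list[float]]:
    """Assumed scheme: flatten each matrix row-major, concatenate,
    zero-pad to a multiple of token_size, then split into equal chunks."""
    flat = [x for m in mats for row in m for x in row]
    flat += [0.0] * ((-len(flat)) % token_size)  # zero padding
    return [flat[i:i + token_size] for i in range(0, len(flat), token_size)]


def detokenize_lora(tokens: list[list[float]],
                    shapes: list[tuple[int, int]]) -> list[Matrix]:
    """Invert tokenize_lora given the original (rows, cols) shapes; drops padding."""
    flat = [x for tok in tokens for x in tok]
    mats, offset = [], 0
    for rows, cols in shapes:
        mats.append([flat[offset + i * cols: offset + (i + 1) * cols]
                     for i in range(rows)])
        offset += rows * cols
    return mats


if __name__ == "__main__":
    A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]                # r=2, k=3
    B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # d=4, r=2
    tokens = tokenize_lora([A, B], token_size=4)        # 14 values -> 4 tokens
    print(len(tokens), detokenize_lora(tokens, [(2, 3), (4, 2)]) == [A, B])
```

With this layout, a generator only has to produce a sequence of equal-length tokens, and the known matrix shapes let the sequence be mapped back into A and B. A recurrent backbone could then process the tokens one step at a time. How ORAL actually partitions and orders parameters may differ.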
Abstract
Parameter generation has emerged as a new paradigm for neural network development: instead of training a network in the traditional way, it synthesizes high-quality model weights directly. For Low-Rank Adaptation (LoRA) of evolving ($\textit{i.e.}$, constantly updated) large language models (LLMs), this approach promises efficient adaptation without costly retraining. However, existing methods struggle to achieve scalability and controllability simultaneously. In this paper, we introduce $\texttt{ORAL}$, a novel $\textbf{conditional recurrent diffusion}$ framework that addresses these challenges. $\texttt{ORAL}$ incorporates a conditioning mechanism that integrates the model architecture and a textual task specification, enabling the generation of task-specific LoRA parameters that transfer seamlessly across evolving foundation models. Our approach scales to LLMs with billions of parameters while maintaining controllability. Through extensive experiments on seven language tasks, four vision tasks, and three multimodal tasks using five pre-trained LLMs, we demonstrate that $\texttt{ORAL}$ generates high-quality LoRA parameters whose performance is comparable or superior to that of their vanilla-trained counterparts.
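The abstract does not give the training objective. For reference, the generic conditional denoising-diffusion (DDPM-style) formulation is shown below, applied to a clean LoRA parameter token $x_0$. Here $c = (c_{\text{arch}}, c_{\text{text}})$ is a hypothetical notation for a condition that combines an architecture embedding and a task-prompt embedding. This is the standard textbook objective. ORAL's exact noise schedule, recurrent structure, and conditioning may differ.

```latex
\begin{aligned}
\bar{\alpha}_t &= \prod_{s=1}^{t} (1 - \beta_s), \\
x_t &= \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
  \qquad \epsilon \sim \mathcal{N}(0, I), \\
\mathcal{L}(\theta) &= \mathbb{E}_{x_0,\, t,\, \epsilon}
  \left[ \big\lVert \epsilon - \epsilon_\theta(x_t, t, c) \big\rVert_2^2 \right],
  \qquad c = (c_{\text{arch}}, c_{\text{text}}).
\end{aligned}
```

At generation time, $x_T \sim \mathcal{N}(0, I)$ is denoised step by step with $\epsilon_\theta(\cdot, t, c)$. Changing $c_{\text{text}}$ selects the task, and changing $c_{\text{arch}}$ targets a different (e.g., updated) base model. This matches the kind of controllability and transfer the abstract describes.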
