ChameleonLLM: Batch-Aware Dynamic Low-Rank Adaptation via Inference-Time Clusters

Kamer Ali Yuksel; Hassan Sawaf

TL;DR

ChameleonLLM addresses the limitation that LLM weights stay fixed at inference time. It clusters the inputs in each batch and uses a hyper-network to generate context-specific low-rank updates for the LM head, while the transformer itself uses LoRA-style adapters. Because the updates are produced from aggregated batch statistics, they are conditioned on the batch at hand, and no large set of pre-learned masks has to be stored. Empirical results on WikiText-2 and Alpaca show lower validation loss and perplexity than traditional LoRA, indicating stronger generalization and robustness on open-domain inputs. The approach offers a practical inference-time adaptation mechanism that reduces storage overhead and responds better to real-world data distributions.

Abstract

Recent advances in large language models (LLMs) have shown remarkable performance across diverse tasks. However, these models are typically deployed with fixed weights, which limits their ability to adapt dynamically to the variability inherent in real-world data during inference. This paper introduces ChameleonLLM, a novel framework that enables inference-time adaptation of LLMs by leveraging batch-aware clustering and on-the-fly generation of low-rank updates. Unlike traditional fine-tuning approaches such as Low-Rank Adaptation (LoRA) or methods that rely on a fixed set of pre-learned uniforms (changeable masks), our method dynamically generates adaptive modifications to the decoder weights based on the aggregated statistics of clustered batches. By intelligently grouping similar inputs and computing context-aware low-rank updates via a hyper-network, ChameleonLLM achieves significant performance gains, outperforming conventional LoRA methods while eliminating the overhead of maintaining multiple expert models. Our experiments highlight the potential of our approach to serve as a versatile and highly adaptive solution for language model inference. ChameleonLLM is open-sourced to ensure the reproducibility of our experiments: https://anonymous.4open.science/r/ChamaleonLLM/
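The pipeline described above (group similar inputs, summarize each cluster, let a hyper-network emit a low-rank update to the decoder weights) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the k-means routine, the two-layer hyper-network, and every name and shape below (`kmeans`, `hyper_lowrank`, `adapted_logits`, `w1`, `w2`, `rank`) are assumptions chosen only to make the idea concrete, and the weights are random rather than trained.

```python
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Plain k-means (illustrative); returns one cluster label per row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Distance from every point to every center, shape (n, k).
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels

def hyper_lowrank(cluster_mean, w1, w2, d_model, vocab, rank):
    """Hypothetical hyper-network: maps a cluster's mean embedding to the
    factors of a low-rank update B @ A with shape (vocab, d_model)."""
    h = np.tanh(cluster_mean @ w1)
    flat = h @ w2
    a = flat[: rank * d_model].reshape(rank, d_model)
    b = flat[rank * d_model:].reshape(vocab, rank)
    return b @ a

def adapted_logits(hidden, embeds, lm_head, w1, w2, k, rank):
    """Cluster the batch, then apply a cluster-specific update to the LM head."""
    vocab, d_model = lm_head.shape
    labels = kmeans(embeds, k)
    logits = np.empty((len(hidden), vocab))
    for j in np.unique(labels):
        idx = labels == j
        # Aggregated statistic of the cluster: here, simply the mean embedding.
        delta = hyper_lowrank(embeds[idx].mean(axis=0), w1, w2,
                              d_model, vocab, rank)
        logits[idx] = hidden[idx] @ (lm_head + delta).T
    return logits, labels

# Toy usage with random data; all sizes are arbitrary assumptions.
rng = np.random.default_rng(0)
n, d_model, vocab, rank, k, d_hid = 8, 16, 32, 2, 2, 8
embeds = rng.normal(size=(n, d_model))    # per-input embeddings used for clustering
hidden = rng.normal(size=(n, d_model))    # final hidden states fed to the LM head
lm_head = rng.normal(size=(vocab, d_model))
w1 = rng.normal(size=(d_model, d_hid)) * 0.1
w2 = rng.normal(size=(d_hid, rank * (d_model + vocab))) * 0.1
logits, labels = adapted_logits(hidden, embeds, lm_head, w1, w2, k, rank)
delta_example = hyper_lowrank(embeds.mean(axis=0), w1, w2, d_model, vocab, rank)
```

In this sketch only the hyper-network parameters would need to be stored; the update itself is regenerated for every cluster, which is the storage argument made in the abstract. How the actual method computes batch statistics and structures its hyper-network is not specified in this section.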
