Unsupervised Layer-Wise Dynamic Test Time Adaptation for LLMs

Longhuan Xu, Cunjian Chen, Feng Yin

TL;DR

This paper tackles unsupervised, sample-specific test-time adaptation for LLMs by shifting from a single global learning rate to a fine-grained, layer-wise control mechanism. It introduces ScaleNet, a lightweight hypernetwork that predicts per-layer, per-step learning-rate multipliers to modulate LoRA updates during a short adaptation horizon, and it adopts a first-order unrolled optimization to train ScaleNet efficiently. Across diverse models and datasets, the proposed dynamic TTA yields improved stability and performance (lower NLL and higher ROUGE-Lsum) compared to fixed-rate baselines and layer-agnostic schedules, with the largest gains in early adaptation steps and on larger LLMs. The approach demonstrates that per-layer, per-step scaling is beneficial for unsupervised TTA, offering a practical path to more reliable and transferable prompt-specific adaptation in real-world deployments.

Abstract

Test-time adaptation (TTA) for large language models (LLMs) updates model parameters at inference time using signals available at deployment. This paper focuses on a common yet under-explored regime: unsupervised, sample-specific TTA, where the model adapts independently for each prompt using only the prompt itself, without gold answers or external supervision. Although appealing, naive unsupervised TTA with a fixed, handcrafted learning rate can be unstable: updates may overfit to prompt-specific statistics, drift from the desired answer distribution, and ultimately degrade generation quality. This failure mode is not surprising, as in this case TTA must adapt to a single prompt within only a few gradient steps, unlike standard training that averages updates over large datasets and long optimization horizons. Therefore, we propose layer-wise dynamic test-time adaptation, a framework which explicitly modulates TTA strength as a function of prompt representation, LLM structure and adaptation step. In our setting, TTA updates only LoRA parameters, and a lightweight hypernetwork predicts per-layer, per-step learning-rate multipliers, enabling fine-grained control. Experiments across various datasets and LLMs consistently show that our method substantially strengthens TTA by learning effective scaling patterns over adaptation steps and transformer layer projections, improving stability while delivering better performance.
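
The layer-wise, step-wise scaling described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain SGD, uses a toy quadratic loss in place of the unsupervised prompt objective, and hard-codes the multipliers that ScaleNet would predict in the paper. The names `scaled_sgd_step`, `adapt`, `toy_grad_fn`, and the layer keys are hypothetical.

```python
from typing import Callable, Dict, List

# Each "layer" is a flat list of floats standing in for one LoRA parameter group.
Params = Dict[str, List[float]]


def scaled_sgd_step(params: Params, grads: Params, base_lr: float,
                    multipliers: Dict[str, float]) -> Params:
    """One SGD step where each layer uses base_lr * multipliers[layer]."""
    updated: Params = {}
    for name, weights in params.items():
        lr = base_lr * multipliers[name]  # per-layer effective learning rate
        updated[name] = [w - lr * g for w, g in zip(weights, grads[name])]
    return updated


def adapt(params: Params, grad_fn: Callable[[Params], Params], base_lr: float,
          schedule: List[Dict[str, float]]) -> Params:
    """Run len(schedule) adaptation steps; schedule[k] holds step k's multipliers."""
    for step_multipliers in schedule:
        grads = grad_fn(params)
        params = scaled_sgd_step(params, grads, base_lr, step_multipliers)
    return params


def toy_grad_fn(params: Params) -> Params:
    """Gradient of the toy loss 0.5 * w**2 (an assumption, not the paper's loss)."""
    return {name: list(weights) for name, weights in params.items()}


# Hand-picked multipliers (assumption): adapt layer0.q, freeze layer1.q.
init = {"layer0.q": [1.0], "layer1.q": [1.0]}
schedule = [
    {"layer0.q": 1.0, "layer1.q": 0.0},  # step 1
    {"layer0.q": 0.5, "layer1.q": 0.0},  # step 2: smaller update
]
adapted = adapt(init, toy_grad_fn, base_lr=0.1, schedule=schedule)
```

A fixed-rate baseline corresponds to setting every multiplier to 1.0 at every step, and a layer-agnostic schedule to using one shared value per step across all layers.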

Paper Structure (27 sections, 19 equations, 5 figures, 2 tables)

This paper contains 27 sections, 19 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Unsupervised layer-wise dynamic TTA training pipeline (right side).
  • Figure 2: Simple control hypernetwork: ScaleNet architecture.
  • Figure 3: NLL results. The vertical axis shows the average negative log-likelihood (NLL) per answer token, and the horizontal axis shows the number of TTA steps. The red curve is the naïve fixed-learning-rate baseline. Green and blue correspond to layer-agnostic/step-wise and layer-wise ScaleNet; yellow corresponds to sample-averaged layer-wise ScaleNet.
  • Figure 4: ScaleNet output heatmap averaged over 4 datasets and 4 moderate-size LLMs. Along the horizontal axis, $q_k$ and $v_k$ denote the query and value projections at step $k$. From top to bottom, transformer layers are ordered from shallow (first) to deep (last).
  • Figure 5: Percentage difference in ScaleNet output, with 95% CI, across schedules (baseline $K=5$), and the mean scaling magnitude over all schedules. Both are averaged over the test samples of 4 datasets and 4 moderate-size LLMs.
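
The metric on the vertical axis of Figure 3, average NLL per answer token, can be sketched as follows. This is a minimal illustration that assumes the model's probability for each gold answer token is already available as a list of floats; the function name `average_nll` is hypothetical, and the paper's evaluation code may compute the quantity differently (for example, directly from logits).

```python
import math
from typing import List


def average_nll(answer_token_probs: List[float]) -> float:
    """Mean of -log p over the answer tokens; lower is better."""
    if not answer_token_probs:
        raise ValueError("need at least one answer token")
    total = -sum(math.log(p) for p in answer_token_probs)
    return total / len(answer_token_probs)
```

Under this definition, a model that assigns probability 0.5 to every answer token has an average NLL of ln 2 (about 0.693), and a model that assigns probability 1 to every answer token has an average NLL of 0.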