Disentangling Latent Shifts of In-Context Learning with Weak Supervision
Josip Jukić, Jan Šnajder
TL;DR
This paper tackles the instability and inefficiency of in-context learning (ICL) as the number of demonstrations grows. It reframes ICL as weak supervision and introduces wilda, a teacher–student framework that encodes demonstration-induced latent shifts into lightweight adapters, allowing a query-only student to reproduce teacher-like behavior. The core idea rests on a decomposition that separates zero-shot and ICL contributions, enabling adapter-based knowledge fusion and scalable handling of long contexts via adapter arithmetic. Empirically, wilda improves generalization, stability, and efficiency on in-domain (ID) and near-out-of-domain (near-OOD) tasks, and the student often surpasses its teacher through pseudo-label correction and coverage expansion, illustrating a weak-to-strong generalization phenomenon. Overall, wilda offers a modular, scalable approach to stable task adaptation in LLMs by treating ICL as a source of weak supervision and storing demonstrations as reusable parameter shifts.
Abstract
In-context learning (ICL) enables large language models to perform few-shot learning by conditioning on labeled examples in the prompt. Despite its flexibility, ICL suffers from instability, especially as prompt length increases with more demonstrations. To address this, we treat ICL as a source of weak supervision and propose a parameter-efficient method that disentangles demonstration-induced latent shifts from those of the query. An ICL-based teacher generates pseudo-labels on unlabeled queries, while a student predicts them using only the query input, updating a lightweight adapter. This captures demonstration effects in a compact, reusable form, enabling efficient inference while remaining composable with new demonstrations. Although trained on noisy teacher outputs, the student often outperforms its teacher through pseudo-label correction and coverage expansion, consistent with the weak-to-strong generalization effect. Empirically, our method improves generalization, stability, and efficiency across both in-domain and out-of-domain tasks, surpassing standard ICL and prior disentanglement methods.
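The teacher–student loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the "LLM" is replaced by a frozen linear classifier `W`, the effect of demonstrations is simulated by a hypothetical additive matrix `shift`, and the adapter is a plain additive weight delta rather than a low-rank module. All names (`distill_demo_shift`, `shift`, `delta`) and hyperparameters are assumptions made for illustration only.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    p = softmax(logits)
    return float(-np.log(p[np.arange(len(labels)), labels] + 1e-12).mean())

def distill_demo_shift(n=64, d=8, k=3, steps=500, lr=0.2, seed=0):
    """Toy teacher-student distillation of a demonstration-induced shift."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(d, k))        # frozen base weights (toy stand-in for the LLM)
    shift = rng.normal(size=(d, k))    # ASSUMPTION: simulates what demonstrations add
    queries = rng.normal(size=(n, d))  # unlabeled queries

    # Teacher: "demonstrations + query" (here W + shift) produces hard pseudo-labels.
    pseudo = (queries @ (W + shift)).argmax(axis=1)
    onehot = np.eye(k)[pseudo]

    # Student: sees the query only; the adapter `delta` is the sole trainable part.
    delta = np.zeros((d, k))
    loss_before = cross_entropy(queries @ W, pseudo)
    for _ in range(steps):
        p = softmax(queries @ (W + delta))
        grad = queries.T @ (p - onehot) / n  # gradient of mean cross-entropy w.r.t. delta
        delta -= lr * grad
    loss_after = cross_entropy(queries @ (W + delta), pseudo)

    zero_shot_pred = (queries @ W).argmax(axis=1)
    student_pred = (queries @ (W + delta)).argmax(axis=1)
    return {
        "delta": delta,
        "loss_before": loss_before,
        "loss_after": loss_after,
        "zero_shot_agreement": float((zero_shot_pred == pseudo).mean()),
        "student_agreement": float((student_pred == pseudo).mean()),
    }

if __name__ == "__main__":
    result = distill_demo_shift()
    for key in ("loss_before", "loss_after", "zero_shot_agreement", "student_agreement"):
        print(key, round(result[key], 3))
```

After training, `delta` plays the role of a stored, reusable parameter shift: the student reproduces the teacher's behavior from the query alone, without the demonstrations in its input. The toy setting says nothing about whether a student can surpass its teacher, which in the paper is an empirical finding on real models.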
