Large (Vision) Language Models are Unsupervised In-Context Learners
Artyom Gadetsky, Andrei Atanov, Yulun Jiang, Zhitong Gao, Ghazal Hosseini Mighan, Amir Zamir, Maria Brbic
TL;DR
The paper addresses unsupervised adaptation of foundation models to new tasks by introducing a joint inference framework that makes predictions for all inputs of a task simultaneously rather than independently. It derives two practical methods: unsupervised fine-tuning (requiring access to model weights and joint probabilities) and unsupervised in-context learning (no weight access, relying on iterative self-prompting). Across NLP and vision-language benchmarks, the methods yield substantial gains over zero-shot inference and often rival supervised approaches, including a 39% absolute improvement on GSM8K and notable improvements on image tasks. The work enables scalable, label-free adaptation for diverse models, including GPT-4o, either by updating a lightweight task encoder or by iterative self-labeling, with trade-offs in compute and in which model types each method applies to.
Abstract
Recent advances in large language and vision-language models have enabled zero-shot inference, allowing models to solve new tasks without task-specific training. Various adaptation techniques such as prompt engineering, In-Context Learning (ICL), and supervised fine-tuning can further enhance a model's performance on a downstream task, but they require substantial manual effort to construct effective prompts or labeled examples. In this work, we introduce a joint inference framework for fully unsupervised adaptation, eliminating the need for manual prompt engineering and labeled examples. Unlike zero-shot inference, which makes independent predictions, joint inference makes predictions simultaneously for all inputs in a given task. Since direct joint inference involves computationally expensive optimization, we develop efficient approximation techniques, leading to two unsupervised adaptation methods: unsupervised fine-tuning and unsupervised ICL. We demonstrate the effectiveness of our methods across diverse tasks and models, including language-only Llama-3.1 on natural language processing tasks, reasoning-oriented Qwen2.5-Math on grade school math problems, vision-language OpenFlamingo on vision tasks, and GPT-4o, which is accessible only through an API, on massive multi-discipline tasks. Our experiments demonstrate substantial improvements over the standard zero-shot approach, including a 39% absolute improvement on the challenging GSM8K math reasoning dataset. Remarkably, despite being fully unsupervised, our framework often performs on par with supervised approaches that rely on ground truth labels.
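
To make the idea of unsupervised ICL through iterative self-prompting more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' algorithm: the function `unsupervised_icl`, the `predict(query, demos)` callable, the parameters `rounds` and `n_demos`, and the rule-based `toy_predict` stand-in for a real model are all hypothetical names introduced here. The sketch starts from independent zero-shot predictions and then repeatedly re-labels each input, using the model's own current predictions on the other inputs as in-context demonstrations.

```python
import random
from typing import Callable, List, Sequence, Tuple

Demo = Tuple[str, str]  # (input text, label predicted by the model itself)


def unsupervised_icl(
    inputs: Sequence[str],
    predict: Callable[[str, List[Demo]], str],  # hypothetical model interface
    rounds: int = 3,
    n_demos: int = 4,
    seed: int = 0,
) -> List[str]:
    """Iteratively re-label all inputs using self-generated demonstrations."""
    rng = random.Random(seed)
    # Round 0: zero-shot, each input is predicted independently (no demos).
    labels = [predict(x, []) for x in inputs]
    for _ in range(rounds):
        new_labels = []
        for i, x in enumerate(inputs):
            # Demonstrations come from the OTHER inputs and their current labels.
            others = [j for j in range(len(inputs)) if j != i]
            chosen = rng.sample(others, min(n_demos, len(others)))
            demos = [(inputs[j], labels[j]) for j in chosen]
            new_labels.append(predict(x, demos))
        labels = new_labels
    return labels


def toy_predict(x: str, demos: List[Demo]) -> str:
    """Rule-based stand-in for a model (an assumption made for this example)."""
    if "good" in x:
        return "pos"
    if "bad" in x:
        return "neg"
    # Ambiguous input: follow the majority label in the demos, else "neg".
    votes = [label for _, label in demos]
    return "pos" if votes.count("pos") > votes.count("neg") else "neg"


reviews = ["good film", "good food", "good day", "fine"]
zero_shot = [toy_predict(x, []) for x in reviews]
adapted = unsupervised_icl(reviews, toy_predict, rounds=2, n_demos=4)
```

In this toy run the ambiguous input "fine" is labeled "neg" under zero-shot prediction but "pos" once the other inputs' predictions are supplied as demonstrations, which illustrates how predictions on one input can inform another without any ground-truth labels. A real system would replace `toy_predict` with a call to a language or vision-language model, and would likely need a more careful strategy for selecting demonstrations.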
