Large (Vision) Language Models are Unsupervised In-Context Learners

Artyom Gadetsky; Andrei Atanov; Yulun Jiang; Zhitong Gao; Ghazal Hosseini Mighan; Amir Zamir; Maria Brbic

TL;DR

The paper addresses unsupervised adaptation of foundation models to new tasks by introducing a joint inference framework that makes predictions for all inputs of a task simultaneously. It derives two practical methods: unsupervised fine-tuning (requiring access to model weights and joint probabilities) and unsupervised in-context learning (no weight access, leveraging iterative self-prompting). Across NLP and vision-language benchmarks, the methods yield substantial gains over zero-shot inference and often rival supervised approaches, including a 39% absolute improvement on GSM8K and notable improvements on image tasks. The work enables scalable, label-free adaptation for diverse models, including GPT-4o, either by updating a lightweight task encoder or by iterative self-labeling, with clear trade-offs in compute and applicability across model types.
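The iterative self-labeling idea mentioned above can be illustrated with a minimal sketch. This is not the authors' algorithm: the leave-one-out context, the fixed-point stopping rule, and the names `unsupervised_icl` and `toy_predict` are all assumptions made for illustration, and `toy_predict` is a hypothetical keyword-based stand-in for a real model call.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (input text, self-generated label)
Predictor = Callable[[Sequence[Example], str], str]


def unsupervised_icl(inputs: List[str], predict: Predictor, rounds: int = 3) -> List[str]:
    """Refine labels by repeatedly prompting with the model's own predictions."""
    # Round 0: zero-shot, every input is labeled independently with no context.
    labels = [predict([], x) for x in inputs]
    for _ in range(rounds):
        new_labels = []
        for i, x in enumerate(inputs):
            # Assumption: the context is every *other* input paired with its
            # current self-generated label (no ground truth is used anywhere).
            context = [(inputs[j], labels[j]) for j in range(len(inputs)) if j != i]
            new_labels.append(predict(context, x))
        if new_labels == labels:  # stop once the labeling no longer changes
            break
        labels = new_labels
    return labels


def toy_predict(context: Sequence[Example], x: str) -> str:
    """Hypothetical stand-in for a model call; not a real LLM."""
    words = set(x.lower().split())
    if "great" in words:
        return "pos"
    if "awful" in words:
        return "neg"
    # Unknown input: copy the label of the most word-overlapping context example.
    best_label, best_overlap = "neg", 0
    for cx, cy in context:
        overlap = len(words & set(cx.lower().split()))
        if overlap > best_overlap:
            best_label, best_overlap = cy, overlap
    return best_label


inputs = ["great movie", "awful plot", "lovely great cast", "lovely cast"]
zero_shot = [toy_predict([], x) for x in inputs]
refined = unsupervised_icl(inputs, toy_predict)
```

In this toy setup the last input has no keyword the stand-in model recognizes, so its zero-shot label is only a default; after one round of self-prompting it inherits the label of the most similar self-labeled example.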

Abstract

Recent advances in large language and vision-language models have enabled zero-shot inference, allowing models to solve new tasks without task-specific training. Various adaptation techniques such as prompt engineering, In-Context Learning (ICL), and supervised fine-tuning can further enhance a model's performance on a downstream task, but they require substantial manual effort to construct effective prompts or labeled examples. In this work, we introduce a joint inference framework for fully unsupervised adaptation, eliminating the need for manual prompt engineering and labeled examples. Unlike zero-shot inference, which makes independent predictions, joint inference makes predictions simultaneously for all inputs in a given task. Since direct joint inference involves computationally expensive optimization, we develop efficient approximation techniques, leading to two unsupervised adaptation methods: unsupervised fine-tuning and unsupervised ICL. We demonstrate the effectiveness of our methods across diverse tasks and models, including the language-only Llama-3.1 on natural language processing tasks, the reasoning-oriented Qwen2.5-Math on grade school math problems, the vision-language OpenFlamingo on vision tasks, and the API-only GPT-4o model on massive multi-discipline tasks. Our experiments demonstrate substantial improvements over the standard zero-shot approach, including a 39% absolute improvement on the challenging GSM8K math reasoning dataset. Remarkably, despite being fully unsupervised, our framework often performs on par with supervised approaches that rely on ground truth labels.
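The contrast between independent and joint prediction can be written out as a short sketch. The notation here is assumed rather than taken from the paper: $x_1, \dots, x_N$ denote the inputs of a task, $y_n$ the answer for $x_n$, and $p_\theta$ the foundation model's distribution.

```latex
% Zero-shot inference: each input is predicted on its own.
\hat{y}_n = \arg\max_{y} \; p_\theta(y \mid x_n), \qquad n = 1, \dots, N

% Joint inference: all answers are chosen together.
(\hat{y}_1, \dots, \hat{y}_N) = \arg\max_{y_1, \dots, y_N} \; p_\theta(y_1, \dots, y_N \mid x_1, \dots, x_N)
```

For a finite answer set $\mathcal{Y}$, an exhaustive search over the joint objective ranges over $|\mathcal{Y}|^N$ candidate labelings, which is one way to see why approximations are needed.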
