Large (Vision) Language Models are Unsupervised In-Context Learners

Artyom Gadetsky, Andrei Atanov, Yulun Jiang, Zhitong Gao, Ghazal Hosseini Mighan, Amir Zamir, Maria Brbic

TL;DR

The paper addresses unsupervised adaptation of foundation models to new tasks by introducing a joint inference framework that makes predictions for multiple inputs simultaneously. It derives two practical methods: unsupervised fine-tuning (requiring access to model weights and joint probabilities) and unsupervised in-context learning (no weight access, relying on iterative self-prompting). Across NLP and vision-language benchmarks, the methods yield substantial gains over zero-shot inference and often rival supervised approaches, including a 39% absolute improvement on GSM8K and notable improvements on image tasks. The work enables scalable, label-free adaptation for diverse models, including GPT-4o, either by updating a lightweight task encoder or by iterative self-labeling, with clear trade-offs in compute and applicability across model types.

Abstract

Recent advances in large language and vision-language models have enabled zero-shot inference, allowing models to solve new tasks without task-specific training. Various adaptation techniques such as prompt engineering, In-Context Learning (ICL), and supervised fine-tuning can further enhance the model's performance on a downstream task, but they require substantial manual effort to construct effective prompts or labeled examples. In this work, we introduce a joint inference framework for fully unsupervised adaptation, eliminating the need for manual prompt engineering and labeled examples. Unlike zero-shot inference, which makes independent predictions, the joint inference makes predictions simultaneously for all inputs in a given task. Since direct joint inference involves computationally expensive optimization, we develop efficient approximation techniques, leading to two unsupervised adaptation methods: unsupervised fine-tuning and unsupervised ICL. We demonstrate the effectiveness of our methods across diverse tasks and models, including language-only Llama-3.1 on natural language processing tasks, reasoning-oriented Qwen2.5-Math on grade school math problems, vision-language OpenFlamingo on vision tasks, and the API-only access GPT-4o model on massive multi-discipline tasks. Our experiments demonstrate substantial improvements over the standard zero-shot approach, including 39% absolute improvement on the challenging GSM8K math reasoning dataset. Remarkably, despite being fully unsupervised, our framework often performs on par with supervised approaches that rely on ground truth labels.
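
The contrast between independent and joint prediction can be written out as follows. This is a sketch in assumed notation (inputs $x_1, \dots, x_N$, answers $y_1, \dots, y_N$, and $p_\theta$ for the foundation model's distribution), and it may differ from the paper's own equations.

$$
\text{zero-shot:}\quad \hat{y}_i = \arg\max_{y}\; p_\theta(y \mid x_i), \qquad i = 1, \dots, N
$$

$$
\text{joint:}\quad (\hat{y}_1, \dots, \hat{y}_N) = \arg\max_{y_1, \dots, y_N}\; p_\theta(y_1, \dots, y_N \mid x_1, \dots, x_N)
$$

If each answer is drawn from a finite set $\mathcal{Y}$, the joint search ranges over $|\mathcal{Y}|^N$ candidate assignments, which illustrates why direct optimization is expensive and why approximations are needed.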

Paper Structure

This paper contains 27 sections, 17 equations, 9 figures, 8 tables, 2 algorithms.

Figures (9)

  • Figure 1: Joint inference framework for foundation models. Left: Unlike the standard zero-shot inference that makes a prediction $y$ independently for each input $x$, the joint inference makes predictions for multiple inputs at the same time, leveraging dependencies between all examples. Right: We develop two methods to perform the joint inference that achieve substantial improvements over traditional zero-shot inference: unsupervised fine-tuning and unsupervised ICL. Their performance increases as the number of examples $N$ for the joint inference increases, showing the effectiveness of the proposed joint inference framework.
  • Figure 2: Unsupervised fine-tuning is a principled optimization method to perform joint inference, enabling unsupervised adaptation to a new task. Given a dataset of questions, each iteration of the optimization involves generating answers via the task encoder independently for a batch of questions (Step 1). Subsequently, these answers are fed into a foundation model to estimate the joint probability, which provides a quantitative measure of the quality of the answers (Step 2). Finally, the task encoder is updated to maximize the joint probability (Step 3). These steps are repeated until convergence, yielding a task encoder adapted to the given task without any supervision.
  • Figure 3: Unsupervised In-Context Learning is a broadly applicable method to perform joint inference for any task and any existing foundation model. Left: Our method first generates answers for each question independently using zero-shot prompting. Subsequently, it enters the multi-turn stage, where, at each turn, for each question, the model is prompted with randomly sampled in-context examples from the dataset (excluding the considered question) with the corresponding answers from the previous turn. These examples are fed into the model in left-to-right order along with the current question to generate a refined answer. Such refinement is repeated for $T$ turns, yielding the final answers. Right: Both the joint inference objective and the performance improve with more optimization turns of the unsupervised ICL method.
  • Figure 4: Unsupervised ICL scales effectively at test-time. We report performance on the RTE dataset with Llama-3.1 models. We scale test-time compute by using more (unsupervised) ICL examples and show it provides a better compute-performance trade-off than zero-shot inference with a bigger model.
  • Figure 5: Unsupervised ICL and FT improve non-instruction-tuned models. We show the performance of different inference methods on the RTE dataset for base and instruction-tuned Llama-8B models. Both our methods applied to the base model outperform zero-shot inference with the instruction-tuned model.
  • ...and 4 more figures
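
The multi-turn procedure described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate` is a hypothetical callable that maps a prompt string to an answer string (for example, a wrapper around any chat or completion API), and the `Q:`/`A:` prompt template, the function names, and the default values of `num_turns` and `num_examples` are all invented here for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple


def build_prompt(examples: Sequence[Tuple[str, str]], question: str) -> str:
    """Lay out in-context (question, answer) pairs left to right, then the query.

    The "Q:/A:" template is an assumption made for this sketch.
    """
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def unsupervised_icl(
    questions: Sequence[str],
    generate: Callable[[str], str],  # hypothetical model wrapper: prompt -> answer
    num_turns: int = 3,              # T in the caption; default chosen arbitrarily
    num_examples: int = 4,           # in-context examples per prompt; arbitrary
    seed: int = 0,
) -> List[str]:
    rng = random.Random(seed)

    # Turn 0: answer every question independently with zero-shot prompting.
    answers = [generate(build_prompt([], q)) for q in questions]

    # Multi-turn stage: refine each answer using self-labeled examples.
    for _ in range(num_turns):
        refined = []
        for i, question in enumerate(questions):
            # Sample other questions, excluding the one being answered.
            others = [j for j in range(len(questions)) if j != i]
            picked = rng.sample(others, min(num_examples, len(others)))
            # Pair each sampled question with its answer from the previous turn.
            examples = [(questions[j], answers[j]) for j in picked]
            refined.append(generate(build_prompt(examples, question)))
        # Swap in the new answers only after the whole turn is complete.
        answers = refined

    return answers
```

Because the sketch only calls `generate` on text prompts, it needs no access to model weights or token probabilities, which matches the caption's claim that the method applies to API-only models. Each run issues one `generate` call per question per turn, plus the initial zero-shot pass.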