Vision-Language Models Create Cross-Modal Task Representations

Grace Luo, Trevor Darrell, Amir Bar

TL;DR

Autoregressive vision-language models (VLMs) can handle diverse tasks within a single architecture, but their internal task representations remain opaque. The authors identify a shared, end-delimiter task vector that compresses a task specification, whether given as text examples, image examples, or an instruction, into a modality-agnostic representation. They validate this with cross-modal patching, which often outperforms full-prompt baselines. They further show that task vectors transfer from a base LLM to its fine-tuned VLM and can be derived from instructions alone, and that ensembling instruction- and example-based vectors improves sample efficiency. These findings illuminate how VLMs compress, reuse, and align task information across modalities, with practical implications for cross-modal instruction, VQA, and interactive AI. The authors also acknowledge limitations and the need for deeper theoretical grounding.

Abstract

Autoregressive vision-language models (VLMs) can handle many tasks within a single model, yet the representations that enable this capability remain opaque. We find that VLMs align conceptually equivalent inputs into a shared task vector, which is invariant to modality (text, image) and format (examples, instruction), and may simplify VLM processing. We measure this alignment via cross-modal transfer -- the ability of a task vector derived in one modality to trigger the correct generation in another -- on a range of tasks and model architectures. Although the task vector is highly compressed, we find that this single vector outperforms prompting the model with the full task information, unique to this cross-modal case. Furthermore, we show that task vectors can be transferred from a base language model to its fine-tuned vision-language counterpart, and that they can be derived solely from instructions without the need for examples. Taken together, our findings shed light on how VLMs internally process task information, and how they map different modalities into common semantic representations. Project page: https://vlm-cross-modal-reps.github.io.
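
The cross-modal patching described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the "model" is a hypothetical stack of causal-mixing layers in NumPy, the random arrays stand in for embedded text examples and an embedded image query, and the layer index is chosen arbitrarily. It only shows the mechanics: read the hidden state at the final (delimiter) position of a context run, then overwrite the same position and layer in a separate query run.

```python
import numpy as np

def causal_mix(h):
    # Position t sees the mean of positions 0..t (a stand-in for causal attention).
    counts = np.arange(1, h.shape[0] + 1)[:, None]
    return np.cumsum(h, axis=0) / counts

def forward(h, weights, patch_layer=None, patch_vector=None):
    """Run all layers; optionally overwrite the last position after one layer.

    Returns the final hidden states and the last-position state after each layer.
    """
    last_states = []
    for i, w in enumerate(weights):
        h = np.tanh(causal_mix(h) @ w)
        if patch_layer == i:
            h = h.copy()
            h[-1] = patch_vector  # inject the task vector at the delimiter position
        last_states.append(h[-1].copy())
    return h, last_states

rng = np.random.default_rng(0)
d, n_layers, layer = 8, 4, 1  # hypothetical sizes and patch layer
weights = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_layers)]
context = rng.normal(size=(6, d))  # stands in for text examples ending in a delimiter
query = rng.normal(size=(3, d))    # stands in for an image query ending in a delimiter

# 1. Extract the task vector from the context run.
_, context_states = forward(context, weights)
task_vector = context_states[layer]

# 2. Run the query with and without the patch.
clean_out, _ = forward(query, weights)
patched_out, _ = forward(query, weights, patch_layer=layer, patch_vector=task_vector)

# 3. Sanity check: patching a run with its own vector changes nothing.
self_out, _ = forward(context, weights, patch_layer=layer, patch_vector=task_vector)
context_out, _ = forward(context, weights)
```

Because the mixing is causal, only the final position of the query run is affected by the patch; earlier positions are identical in `clean_out` and `patched_out`. In a real VLM the same idea would typically be implemented with forward hooks on a chosen transformer layer.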

Paper Structure

This paper contains 24 sections, 3 equations, 12 figures, 13 tables.

Figures (12)

  • Figure 1: VLMs map conceptually equivalent inputs into a shared task representation. This representation is invariant to the specification, regardless of modality (image, text) and format (examples, instruction).
  • Figure 3: Cross-modal transfer. (a) A single compressed task vector from one modality can induce the VLM to perform the task on queries from another modality, without additional training; this outperforms (b) feeding the full task information (see the text-to-image transfer table).
  • Figure 4: Given the same text examples, patching is more effective than prompting. We show qualitative examples transferring task information from text examples to image queries. Few-shot prompting (Prompt) regurgitates the input while cross-modal patching (Patch) successfully performs the task.
  • Figure 5: Inter-model transfer. For the same text examples, the base LLM and fine-tuned VLM contain highly similar task vectors (left). LLM task vectors can be patched onto image queries (right).
  • Figure 6: Ensembling instruction- and example-based task vectors improves sample efficiency. For cross-modal patching onto image queries, we compare the average task accuracy when using instructions, text examples, or an ensemble of the two. We plot the mean accuracy (solid lines) and variance (shaded regions), aggregated over three seeds.
  • ...and 7 more figures
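
The comparisons in Figures 5 and 6 can be sketched in a few lines. The captions do not say how similarity is measured or how the ensemble is formed, so this sketch assumes cosine similarity and an element-wise mean; the vectors are random placeholders, not data from the paper, and all names are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ensemble(vectors):
    """Element-wise mean of equally shaped task vectors (assumed ensembling rule)."""
    return np.mean(np.stack(vectors), axis=0)

rng = np.random.default_rng(0)

# Placeholder task vectors: the VLM vector is a lightly perturbed copy of the
# LLM vector, mimicking the "highly similar" case in Figure 5.
llm_vec = rng.normal(size=16)
vlm_vec = llm_vec + 0.1 * rng.normal(size=16)
similarity = cosine_similarity(llm_vec, vlm_vec)

# Placeholder instruction- and example-based vectors, combined as in Figure 6.
instruction_vec = rng.normal(size=16)
example_vec = rng.normal(size=16)
combined = ensemble([instruction_vec, example_vec])
```

The combined vector would then be patched onto an image query in place of either single vector.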