Bridging Vision and Language Spaces with Assignment Prediction

Jungin Park; Jiyoung Lee; Kwanghoon Sohn

TL;DR

VLAP tackles the challenge of grounding frozen LLMs in the visual world with minimal training by introducing a single linear projection and an assignment-prediction objective based on optimal transport. By treating pretrained LLM word embeddings as a fixed central space and assigning both visual and textual representations to this space, VLAP preserves the semantic taxonomy of the LLM while enabling cross-modal alignment without updating the LLM weights. The method yields substantial improvements over prior linear-mapping approaches across zero-shot image captioning, VQA, and cross-modal retrieval, and it enables visual semantic arithmetic via the learned embedding space. This approach offers a computationally efficient, scalable path to leveraging powerful LLMs for vision-language tasks using existing unimodal foundations, with strong potential for further gains when scaled to larger multimodal models and data.

Abstract

This paper introduces VLAP, a novel approach that bridges pretrained vision models and large language models (LLMs) so that frozen LLMs can understand the visual world. VLAP transforms the embedding space of pretrained vision models into the LLMs' word embedding space using a single linear layer, for efficient and general-purpose visual and language understanding. Specifically, we harness well-established word embeddings to bridge the two modality embedding spaces. The visual and text representations are simultaneously assigned to a set of word embeddings within pretrained LLMs by formulating the assignment procedure as an optimal transport problem. We predict the assignment of one modality from the representation of the other modality, enforcing consistent assignments for paired multimodal data. This allows vision and language representations to contain the same information, grounding the frozen LLMs' word embedding space in visual data. Moreover, the robust semantic taxonomy of LLMs can be preserved with visual data, since LLMs interpret and reason about linguistic information from correlations between word embeddings. Experimental results show that VLAP achieves substantial improvements over previous linear transformation-based approaches across a range of vision-language tasks, including image captioning, visual question answering, and cross-modal retrieval. We also demonstrate that the learned visual representations hold the semantic taxonomy of LLMs, making visual semantic arithmetic possible.
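The assignment-prediction idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the Sinkhorn-style solver, the uniform marginals over samples and words, the temperatures `eps` and `tau`, and all function names and array shapes are assumptions chosen for illustration. It also assumes the text features already live in the word-embedding dimension, so only the image features pass through the linear map `W`.

```python
import numpy as np

def l2norm(x):
    # Normalize rows to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def sinkhorn(scores, eps=0.05, n_iters=50):
    # Entropy-regularized optimal transport with uniform marginals (assumed):
    # every sample carries mass 1/B and every word receives mass 1/K.
    B, K = scores.shape
    Q = np.exp((scores - scores.max()) / eps)
    Q /= Q.sum()
    for _ in range(n_iters):
        Q /= Q.sum(axis=0, keepdims=True) * K  # each column sums to 1/K
        Q /= Q.sum(axis=1, keepdims=True) * B  # each row sums to 1/B
    return Q * B  # rescale so each row is a distribution over words

def assignment_prediction_loss(img_feats, txt_feats, W, word_emb, tau=0.1):
    E = l2norm(word_emb)              # frozen LLM word embeddings (K x d)
    z_img = l2norm(img_feats @ W)     # single linear projection into word space
    z_txt = l2norm(txt_feats)
    s_img, s_txt = z_img @ E.T, z_txt @ E.T
    q_img, q_txt = sinkhorn(s_img), sinkhorn(s_txt)  # targets, treated as constants
    # Swapped prediction: image scores predict the text assignment and vice versa.
    loss_i2t = -(q_txt * log_softmax(s_img / tau)).sum(axis=1).mean()
    loss_t2i = -(q_img * log_softmax(s_txt / tau)).sum(axis=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Toy data: 8 image-text pairs, 16-d image features, 12-d word space, 20 words.
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 16))
txt = rng.normal(size=(8, 12))
W = rng.normal(size=(16, 12)) * 0.1
vocab = rng.normal(size=(20, 12))
print(assignment_prediction_loss(img, txt, W, vocab))
```

In a real training loop only `W` would receive gradients, with the word embeddings and both backbones kept frozen as the abstract describes; the NumPy version above only evaluates the loss.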

Paper Structure (38 sections, 8 equations, 8 figures, 9 tables)

Introduction
Related Work
Method
Assignment prediction
Visual and text representations from pretrained models.
Word assignment.
Relaxing the modality gap with assignment prediction.
Image captioning with frozen LLMs
Experiments
Experimental settings
Datasets.
Model architecture.
Zero-shot image captioning
Visual question answering
Cross-modal retrieval
...and 23 more sections

Figures (8)

Figure 1: Overview of VLAP. We train a single linear layer with two learning objectives: assignment prediction, to bridge the modality gap between the visual and text representations, and image captioning, to draw on the generative capability of frozen LLMs.
Figure 2: Assignment prediction. The modality gap can be relaxed by predicting the word assignments of one modality from the other modality representations.
Figure 3: Selected examples from VLAP for vision-language tasks, including (a) zero-shot image captioning, (b) visual question answering (VQA), (c) visual dialog, and (d) text-to-image (T2I) retrieval.
Figure 4: Selected examples for visual semantic arithmetic.
Figure 5: Illustrations for the inference on (a) image captioning, (b) visual question answering, (c) visual dialog, and (d) text-to-image retrieval.
...and 3 more figures
