MATE: Meet At The Embedding -- Connecting Images with Long Texts

Young Kyun Jang, Junmo Kang, Yong Jae Lee, Donghyun Kim

TL;DR

MATE addresses the challenge of linking images with long texts by pairing a Vision-Language Model (VLM) image encoder with a pretrained Large Language Model (LLM) encoder through a learnable projection module. It trains this module in two stages: the first aligns VLM text embeddings with the LLM embedding space using large-scale captions and query-document data; the second adapts the same projection to align image embeddings with the LLM space using a small amount of image-caption data and LoRA-based fine-tuning. The approach enables image-long text retrieval without requiring image-long text pairs. It is validated on two new benchmarks, one for image-lengthy caption retrieval and one for image-document retrieval, where MATE consistently outperforms baselines such as CLIP, Long-CLIP, ALIGN, and BLIP. The work broadens cross-modal retrieval capabilities, with practical benefits for multilingual and multi-domain applications, while acknowledging the limitations of projection-based alignment and pointing to avenues for future research.

Abstract

While advancements in Vision Language Models (VLMs) have significantly improved the alignment of visual and textual data, these models primarily focus on aligning images with short descriptive captions. This focus limits their ability to handle complex text interactions, particularly with longer texts such as lengthy captions or documents, which have not been extensively explored yet. In this paper, we introduce Meet At The Embedding (MATE), a novel approach that combines the capabilities of VLMs with Large Language Models (LLMs) to overcome this challenge without the need for additional image-long text pairs. Specifically, we replace the text encoder of the VLM with a pretrained LLM-based encoder that excels in understanding long texts. To bridge the gap between VLM and LLM, MATE incorporates a projection module that is trained in a multi-stage manner. It starts by aligning the embeddings from the VLM text encoder with those from the LLM using extensive text pairs. This module is then employed to seamlessly align image embeddings closely with LLM embeddings. We propose two new cross-modal retrieval benchmarks to assess the task of connecting images with long texts (lengthy captions / documents). Extensive experimental results demonstrate that MATE effectively connects images with long texts, uncovering diverse semantic relationships.
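The multi-stage training described in the abstract can be illustrated with a minimal sketch. Several things here are assumptions rather than details taken from the paper: the projection is modeled as a single linear map `W`, the alignment objective is a one-directional InfoNCE contrastive loss with temperature 0.07, and all embeddings are toy vectors. The names `project`, `info_nce`, and `alignment_loss` are hypothetical. The sketch only shows that the same projection and the same loss can be reused in both stages, first with VLM text embeddings and then with VLM image embeddings as the source. It omits the LoRA-based fine-tuning and the query-document data mentioned in the TL;DR.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project(W, v):
    """Apply a linear projection W (one row per output dimension) to v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def info_nce(sources, targets, temperature=0.07):
    """Contrastive loss in which sources[i] should match targets[i]."""
    src = [normalize(s) for s in sources]
    tgt = [normalize(t) for t in targets]
    loss = 0.0
    for i, s in enumerate(src):
        # Cosine similarity of source i to every target, scaled by temperature.
        logits = [sum(a * b for a, b in zip(s, t)) / temperature for t in tgt]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(src)


def alignment_loss(W, vlm_embeddings, llm_embeddings):
    """Project VLM embeddings into the LLM space and score the alignment."""
    projected = [project(W, v) for v in vlm_embeddings]
    return info_nce(projected, llm_embeddings)


# Toy data (assumption): 3-dimensional VLM space, 2-dimensional LLM space.
vlm_text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
vlm_image = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
llm_text = [[1.0, 0.0], [0.0, 1.0]]

W_aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_swapped = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

# Stage 1: source = VLM text embeddings, target = LLM embeddings of the text.
stage1_loss = alignment_loss(W_aligned, vlm_text, llm_text)
# Stage 2: same projection, source = VLM image embeddings,
# target = LLM embeddings of the paired captions.
stage2_loss = alignment_loss(W_aligned, vlm_image, llm_text)
print(stage1_loss, stage2_loss)
```

In this toy setup, a projection that maps each source embedding onto its paired LLM embedding (`W_aligned`) yields a much lower loss than one that maps it onto the wrong pair (`W_swapped`). That gap is the signal a trainable projection would follow in either stage.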

Paper Structure

This paper contains 18 sections, 3 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: A long text can be linked with different images (above) and an image can be associated with various domains of texts (below). To facilitate these cross-modal interactions, it is essential to establish a robust connection between the embeddings of individual modality samples, while ensuring that both are contextually aligned and semantically rich.
  • Figure 2: Training pipeline of MATE: Two separate stages are applied with text-only or image-text pairs.
  • Figure 3: Measuring alignment between embeddings of VLM image with VLM text (VLM-I to VLM-T), and VLM image with LLM text (VLM-I to LLM). The higher score indicates a closer alignment.
  • Figure 4: Examples of DOCCI test set of image-human annotated lengthy caption pairs.
  • Figure 5: Examples of CC3M-long test set of image-generated lengthy caption pairs.
  • ...and 4 more figures
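
The Figure 3 caption does not say how the alignment score is computed. As a hedged illustration only, one common choice is the mean cosine similarity over paired embeddings, sketched below. It assumes both embedding sets already share a dimension (for example, after a projection), and the function names `cosine` and `mean_paired_cosine` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_paired_cosine(xs, ys):
    """Average cosine similarity over pairs (xs[i], ys[i])."""
    return sum(cosine(x, y) for x, y in zip(xs, ys)) / len(xs)


# Toy embeddings (assumption): two image embeddings and two candidate
# sets of text embeddings in the same 2-dimensional space.
image_embeddings = [[1.0, 0.0], [0.0, 1.0]]
close_text_embeddings = [[2.0, 0.0], [0.0, 3.0]]
far_text_embeddings = [[0.0, 1.0], [1.0, 0.0]]

close_score = mean_paired_cosine(image_embeddings, close_text_embeddings)
far_score = mean_paired_cosine(image_embeddings, far_text_embeddings)
print(close_score, far_score)
```

Under this measure a higher score means the paired embeddings point in more similar directions, which matches the caption's reading that a higher score indicates closer alignment.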