VIRTUE: Visual-Interactive Text-Image Universal Embedder

Wei-Yao Wang, Kazuya Tateishi, Qiyu Wu, Shusuke Takahashi, Yuki Mitsufuji

TL;DR

VIRTUE addresses the lack of visual-interactive capabilities in embedding models by combining a segmentation model (SAM-2) with a pretrained vision-language model to jointly encode entity-level and global-context information. It introduces a segmentation-aware prompt pathway and a segmentation-language connector so that visual prompts can influence the embeddings, and the framework is trained with a contrastive objective. The SCaR benchmark provides 1M samples for visual-interactive image-to-text retrieval, designed to evaluate region-grounded, compositional reasoning. VIRTUE achieves state-of-the-art performance on both benchmarks, improving by 3.1%-8.5% across the 36 MMEB tasks and by 15.2%-20.3% across the five SCaR tasks. Together, these contributions enable more precise, region-aware multimodal retrieval, with practical implications for interactive AI systems and grounded reasoning in vision-language tasks.
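The summary above states only that VIRTUE is trained with a contrastive objective. As an illustration of what such an objective typically looks like, here is a minimal sketch of an in-batch InfoNCE loss. The function names, the temperature value, and the use of in-batch negatives are assumptions made for this example, not details confirmed by the source.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def info_nce_loss(queries, targets, temperature=0.05):
    """Mean in-batch InfoNCE loss, where queries[i] is paired with targets[i].

    Every other target in the batch acts as a negative for query i.
    The temperature of 0.05 is an assumed value for illustration only.
    """
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, q_i in enumerate(q):
        # Cosine similarity of query i to every target, scaled by temperature.
        logits = [sum(a * b for a, b in zip(q_i, t_j)) / temperature for t_j in t]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(z - max_logit) for z in logits)
        )
        total += log_denominator - logits[i]
    return total / len(q)


# Toy embeddings: the aligned batch pairs each query with its nearest target,
# the swapped batch pairs each query with the wrong one.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.1], [0.1, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 1.0], [1.0, 0.1]])
print(aligned < swapped)
```

The loss is low when each query is most similar to its own target and high when the pairing is wrong, which is the property a contrastive embedder is trained to exploit.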

Abstract

Multimodal representation learning models have proven effective across complex tasks, and the integration of vision-language models (VLMs) has further enabled embedding models with instruction-following capabilities. However, existing embedding models lack visual-interactive capabilities that let users specify regions of interest (e.g., with a point, bounding box, or mask), capabilities that have been explored in generative models to broaden their human-interactive applicability. Equipping embedding models with visual interactions would not only unlock new applications with localized grounding of user intent, which remains unexplored, but also enable the models to learn entity-level information within images to complement their global representations for conventional embedding tasks. In this paper, we propose a novel Visual-InteRactive Text-Image Universal Embedder (VIRTUE) that extends the capabilities of the segmentation model and the vision-language model to the realm of representation learning. In VIRTUE, the segmentation model can process visual prompts that pinpoint specific regions within an image, thereby enabling the embedder to handle complex and ambiguous scenarios more precisely. To evaluate the visual-interaction ability of VIRTUE, we introduce a large-scale Segmentation-and-Scene Caption Retrieval (SCaR) benchmark comprising 1M samples, in which the goal is to retrieve the text caption by jointly considering the specified entity and the image scene. VIRTUE consistently achieves state-of-the-art performance with significant improvements across 36 universal MMEB (3.1%-8.5%) and five visual-interactive SCaR (15.2%-20.3%) tasks.
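To make the SCaR retrieval setup concrete, the sketch below ranks a small set of candidate caption embeddings against a query embedding and scores top-1 accuracy. Per the Figure 2 caption, each real SCaR sample has one ground-truth caption and nine distractors; the toy vectors, the function names, cosine similarity as the scoring function, and top-1 accuracy as the metric are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query_embedding, candidate_embeddings):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine(query_embedding, c) for c in candidate_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def top1_accuracy(samples):
    """Fraction of samples whose best-ranked candidate is the ground truth.

    Each sample is (query_embedding, candidate_embeddings, ground_truth_index).
    """
    hits = sum(
        1
        for query, candidates, truth in samples
        if rank_candidates(query, candidates)[0] == truth
    )
    return hits / len(samples)


# Hypothetical 2-D embeddings with three candidates per sample for brevity.
toy_samples = [
    ([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]], 1),  # retrieved correctly
    ([0.0, 1.0], [[1.0, 0.0], [0.2, 1.0], [0.5, 0.5]], 2),   # distractor wins
]
accuracy = top1_accuracy(toy_samples)
print(accuracy)
```

In a real evaluation the query embedding would come from the image plus its visual prompt and the candidates from the ten captions of each sample; only the ranking logic is shown here.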

Paper Structure

This paper contains 32 sections, 2 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Visual-interactive paradigms for image-to-image (I2I) retrieval with masks, assuming candidate images contain only dogs or only cats across different scenes, and for image-to-text (I2T) retrieval with bounding boxes. False retrievals occur when the retrieved content does not match the query’s scene context.
  • Figure 2: The data collection pipeline to build SCaR. We adopt GPT-4V to generate missing elements for the ground-truth caption as well as negative candidates. Collected samples (left) are filtered via LLM-then-human inspection (right) to ensure quality. Each SCaR sample contains an image with a bounding box, one ground-truth caption, and nine distractors.
  • Figure 3: Overview of VIRTUE. The framework, trained with a contrastive loss, consists of a segmentation model, a segmentation-language connector (orange), and a VLM (blue). It supports arbitrary combinations of visual and textual inputs with an optional visual prompt. If no prompt is provided, the model samples $N$ points uniformly from the image to extract entity-level information.
  • Figure 4: The prompt template used for constructing our SCaR benchmark with GPT-4V where the text in red varies for each sample.
  • Figure 5: The prompt template used for verifying the collected samples via LLM-based filtering.
  • ...and 6 more figures
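
The Figure 3 caption notes that, when no visual prompt is given, the model samples $N$ points uniformly from the image. One plausible reading is an evenly spaced grid of point prompts, sketched below. The grid layout, the function name, and the `points_per_side` parameter are assumptions for this example; the source does not say whether the sampling is a regular grid or uniform random draws.

```python
def uniform_grid_points(width, height, points_per_side):
    """Return points_per_side**2 (x, y) pixel coordinates on an even grid.

    Each point sits at the centre of one cell of a points_per_side by
    points_per_side partition of the image, so coverage is uniform.
    """
    if points_per_side < 1:
        raise ValueError("points_per_side must be at least 1")
    step_x = width / points_per_side
    step_y = height / points_per_side
    return [
        ((col + 0.5) * step_x, (row + 0.5) * step_y)
        for row in range(points_per_side)
        for col in range(points_per_side)
    ]


# A 640x480 image with a 4x4 grid gives N = 16 point prompts.
points = uniform_grid_points(width=640, height=480, points_per_side=4)
print(len(points), points[0], points[-1])
```

Each resulting coordinate could then be passed to a segmentation model as a point prompt; how VIRTUE feeds these points to SAM-2 is not described in this section.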