Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models

Kengo Nakata, Daisuke Miyashita, Youyang Ng, Yasuto Hoshi, Jun Deguchi

TL;DR

This paper addresses image retrieval using sparse lexical representations by leveraging multi-modal LLMs to translate visual content into textual data that can be indexed with NLP-style sparse retrieval. The approach relies on pre-trained M-LLMs (notably LLaVA) with visual prompting to produce captions and tags, augmented by fixed-pattern cropping to expand the feature set, and evaluated under a keyword-based, text-to-image retrieval setting. Across MS-COCO, PASCAL VOC, and NUS-WIDE, the method achieves higher precision and recall than conventional vision-language model baselines, with PR-AUC improvements amplified by cropping and iterative keyword expansion. CLIPScore analysis validates the cropping strategy, and the results imply practical benefits for efficient, scalable image search when user queries are keyword-centric.

Abstract

In this paper, we rethink sparse lexical representations for image retrieval. By utilizing multi-modal large language models (M-LLMs) that support visual prompting, we can extract image features and convert them into textual data, enabling us to utilize efficient sparse retrieval algorithms employed in natural language processing for image retrieval tasks. To assist the LLM in extracting image features, we apply data augmentation techniques for key expansion and analyze the impact with a metric for relevance between images and textual data. We empirically show the superior precision and recall performance of our image retrieval method compared to conventional vision-language model-based methods on the MS-COCO, PASCAL VOC, and NUS-WIDE datasets in a keyword-based image retrieval scenario, where keywords serve as search queries. We also demonstrate that the retrieval performance can be improved by iteratively incorporating keywords into search queries.
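The pipeline described in the abstract (captions generated per image, then indexed and searched with a sparse retrieval algorithm) can be illustrated with a small sketch. This is not the authors' implementation: the scorer is assumed to be BM25, and the class name `SparseCaptionIndex`, the `tokenize` helper, the parameter defaults, and the toy captions are all hypothetical. In practice the captions would come from an M-LLM such as LLaVA.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class SparseCaptionIndex:
    """Inverted index over per-image caption text, scored with BM25 (an assumed scorer)."""

    def __init__(self, captions, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.doc_len = {}
        self.postings = defaultdict(dict)  # term -> {image_id: term frequency}
        for image_id, text in captions.items():
            counts = Counter(tokenize(text))
            self.doc_len[image_id] = sum(counts.values())
            for term, tf in counts.items():
                self.postings[term][image_id] = tf
        self.avg_len = sum(self.doc_len.values()) / max(len(self.doc_len), 1)

    def search(self, keywords, top_k=5):
        """Rank images by the summed BM25 score of the query keywords."""
        scores = defaultdict(float)
        n_docs = len(self.doc_len)
        for term in tokenize(" ".join(keywords)):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            idf = math.log(1 + (n_docs - len(docs) + 0.5) / (len(docs) + 0.5))
            for image_id, tf in docs.items():
                norm = 1 - self.b + self.b * self.doc_len[image_id] / self.avg_len
                scores[image_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Toy captions standing in for M-LLM output (hypothetical data).
captions = {
    "img_001": "a brown dog catching a frisbee in a park",
    "img_002": "a cat sleeping on a sofa next to a laptop",
    "img_003": "two dogs running on a beach, dog, beach, sea",
}
index = SparseCaptionIndex(captions)

# Iteratively adding a keyword narrows the ranking.
print(index.search(["dog"]))
print(index.search(["dog", "beach"]))
```

Because only the postings of the query terms are visited, the cost of a search depends on how many captions contain those keywords rather than on the size of the whole collection, which is the usual efficiency argument for sparse retrieval.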

Paper Structure

This paper contains 3 sections, 2 equations, 2 figures.

Figures (2)

  • Figure 1: Data augmentation techniques for key expansion. An original image is segmented into multiple regions as cropped images (left), and each cropped image is processed by an M-LLM to generate captions that extract the features of each region (right). By concatenating the generated captions, including those derived from the original image, we can extract a comprehensive set of features from the whole image.
  • Figure 2: The variations in averaged CLIPScore based on Eq. \ref{eq:averaged_clipscore_each} for each of the 5,000 validation images from the MS-COCO dataset. As shown in the left figure, the original images are cropped by fixed patterns including overlaps. In the upper right graph, the values are sorted by averaged CLIPScore for each image in descending order. The lower right table summarizes averaged CLIPScore based on Eq. \ref{eq:averaged_clipscore} for all the images in the dataset, along with the top-50 recall performance (R@50).
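
The key expansion described in Figure 1 (crop the image by a fixed pattern, caption each region, and concatenate the captions with the one from the original image) can be sketched as follows. The exact crop pattern is not given here, so the 2 x 2 grid, the `overlap` fraction, and the function names `fixed_pattern_boxes` and `expand_keys` are assumptions for illustration. `caption_fn` is a hypothetical stand-in for a call to an M-LLM on the cropped region.

```python
def fixed_pattern_boxes(width, height, rows=2, cols=2, overlap=0.2):
    """Return (left, top, right, bottom) boxes for a rows x cols grid.

    Each tile is enlarged by `overlap` (a fraction of the tile size) so that
    neighbouring crops share a border region, then clipped to the image.
    """
    tile_w, tile_h = width / cols, height / rows
    pad_w, pad_h = tile_w * overlap / 2, tile_h * overlap / 2
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = max(0, round(c * tile_w - pad_w))
            top = max(0, round(r * tile_h - pad_h))
            right = min(width, round((c + 1) * tile_w + pad_w))
            bottom = min(height, round((r + 1) * tile_h + pad_h))
            boxes.append((left, top, right, bottom))
    return boxes


def expand_keys(image_size, caption_fn, **grid):
    """Caption the full image plus every crop and concatenate the results."""
    width, height = image_size
    regions = [(0, 0, width, height)] + fixed_pattern_boxes(width, height, **grid)
    return " ".join(caption_fn(box) for box in regions)


# Placeholder captioner: a real system would crop the image and query an M-LLM.
def fake_caption(box):
    return f"caption for region {box}"


print(fixed_pattern_boxes(640, 480))
print(expand_keys((640, 480), fake_caption))
```

The concatenated string is what would be indexed for the image, so details visible only in one region can still become searchable keywords.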