Compressible and Searchable: AI-native Multi-Modal Retrieval System with Learned Image Compression

Jixiang Luo

TL;DR

The paper tackles scalable storage and retrieval of multi-modal data by fusing Learned Image Compression (LIC) with AI-native, CLIP-based cross-modal search. It proposes a multi-scale adapter that bridges LIC features to CLIP representations, enabling joint compression and semantically faithful retrieval. Through a two-phase MSCOCO/Kodak-based evaluation, it demonstrates modest bitrate costs with improved top-k search accuracy and provides ablations showing the encoder/decoder roles. This approach offers a practical path toward scalable, multi-modal databases that support text-to-image, image-to-image, and large-model prediction retrievals with reduced storage and computation.

Abstract

The burgeoning volume of digital content across diverse modalities necessitates efficient storage and retrieval methods. Conventional approaches struggle to cope with the escalating complexity and scale of multimedia data. In this paper, we propose a framework that addresses this challenge by fusing AI-native multi-modal search capabilities with neural image compression. First, we analyze the intricate relationship between compressibility and searchability, recognizing the pivotal role each plays in the efficiency of storage and retrieval systems. We then use a simple adapter to bridge the features of Learned Image Compression (LIC) and Contrastive Language-Image Pretraining (CLIP) while retaining semantic fidelity and supporting retrieval of multi-modal data. Experimental evaluations on the Kodak dataset demonstrate the efficacy of our approach, showcasing significant enhancements in compression efficiency and search accuracy compared to existing methodologies. Our work marks a significant advancement towards scalable and efficient multi-modal search systems in the era of big data.
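The retrieval step that the abstract and TL;DR refer to (top-k search over CLIP-aligned features) can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the adapter has already mapped each stored image's LIC features into the same embedding space as the query, and it ranks the stored items by cosine similarity. The function names `l2_normalize` and `cosine_top_k`, and the toy 3-dimensional vectors, are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_top_k(query, gallery, k):
    """Return the k gallery items most similar to the query.

    query   -- embedding of the text or image query (assumed CLIP-like)
    gallery -- list of embeddings, assumed to come from an adapter applied
               to compressed-domain (LIC) features; hypothetical here
    Result is a list of (gallery_index, cosine_similarity), best first.
    """
    q = l2_normalize(query)
    scored = []
    for idx, emb in enumerate(gallery):
        g = l2_normalize(emb)
        score = sum(a * b for a, b in zip(q, g))
        scored.append((score, idx))
    # Sort by descending similarity; break ties by lower index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(idx, score) for score, idx in scored[:k]]


# Toy data: four stored embeddings and one query embedding.
gallery = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [0.9, 0.1, 0.0]

top2 = cosine_top_k(query, gallery, k=2)
for idx, score in top2:
    print(f"item {idx}: similarity {score:.4f}")
```

In a real system the linear scan would typically be replaced by an approximate nearest-neighbor index, but the ranking criterion stays the same. The point of the sketch is that, once the features are aligned, search can run on the stored embeddings without decoding the images back to pixels.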

Paper Structure (10 sections, 7 equations, 5 figures, 2 tables)

This paper contains 10 sections, 7 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Compressible and Searchable
  • Figure 2: Left: the analysis of encoder and decoder. Right: the strategy we use to combine LIC and CLIP
  • Figure 3: Left: the pipeline of traditional retrieval. Right: the pipeline of AI-native retrieval
  • Figure 4: The retrieval result compared to CLIP
  • Figure 5: The retrieval result compared to CLIP