MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query

Wei Chow, Yuan Gao, Linfeng Li, Xian Wang, Qi Xu, Hang Song, Lingdong Kong, Ran Zhou, Yi Zeng, Yidong Cai, Botian Jiang, Shilin Xu, Jiajun Zhang, Minghui Qiu, Xiangtai Li, Tianshu Yang, Siliang Tang, Juncheng Li

TL;DR

MERIT introduces a multilingual benchmark for interleaved multi-condition semantic retrieval and identifies critical gaps in existing methods that ignore conditional elements. It proposes Coral, a fine-tuning framework that combines embedding reconstruction with contrastive learning to preserve fine-grained attributes while capturing global semantics. Empirical results show Coral delivering substantial gains on MERIT and across eight standard benchmarks, demonstrating the approach's effectiveness and generalization. Collectively, the work lays a foundation for advancing interleaved multi-condition retrieval in multilingual, multimodal settings.

Abstract

Semantic retrieval is crucial for modern applications yet remains underexplored in current research. Existing datasets are limited to single languages, single images, or single retrieval conditions, and often fail to fully exploit the expressive capacity of visual information, as evidenced by performance that holds steady when images are replaced with captions. However, practical retrieval scenarios frequently involve interleaved multi-condition queries with multiple images. Hence, this paper introduces MERIT, the first multilingual dataset for interleaved multi-condition semantic retrieval, comprising 320,000 queries with 135,000 products in 5 languages, covering 7 distinct product categories. Extensive experiments on MERIT identify a limitation of existing models: they focus solely on global semantic information while neglecting specific conditional elements in queries. Consequently, we propose Coral, a novel fine-tuning framework that adapts pre-trained MLLMs by integrating embedding reconstruction to preserve fine-grained conditional elements and contrastive learning to extract comprehensive global semantics. Experiments demonstrate that Coral achieves a 45.9% performance improvement over conventional approaches on MERIT, with strong generalization capabilities validated across 8 established retrieval benchmarks. Collectively, our contributions (a novel dataset, identification of critical limitations in existing approaches, and an innovative fine-tuning framework) establish a foundation for future research in interleaved multi-condition semantic retrieval.
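
The abstract describes Coral only at a high level, as a combination of contrastive learning and embedding reconstruction. As a rough illustration of how two such terms can be summed into one training objective, here is a minimal sketch in plain Python. It is not the paper's formulation: the in-batch InfoNCE form, the mean-squared-error reconstruction term, the `weight` and `temperature` parameters, and all function names are assumptions made for this example.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, targets, temperature=0.05):
    """In-batch contrastive loss (assumed form): query i should match target i."""
    q = [normalize(v) for v in queries]
    t = [normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, tj) / temperature for tj in t]
        # Numerically stable log-sum-exp over all targets in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def reconstruction_mse(reconstructed, original):
    """Mean squared error between reconstructed and original embeddings
    (an assumed stand-in for the reconstruction term)."""
    total, count = 0.0, 0
    for r, o in zip(reconstructed, original):
        for x, y in zip(r, o):
            total += (x - y) ** 2
            count += 1
    return total / count


def combined_loss(queries, targets, reconstructed, original,
                  weight=1.0, temperature=0.05):
    """Contrastive term plus a weighted reconstruction term (hypothetical weighting)."""
    contrastive = info_nce(queries, targets, temperature)
    return contrastive + weight * reconstruction_mse(reconstructed, original)
```

In this sketch the contrastive term pulls each query embedding toward its paired target and away from the other targets in the batch, while the reconstruction term penalizes loss of detail in the embeddings; how Coral actually defines and balances the two is specified in the paper itself.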

Paper Structure

This paper contains 49 sections, 5 equations, 69 figures, 12 tables.

Figures (69)

  • Figure 1: Illustrative examples of interleaved multi-condition semantic retrieval. MERIT enables the first multilingual semantic retrieval with composite multi-condition queries that interleave textual descriptions and visual references, reflecting real-world product search scenarios where users specify multiple attributes through both text and images.
  • Figure 2: Comparisons between MERIT and existing datasets [liu2021image, wu2021fashion, zhang2024magiclens]. Left: Previous works are limited to single-condition, single-image, single-language scenarios. Right: Our benchmark enables multilingual semantic retrieval, featuring composite multi-condition queries.
  • Figure 3: Dataset statistics.
  • Figure 4: Summary of product categories and language distributions.
  • Figure 5: The data annotation pipeline for MERIT. We ensure data diversity and quality through open-set deduplication and multi-round filtering procedures in four steps. We first select high-quality products and annotate their attributes, then combine them into query pairs before performing data cleaning to produce MERIT. Details can be found in the appendix.
  • ...and 64 more figures