A Survey of Multimodal Composite Editing and Retrieval

Suyan Li, Fuxiang Huang, Lei Zhang

TL;DR

This survey addresses the problem of retrieving and editing multimedia content across multiple modalities by introducing a three-part taxonomy: image-text composite editing, image-text composite retrieval, and other multimodal composite retrieval. It systematically catalogs methods, from GAN-based and diffusion-based editing to CNN-, transformer-, and VLP-based retrieval as well as hybrid approaches, and highlights a shift toward large-scale, cross-modal pre-trained models and zero-shot capabilities. The authors compile over 130 methods, summarize benchmarks and experimental results, and discuss challenges such as modality gaps, robustness, and scalability, offering concrete directions for future work. The work provides a framework and a public project page that help researchers track progress and compare techniques across evolving multimodal tasks, with practical implications for search, content creation, and contextual retrieval.

Abstract

In the real world, where information is abundant and diverse across different modalities, understanding and utilizing various data types to improve retrieval systems is a key focus of research. Multimodal composite retrieval integrates diverse modalities such as text, image, and audio to provide more accurate, personalized, and contextually relevant results. To facilitate a deeper understanding of this promising direction, this survey explores multimodal composite editing and retrieval in depth, covering image-text composite editing, image-text composite retrieval, and other multimodal composite retrieval. In this survey, we systematically organize the application scenarios, methods, benchmarks, experiments, and future directions. Multimodal learning is a hot topic in the large model era, and several surveys on multimodal learning and vision-language models with transformers have been published in the PAMI journal. To the best of our knowledge, this survey is the first comprehensive review of the literature on multimodal composite retrieval, and it is a timely complement, from the perspective of multimodal fusion, to existing reviews. To help readers quickly track this field, we have built a project page for this survey, which can be found at https://github.com/fuxianghuang1/Multimodal-Composite-Editing-and-Retrieval.

Paper Structure (24 sections, 3 figures, 11 tables)

Figures (3)

  • Figure 1: Examples of the multimodal composite image retrieval (MCIR) task.
  • Figure 2: A new taxonomy of multimodal composite editing and retrieval approaches, organized along three orthogonal aspects in this survey.
  • Figure 3: Illustration of the basic technical framework of image-text composite retrieval.
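
For readers without access to Figure 3, the general idea of image-text composite retrieval can be sketched as follows: a reference image and a modification text are each embedded, the two embeddings are fused into one query, and gallery candidates are ranked by similarity to that query. This is only a toy illustration under stated assumptions, not the framework from the paper: the three-dimensional embeddings are hand-written stand-ins for encoder outputs, and the function names (`compose_query`, `rank_candidates`), the `text_weight` parameter, and the weighted-sum fusion are hypothetical choices made for this example.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def compose_query(image_emb, text_emb, text_weight=0.5):
    """Fuse image and text embeddings with a weighted sum.

    Assumption: a weighted sum is the simplest possible fusion and stands in
    for the learned fusion modules that real systems typically use.
    """
    img = normalize(image_emb)
    txt = normalize(text_emb)
    fused = [(1.0 - text_weight) * i + text_weight * t for i, t in zip(img, txt)]
    return normalize(fused)


def rank_candidates(query_emb, candidates):
    """Return candidate ids sorted by cosine similarity to the query, best first."""
    scores = {}
    for cand_id, cand_emb in candidates.items():
        unit = normalize(cand_emb)
        scores[cand_id] = sum(q * c for q, c in zip(query_emb, unit))
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings (hypothetical): axis 0 ~ content of the reference image,
# axis 1 ~ the attribute requested by the modification text, axis 2 ~ unrelated.
reference_image = [1.0, 0.0, 0.0]
modification_text = [0.0, 1.0, 0.0]

gallery = {
    "unchanged": [1.0, 0.0, 0.0],   # same as the reference, edit not applied
    "modified": [1.0, 1.0, 0.0],    # reference content plus requested attribute
    "unrelated": [0.0, 0.0, 1.0],   # shares nothing with the query
}

query = compose_query(reference_image, modification_text)
ranking = rank_candidates(query, gallery)
print(ranking)
```

In this toy setup the candidate that combines the reference content with the requested attribute ranks first, which is the behavior a composite retrieval system aims for; replacing the hand-written vectors with outputs of actual image and text encoders is left as the obvious next step.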