Caption Injection for Optimization in Generative Search Engine

Xiaolu Chen, Yong Liao

TL;DR

The paper addresses the need to optimize subjective content visibility in Generative Search Engines (GSEs) by extending beyond text-only optimization to multimodal optimization. It introduces Caption Injection, a three-stage, prompt-driven pipeline that maps visual semantics from images into textual content to enhance cross-modal optimization in MRAG settings. The method extends G-SEO from unimodal to multimodal contexts and is evaluated on the MRAMG benchmark using the G-Eval framework, showing consistent improvements over text-based baselines. The work demonstrates the practical value of cross-modal semantic fusion for user-perceived content visibility and outlines avenues for deeper cross-modal fusion and cross-model adaptation in future research.

Abstract

Generative Search Engines (GSEs) leverage Retrieval-Augmented Generation (RAG) techniques and Large Language Models (LLMs) to integrate multi-source information and provide users with accurate and comprehensive responses. Unlike traditional search engines that present results in ranked lists, GSEs shift users' attention from sequential browsing to content-driven subjective perception, driving a paradigm shift in information retrieval. In this context, enhancing the subjective visibility of content through Generative Search Engine Optimization (G-SEO) methods has emerged as a new research focus. With the rapid advancement of Multimodal Retrieval-Augmented Generation (MRAG) techniques, GSEs can now efficiently integrate text, images, audio, and video, producing richer responses that better satisfy complex information needs. Existing G-SEO methods, however, remain limited to text-based optimization and fail to fully exploit multimodal data. To address this gap, we propose Caption Injection, the first multimodal G-SEO approach, which extracts captions from images and injects them into textual content, integrating visual semantics to enhance the subjective visibility of content in generative search scenarios. We systematically evaluate Caption Injection on MRAMG, a benchmark for MRAG, under both unimodal and multimodal settings. Experimental results show that Caption Injection significantly outperforms text-only G-SEO baselines under the G-Eval metric, demonstrating the necessity and effectiveness of multimodal integration in G-SEO to improve user-perceived content visibility.

Paper Structure

This paper contains 20 sections, 1 equation, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Comparison of result presentation across different types of search engines. Traditional search engines (blue section on the left) display retrieved web content sources in a ranked list, where higher-ranked results are typically more relevant to the query. GSEs retrieve relevant content sources and leverage LLMs to generate comprehensive responses with cited references. Compared with unimodal GSEs (green section in the middle) that process only textual information, multimodal GSEs (yellow section on the right) jointly interpret textual and visual information, producing responses with richer semantics and higher information density.
  • Figure 2: Illustration of the Caption Injection pipeline. The image of the web content source is first captioned by visual-language models (VLMs), mapping visual representations into the natural language space. The rewritten caption is then injected into the textual content using LLMs, enabling G-SEO with integrated multimodal information.
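
The data flow described for Figure 2 can be sketched as three chained steps: caption the image, rewrite the caption, and inject it into the text. This is only an illustrative sketch, not the authors' implementation. The stage split is inferred from the figure caption, and every name here (`caption_injection`, `toy_captioner`, `toy_rewriter`, `toy_injector`) is hypothetical. The toy functions stand in for the prompt-driven VLM and LLM calls, whose prompts and models are not given in this summary.

```python
from typing import Callable

# Hypothetical stage signatures; the paper's actual prompts and models are not shown here.
Captioner = Callable[[str], str]        # image reference -> raw caption (VLM in the paper)
Rewriter = Callable[[str, str], str]    # (raw caption, source text) -> rewritten caption
Injector = Callable[[str, str], str]    # (source text, rewritten caption) -> optimized text


def caption_injection(
    text: str,
    image_ref: str,
    caption_image: Captioner,
    rewrite_caption: Rewriter,
    inject_caption: Injector,
) -> str:
    """Chain the three assumed stages: caption, rewrite, inject."""
    raw_caption = caption_image(image_ref)
    refined_caption = rewrite_caption(raw_caption, text)
    return inject_caption(text, refined_caption)


# Toy stand-ins so the sketch runs without any model access.
def toy_captioner(image_ref: str) -> str:
    return f"a photo stored at {image_ref}"


def toy_rewriter(caption: str, text: str) -> str:
    # A real rewriter would adapt the caption to the text; this one only tidies it.
    return caption.capitalize().rstrip(".") + "."


def toy_injector(text: str, caption: str) -> str:
    # A real injector would weave the caption into the text; this one appends it.
    return f"{text.rstrip()} {caption}"


optimized = caption_injection(
    "Solar panels convert sunlight into electricity.",
    "panel.jpg",
    toy_captioner,
    toy_rewriter,
    toy_injector,
)
print(optimized)
```

Passing the stages in as callables keeps the sketch independent of any particular VLM or LLM client, so each stand-in could be swapped for a real model call.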