Multimodal Query Suggestion with Multi-Agent Reinforcement Learning from Human Feedback

Zheng Wang, Bingzheng Gan, Wei Shi

TL;DR

The paper tackles multimodal query suggestion by generating text-based queries from user query images, focusing on intentionality and diversity. It introduces RL4Sugg, a two-agent framework where Agent-I optimizes intentionality via RewardNet and PolicyNet and Agent-D ensures diversity through a diversity-focused policy, all trained with RLHF and LLMs. Across two real-world datasets, RL4Sugg achieves an 18% improvement over strong baselines in generation and better retrieval metrics, with successful deployment in production helping to boost user engagement. The approach demonstrates a practical path for integrating multimodal cues into search engine query formulation and suggests future work extending to additional modalities like audio and video.

Abstract

In the rapidly evolving landscape of information retrieval, search engines strive to provide more personalized and relevant results to users. Query suggestion systems play a crucial role in achieving this goal by assisting users in formulating effective queries. However, existing query suggestion systems mainly rely on textual inputs, potentially limiting the search experience of users who query with images. In this paper, we introduce a novel Multimodal Query Suggestion (MMQS) task, which aims to generate query suggestions based on user query images to improve the intentionality and diversity of search results. We present the RL4Sugg framework, leveraging the power of Large Language Models (LLMs) with Multi-Agent Reinforcement Learning from Human Feedback to optimize the generation process. Through comprehensive experiments, we validate the effectiveness of RL4Sugg, demonstrating an 18% improvement compared to the best existing approach. Moreover, MMQS has been transferred into real-world search engine products, where it has yielded enhanced user engagement. Our research advances query suggestion systems and provides a new perspective on multimodal information retrieval.

Paper Structure

This paper contains 24 sections, 19 equations, 2 figures, and 11 tables.

Figures (2)

  • Figure 1: Illustration of MMQS problem.
  • Figure 2: Training overview of Agent-I and Agent-D. Agent-I trains the RewardNet on three tasks (ISA, ISG, ISM) using learnable query embeddings, while the PolicyNet is trained with RLHF to generate candidate suggestions $S'_1, S'_2, \ldots, S'_N$ for intentionality. Agent-D learns to select diverse suggestions from the candidates via policy gradient and outputs the final $K$ suggestions.
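
To make the last step of Figure 2 concrete, the sketch below picks $K$ diverse suggestions out of $N$ candidates. It is only an illustration of the selection objective, not the paper's method: Agent-D is trained with policy gradient, whereas this stand-in uses a greedy max-min heuristic. The function names `jaccard_distance` and `select_diverse`, the word-level Jaccard distance, the assumption that candidates arrive ranked by intentionality, and the example strings are all hypothetical.

```python
from typing import List


def jaccard_distance(a: str, b: str) -> float:
    """Return 1 - |A & B| / |A | B| over lowercase word sets.

    A stand-in dissimilarity chosen for illustration only.
    """
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def select_diverse(candidates: List[str], k: int) -> List[str]:
    """Greedily pick up to k candidates that are far apart from each other.

    Assumption: candidates are already ranked by intentionality, so the
    first one is always kept. Each later pick maximizes its minimum
    distance to the suggestions chosen so far (ties go to the earlier
    candidate).
    """
    if k <= 0 or not candidates:
        return []
    chosen = [0]
    while len(chosen) < min(k, len(candidates)):
        best_idx, best_score = -1, -1.0
        for i in range(len(candidates)):
            if i in chosen:
                continue
            score = min(
                jaccard_distance(candidates[i], candidates[j]) for j in chosen
            )
            if score > best_score:
                best_idx, best_score = i, score
        chosen.append(best_idx)
    return [candidates[i] for i in chosen]


if __name__ == "__main__":
    # Hypothetical candidate suggestions for one query image.
    candidates = [
        "red running shoes price",
        "red running shoes cost",
        "how to clean running shoes",
        "similar sneakers brands",
    ]
    print(select_diverse(candidates, 3))
```

With these example candidates, the near-duplicate "red running shoes cost" is the one left out when $K = 3$, because it shares most of its words with the top-ranked suggestion. A learned policy, as in the paper, could trade diversity against other signals that a fixed heuristic like this one ignores.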