Enhancing LLMs' Reasoning-Intensive Multimedia Search Capabilities through Fine-Tuning and Reinforcement Learning

Jinzheng Li, Sibo Ju, Yanzhou Su, Hongguang Li, Yiqing Shen

TL;DR

This work addresses the limitations of prompting-based multimodal search by introducing SearchExpert, a two-stage training framework that combines a token-efficient natural language DAG representation (Rep$(G)$) with supervised fine-tuning for searching (SFTS) and reinforcement learning from search feedback (RLSF). A multimedia understanding and generation agent lets the LLM process visual inputs and generate visuals (via BLIP-2 and DALLE-3) during inference, while an automated pipeline builds the SearchExpertBench-25 benchmark to evaluate reasoning in multimedia search. SearchExpert improves accuracy over baselines on FinSearchBench-24 and the new SearchExpertBench-25 (by 36.60% and 54.54% over Perplexity Pro, respectively), with notable token-efficiency gains and favorable human judgments; ablations confirm that SFTS and RLSF are complementary. Together, these contributions support more robust, scalable reasoning across complex, time-sensitive, and multimodal search tasks, with practical implications for real-world information retrieval and decision support.

Abstract

Existing large language model (LLM)-driven search agents typically rely on prompt engineering to decompose user queries into search plans, which limits their effectiveness in complex scenarios that require reasoning. They also suffer from excessive token consumption caused by Python-based search plan representations, and they integrate multimedia elements inadequately in both input processing and response generation. To address these challenges, we introduce SearchExpert, a training method that improves the multimedia search capabilities of LLMs in response to complex search queries. First, we reformulate the search plan in an efficient natural language representation to reduce token consumption, and we propose supervised fine-tuning for searching (SFTS) to adapt LLMs to this representation, together with an automated dataset construction pipeline. Second, to improve reasoning-intensive search capabilities, we propose reinforcement learning from search feedback (RLSF), which uses the search results obtained from LLM-generated plans as reward signals. Third, we propose a multimedia understanding and generation agent that enables the fine-tuned LLM to process visual input and produce visual output during inference. Finally, we establish an automated benchmark construction pipeline and a human evaluation framework. The resulting benchmark, SearchExpertBench-25, comprises 200 multiple-choice questions spanning financial and international news scenarios that require reasoning during search. Experiments demonstrate that SearchExpert outperforms the commercial LLM search method Perplexity Pro by 36.60% on the existing FinSearchBench-24 benchmark and by 54.54% on our proposed SearchExpertBench-25. Human evaluations further confirm its superior readability.
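
To make the idea of a search plan expressed as a DAG more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's actual Rep$(G)$ format: the node IDs, the example sub-queries, the `ID: query <- deps` line syntax, and the helper names `to_text` and `execution_order` are all hypothetical. It shows only that a plan's nodes and dependencies can be written as short text lines and still be executed in dependency order.

```python
from graphlib import TopologicalSorter

# Hypothetical search plan: each node is a sub-query, "deps" lists the
# nodes whose results it needs. None of these names come from the paper.
plan = {
    "A": {"query": "Fed rate decision February 2025", "deps": []},
    "B": {"query": "emerging-market currency reaction", "deps": ["A"]},
    "C": {"query": "US Treasury yield reaction", "deps": ["A"]},
    "D": {"query": "combined effect on capital flows", "deps": ["B", "C"]},
}


def to_text(plan):
    """Serialize the DAG as one line per node: 'ID: query <- dep, dep'."""
    lines = []
    for node_id, node in plan.items():
        deps = ", ".join(node["deps"])
        suffix = f" <- {deps}" if deps else ""
        lines.append(f"{node_id}: {node['query']}{suffix}")
    return "\n".join(lines)


def execution_order(plan):
    """Return node IDs so that every sub-query follows its dependencies."""
    graph = {node_id: set(node["deps"]) for node_id, node in plan.items()}
    return list(TopologicalSorter(graph).static_order())


print(to_text(plan))
print(execution_order(plan))
```

A line-per-node text form like this avoids the class and function-call scaffolding that a Python-based plan would carry, which is the kind of overhead the abstract attributes to existing representations.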

Paper Structure

This paper contains 18 sections, 10 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The two-stage training framework of SearchExpert. (a) Supervised fine-tuning for searching (SFTS): Our automated data construction pipeline begins with Python crawlers collecting recent online content, followed by data cleaning and Q&A generation. We then generate efficient natural language DAG representations of search plans, converting from traditional Python-based formats to our token-efficient representation. The constructed training data is used to fine-tune LLMs to generate structured search plans directly from queries. (b) Reinforcement learning from search feedback (RLSF): Building upon SFTS, our RLSF approach uses similar initial data collection but focuses on constructing more complex, reasoning-intensive search scenarios. The trained models generate search plans that are executed to retrieve information, and search quality is evaluated through our dual-component reward mechanism, which combines semantic similarity and intrinsic quality assessment. The resulting scores guide LLM optimization through PPO.
  • Figure 2: Benchmark construction method and example case from SearchExpertBench-25. The left panel illustrates our five-step pipeline for constructing reasoning-intensive search benchmarks. It begins with LLM-based identification of recent complex topics requiring multi-hop search and causal inference, followed by information retrieval, key information extraction, OpenAI deep research analysis, and finally, question generation. The right panel showcases an example benchmark question from SearchExpertBench-25 examining the global transmission mechanisms of the Federal Reserve's unexpected rate cut in February 2025. This complex scenario features multiple correct answers (A, D, E, F) with explicit reasoning paths for each option, demonstrating how our benchmark evaluates a model's ability to generate effective search plans for navigating ambiguous, multi-variable financial scenarios.
  • Figure 3: Comparison of LLM-driven search methods analyzing a hypothetical April 2025 earthquake in Japan's manufacturing hub. Our SearchExpert (SFTS+RLSF) demonstrates superior reasoning capabilities through efficient natural language DAG representation, producing more comprehensive, long-term strategic analysis with precise component-level impacts and quantitative forecasts. MindSearch offers moderate but simpler dimensional analysis, while FinSearch provides detailed data but more limited causal reasoning. Green annotations indicate accurate reasoning; red shows analytical limitations; blue highlights distinctive information obtained through superior search planning.
  • Figure 4: Human evaluation results comparing SearchExpert implementations across different model sizes against baseline search methods. The total score (red dotted line) reveals that even smaller SearchExpert implementations outperform existing search frameworks, confirming that our two-stage training approach enhances reasoning-intensive search abilities while maintaining strong performance across all human evaluation dimensions.
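
The dual-component reward described in the Figure 1 caption can be sketched as a weighted combination of a semantic-similarity score and an intrinsic quality score. The sketch below is a rough illustration, not the paper's reward: the bag-of-words cosine similarity stands in for whatever semantic measure the authors use, the `quality` argument is assumed to be a score in [0, 1] supplied by some external judge, and the `weight` parameter and function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; a stand-in for a semantic measure."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[token] * vb[token] for token in va)
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_reward(answer: str, reference: str, quality: float,
                  weight: float = 0.5) -> float:
    """Combine similarity to a reference with an intrinsic quality score.

    `quality` is assumed to lie in [0, 1]; `weight` (hypothetical) sets the
    share of the reward that comes from the similarity component.
    """
    if not 0.0 <= quality <= 1.0:
        raise ValueError("quality must be in [0, 1]")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    similarity = cosine_similarity(answer, reference)
    return weight * similarity + (1.0 - weight) * quality


reward = search_reward(
    answer="rate cut weakened the dollar",
    reference="the rate cut weakened the dollar",
    quality=0.8,
)
print(round(reward, 3))
```

In a training loop, a scalar like this would be the per-sample score passed to the PPO update; how the paper actually computes and balances the two components is not specified in this section.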