AutoS$^2$earch: Unlocking the Reasoning Potential of Large Models for Web-based Source Search
Zhengqiu Zhu, Yatai Ji, Jiaheng Huang, Yong Zhao, Sihang Qiu, Rusheng Ju
TL;DR
The paper addresses the challenge of embedding source search within web-based risk management systems to enable timely hazard localization. It introduces AutoS$^2$earch, a zero-shot framework that employs a multi-modal LLM to translate visual observations into language and uses chain-of-thought reasoning to choose among four directional actions, while preserving the underlying search algorithm (Infotaxis). The results show AutoS$^2$earch achieving 95–98% of the effectiveness of human-AI collaborative search across 20 benchmarks, with substantial reductions in labor costs and faster response times. The work demonstrates robustness across multiple MLLMs and LLMs and discusses limitations, including environmental complexity and potential hallucinations, while outlining future directions for dynamic environments, visual-thinking augmentation, and human-AI alignment. Overall, the approach highlights how web engineering and large-model reasoning can enable autonomous, scalable source-search capabilities in industrial settings.
Abstract
Web-based management systems have been widely used in risk control and industrial safety. However, effectively integrating source search capabilities into these systems, so that decision-makers can locate and address a hazard (e.g., a gas leak), remains a challenge. While prior efforts have explored using web crowdsourcing and AI algorithms for source search decision support, these approaches suffer from the overhead of recruiting human participants and from slow response times in time-sensitive situations. To address this, we introduce AutoS$^2$earch, a novel framework leveraging large models for zero-shot source search in web applications. AutoS$^2$earch operates on a simplified visual environment projected through a web-based display, utilizing a chain-of-thought prompt designed to emulate human reasoning. The multi-modal large language model (MLLM) dynamically converts visual observations into language descriptions, enabling the LLM to perform linguistic reasoning over four directional choices. Extensive experiments demonstrate that AutoS$^2$earch achieves performance nearly equivalent to human-AI collaborative source search while eliminating dependency on crowdsourced labor. Our work offers valuable insights into using web engineering to design such autonomous systems in other industrial applications.
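The perception-then-reasoning step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `describe_scene` is a hypothetical rule-based stand-in for the MLLM, `fake_llm` is an offline stub for the LLM, the prompt wording and grid encoding are assumptions, and the underlying Infotaxis search is omitted entirely.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# The four directional actions; the (row, col) offsets are an assumption.
ACTIONS: Dict[str, Tuple[int, int]] = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}


def describe_scene(grid: List[List[int]], pos: Tuple[int, int]) -> str:
    """Hypothetical stand-in for the MLLM: turn a grid view into text.

    Grid cells: 0 = free, 1 = obstacle (an assumed encoding).
    """
    rows, cols = len(grid), len(grid[0])
    parts = []
    for name, (dr, dc) in ACTIONS.items():
        r, c = pos[0] + dr, pos[1] + dc
        if not (0 <= r < rows and 0 <= c < cols):
            state = "outside the map"
        elif grid[r][c] == 1:
            state = "blocked by an obstacle"
        else:
            state = "free"
        parts.append(f"{name}: {state}")
    return "; ".join(parts)


def build_prompt(description: str) -> str:
    """Assemble an illustrative chain-of-thought prompt (wording is assumed)."""
    return (
        "You are helping a robot locate a hazard source.\n"
        f"Surroundings: {description}.\n"
        "Think step by step, then finish with 'Answer: <up|down|left|right>'."
    )


def parse_action(reply: str) -> Optional[str]:
    """Extract the final directional choice, or None if the reply has none."""
    match = re.search(r"answer:\s*(up|down|left|right)\b", reply, re.IGNORECASE)
    return match.group(1).lower() if match else None


def choose_action(
    grid: List[List[int]], pos: Tuple[int, int], llm: Callable[[str], str]
) -> Optional[str]:
    """Describe the scene, query the language model, and parse its choice."""
    return parse_action(llm(build_prompt(describe_scene(grid, pos))))


def fake_llm(prompt: str) -> str:
    """Offline stub so the sketch runs without any model API."""
    return (
        "Up is outside the map and right is blocked, "
        "so moving down is safest. Answer: down"
    )


grid = [[0, 1], [0, 0]]
action = choose_action(grid, (0, 0), fake_llm)
```

Returning `None` for an unparseable reply leaves room for a fallback policy, which matters given the hallucination risk the TL;DR mentions; in a real system `llm` would wrap an actual model call.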
