Exploring the Implicit Semantic Ability of Multimodal Large Language Models: A Pilot Study on Entity Set Expansion
Hebin Wang, Yangning Li, Yinghui Li, Hai-Tao Zheng, Wenhao Jiang, Hong-Gee Kim
TL;DR
This work probes the implicit semantic reasoning of multimodal LLMs through the Multi-modal Entity Set Expansion (MESE) task and presents LUSAR, a two-stage framework that first generates candidate entities under a prefix-tree constraint and then uses listwise ranking prompts to derive a global ranking from multiple short lists. A listwise fine-tuning stage with LoRA, guided by GPT-4-generated data and safety-enhancing data, further strengthens ranking robustness. Experiments on the MESED dataset show that multimodal LLMs outperform unimodal baselines and that LUSAR yields substantial gains on MESE metrics. The work is the first to apply generative MLLMs to ESE and extends listwise ranking to large models. The approach offers a practical way to elicit implicit semantic reasoning in cross-modal retrieval and related expansion tasks, with implications for downstream recommendation and semantic understanding.
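To make the prefix-tree constraint in the first stage concrete, the following is a minimal sketch of how a trie over candidate entity names can restrict which token a generator may emit next. It is an illustration under stated assumptions, not the authors' implementation: the class name `PrefixTree`, the `<eos>` marker, and the whitespace tokenization are all hypothetical stand-ins for a real tokenizer and decoding loop.

```python
from typing import Dict, Iterable, List, Sequence


class PrefixTree:
    """Trie over the token sequences of the allowed entity names."""

    END = "<eos>"  # hypothetical end-of-entity marker (assumption)

    def __init__(self, entities: Iterable[Sequence[str]]) -> None:
        self.root: Dict[str, dict] = {}
        for tokens in entities:
            node = self.root
            for tok in tokens:
                node = node.setdefault(tok, {})
            node[self.END] = {}  # mark that a complete entity ends here

    def allowed_next(self, prefix: Sequence[str]) -> List[str]:
        """Return the tokens that may follow `prefix`.

        An empty list means the prefix has left the tree, so a constrained
        decoder would never have been allowed to produce it.
        """
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []
            node = node[tok]
        return sorted(node)


# Toy candidate vocabulary: whitespace tokens stand in for tokenizer ids.
vocab = ["New York".split(), "New Orleans".split(), "Boston".split()]
tree = PrefixTree(vocab)
```

In a real decoder, the list returned by `allowed_next` would be used to mask the model's next-token distribution at each step, so that every generated string is guaranteed to be an entity from the candidate vocabulary.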
Abstract
The rapid development of multimodal large language models (MLLMs) has brought significant improvements to a wide range of tasks in real-world applications. However, MLLMs still exhibit certain limitations in extracting implicit semantic information. In this paper, we apply MLLMs to the Multi-modal Entity Set Expansion (MESE) task, which aims to expand a handful of seed entities with new entities belonging to the same semantic class, where each entity is accompanied by multi-modal information. We explore the ability of MLLMs to understand implicit semantic information at entity-level granularity through the MESE task, and we introduce LUSAR, a listwise ranking method that maps local scores to global rankings. LUSAR significantly improves MLLMs' performance on the MESE task, marking the first use of generative MLLMs for ESE tasks and extending the applicability of listwise ranking.
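The idea of mapping local scores to a global ranking can be sketched as follows. The abstract does not specify the aggregation rule, so this example swaps in a simple normalized Borda-style mean as an assumption: each short list is taken to be a model-produced ordering, each entity receives a local score from its position in that list, and the scores are averaged across every list in which the entity appears. The function name `global_ranking` and the example entities are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def global_ranking(short_lists: Sequence[Sequence[str]]) -> List[str]:
    """Merge several locally ranked short lists into one global ranking.

    Assumed scoring (not taken from the paper): position `pos` in a list of
    length `n` earns (n - pos) / n, and an entity's global score is the mean
    of its local scores. Ties are broken alphabetically.
    """
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for ranked in short_lists:
        n = len(ranked)
        for pos, entity in enumerate(ranked):
            totals[entity] += (n - pos) / n  # local score in (0, 1]
            counts[entity] += 1
    mean = {entity: totals[entity] / counts[entity] for entity in totals}
    return sorted(mean, key=lambda entity: (-mean[entity], entity))


# Two overlapping short lists, each ordered best-first by a (mock) ranker.
lists = [
    ["Berlin", "Paris", "Tokyo", "Oslo"],
    ["Berlin", "Lima", "Paris", "Oslo"],
]
ranking = global_ranking(lists)
```

Averaging rather than summing keeps entities that happen to appear in more short lists from being favored purely by exposure; other aggregation rules would be equally plausible here.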
