CUE-M: Contextual Understanding and Enhanced Search with Multimodal Large Language Model
Dongyoung Go, Taesun Whang, Chanhee Lee, Hwa-Yeon Kim, Sunghoon Park, Seunghwan Ji, Jinho Kim, Dongchan Kim, Young-Bum Kim
TL;DR
CUE-M addresses robust multimodal information retrieval by combining Retrieval-Augmented Generation with a multi-stage pipeline that interprets image context and user intent. It introduces image-enriched retrieval, intent refinement, contextual query generation, API integration, and a comprehensive safety filter, all coordinated by an automatic prompt-tuning mechanism. Empirical results on real-world and public benchmarks show that CUE-M outperforms baselines on knowledge-heavy queries while maintaining safety performance on par with existing models. The work demonstrates the practical potential of a service-level, multimodal RAG framework for complex, real-world information needs where external knowledge access and safety are critical.
Abstract
The integration of Retrieval-Augmented Generation (RAG) with Multimodal Large Language Models (MLLMs) has revolutionized information retrieval and expanded the practical applications of AI. However, current systems struggle to interpret user intent accurately, to employ diverse retrieval strategies, and to filter unintended or inappropriate responses effectively, which limits their usefulness. This paper introduces Contextual Understanding and Enhanced Search with MLLM (CUE-M), a novel multimodal search framework that addresses these challenges through a multi-stage pipeline comprising image context enrichment, intent refinement, contextual query generation, external API integration, and relevance-based filtering. CUE-M incorporates a robust filtering pipeline that combines image-based, text-based, and multimodal classifiers and dynamically adapts to instance- and category-specific concerns defined by organizational policies. Extensive experiments on real-world datasets and on public benchmarks for knowledge-based VQA and safety demonstrate that CUE-M outperforms baselines and establishes new state-of-the-art results, advancing the capabilities of multimodal retrieval systems.
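To make the ordering of the stages concrete, the pipeline described above can be sketched roughly as follows. This is only an illustrative skeleton under stated assumptions, not the authors' implementation: every name here (SearchState, run_pipeline, is_blocked, and the image_search, search_api, and score callables) is hypothetical, the image is represented by a plain text description, and each stage is reduced to a trivial stand-in for what would in practice be an MLLM call, an external service, or a trained classifier.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SearchState:
    """Hypothetical container for what accumulates across the stages."""
    image_description: str          # stand-in for the actual image input
    user_query: str
    context: List[str] = field(default_factory=list)
    intent: str = ""
    queries: List[str] = field(default_factory=list)
    documents: List[str] = field(default_factory=list)


def is_blocked(image: str, text: str,
               image_clf: Callable[[str], bool],
               text_clf: Callable[[str], bool],
               multimodal_clf: Callable[[str, str], bool]) -> bool:
    """Assumed combination rule: block if any of the three classifiers flags the input."""
    return image_clf(image) or text_clf(text) or multimodal_clf(image, text)


def run_pipeline(image_description: str, user_query: str,
                 image_search: Callable[[str], List[str]],
                 search_api: Callable[[str], List[str]],
                 score: Callable[[str, str], float],
                 threshold: float = 0.5) -> SearchState:
    state = SearchState(image_description, user_query)

    # 1. Image context enrichment: look up information related to the image.
    state.context = image_search(state.image_description)

    # 2. Intent refinement: tie the user's text to what the image shows.
    state.intent = f"{state.user_query} (about: {state.image_description})"

    # 3. Contextual query generation: one query per piece of image context.
    state.queries = [state.intent] + [
        f"{state.user_query} {c}" for c in state.context
    ]

    # 4. External API integration: gather candidate documents for every query.
    for query in state.queries:
        state.documents.extend(search_api(query))

    # 5. Relevance-based filtering: keep documents scoring at or above the threshold.
    state.documents = [
        d for d in state.documents if score(state.intent, d) >= threshold
    ]
    return state


if __name__ == "__main__":
    # Toy stand-ins for the external components (all assumptions).
    result = run_pipeline(
        "photo of a tower",
        "how tall is this?",
        image_search=lambda desc: ["Eiffel Tower"],
        search_api=lambda q: [f"doc for {q}"],
        score=lambda intent, doc: 1.0 if "Eiffel" in doc else 0.0,
    )
    print(result.queries)
    print(result.documents)
```

In this sketch the safety check is kept separate from the retrieval stages so that it can be applied to the incoming request before any search is issued; where such a check sits in the actual system, and how its classifiers are combined, is not specified in the abstract.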
