Qilin: A Multimodal Information Retrieval Dataset with APP-level User Sessions

Jia Chen, Qian Dong, Haitao Li, Xiaohui He, Yan Gao, Shaosheng Cao, Yi Wu, Ping Yang, Chen Xu, Yao Hu, Qingyao Ai, Yiqun Liu

TL;DR

Qilin addresses the scarcity of realistic multimodal information retrieval data by collecting APP-level sessions from Xiaohongshu, including heterogeneous results and direct interactions with the Deep Query Answering (DQA) module. It introduces a comprehensive data construction pipeline, scale statistics, and a rich data schema, enabling rigorous study of multimodal retrieval and retrieval-augmented generation (RAG) workflows. The accompanying analyses reveal user demographics, engagement patterns, and cross-service transitions, while baseline experiments demonstrate the value of incorporating images, video frames, and DQA signals in both retrieval and generation tasks. Overall, Qilin offers a practical benchmark and actionable insights for building more effective, user-aware multimodal search and recommendation (S&R) systems and for evaluating LLM-based RAG approaches in realistic settings.

Abstract

User-generated content (UGC) communities, especially those featuring multimodal content, improve user experiences by integrating visual and textual information into results (or items). The challenge of improving user experiences in complex systems with search and recommendation (S&R) services has drawn significant attention from both academia and industry in recent years. However, the lack of high-quality datasets has limited research progress on multimodal S&R. To address the growing need for better S&R services, we present a novel multimodal information retrieval dataset in this paper, namely Qilin. The dataset is collected from Xiaohongshu, a popular social platform with over 300 million monthly active users and an average search penetration rate of over 70%. In contrast to existing datasets, Qilin offers a comprehensive collection of user sessions with heterogeneous results such as image-text notes, video notes, commercial notes, and direct answers, facilitating the development of advanced multimodal neural retrieval models across diverse task settings. To better model user satisfaction and support the analysis of heterogeneous user behaviors, we also collect extensive APP-level contextual signals and genuine user feedback. Notably, Qilin contains user-favored answers and their referred results for search requests that trigger the Deep Query Answering (DQA) module. This allows not only the training and evaluation of a Retrieval-Augmented Generation (RAG) pipeline, but also the exploration of how such a module affects users' search behavior. Through comprehensive analysis and experiments, we provide interesting findings and insights for further improving S&R systems. We hope that Qilin will significantly contribute to the advancement of multimodal content platforms with S&R services in the future.

Paper Structure

This paper contains 18 sections, 2 equations, 3 figures, 11 tables.

Figures (3)

  • Figure 1: Xiaohongshu leverages a two-column result list for S&R services, retrieving heterogeneous results like image-text, video, and commercial notes. The search system is equipped with a DQA module that provides direct answers to users. Various other modules encourage users to search for topics they might be interested in.
  • Figure 2: The data construction process of Qilin. The front-end log is joined with sampled user IDs to obtain the dataset backbone. Then we collect features for the request, user, and note from various databases. Finally, all content features undergo rigorous filtering by LLMs and human experts.
  • Figure 3: Position bias, session bias, result type distribution, and CTR for two result types w.r.t. ranking positions in search and recommendation scenarios.
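
The construction flow described for Figure 2 (join the front-end log with sampled user IDs, then attach features) and the per-position CTR view described for Figure 3 can be illustrated with a small sketch. This is only a toy illustration under stated assumptions: the field names (`user_id`, `note_id`, `position`, `clicked`, `note_type`), the sample rows, and the helper functions are all hypothetical and are not taken from Qilin's actual schema or pipeline, and the LLM and human filtering step is omitted.

```python
from collections import defaultdict

# Hypothetical front-end log rows; field names are illustrative only.
frontend_log = [
    {"user_id": "u1", "note_id": "n1", "position": 0, "clicked": True},
    {"user_id": "u1", "note_id": "n2", "position": 1, "clicked": False},
    {"user_id": "u2", "note_id": "n1", "position": 0, "clicked": False},
    {"user_id": "u3", "note_id": "n3", "position": 0, "clicked": True},
    {"user_id": "u2", "note_id": "n3", "position": 1, "clicked": False},
]
sampled_user_ids = {"u1", "u2"}  # assumed sample of users
note_features = {                # assumed note feature store
    "n1": {"note_type": "image-text"},
    "n2": {"note_type": "video"},
    "n3": {"note_type": "video"},
}

def build_backbone(log, user_ids):
    """Keep only log rows that belong to sampled users (the join step)."""
    return [row for row in log if row["user_id"] in user_ids]

def attach_note_features(rows, features):
    """Merge note-level features into each row by note_id."""
    return [{**row, **features.get(row["note_id"], {})} for row in rows]

def ctr_by_position(rows):
    """Click-through rate per ranking position: clicks / impressions."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for row in rows:
        shown[row["position"]] += 1
        clicked[row["position"]] += int(row["clicked"])
    return {pos: clicked[pos] / shown[pos] for pos in sorted(shown)}

backbone = build_backbone(frontend_log, sampled_user_ids)
enriched = attach_note_features(backbone, note_features)
ctr = ctr_by_position(enriched)
print(ctr)
```

Grouping the enriched rows by `note_type` before calling `ctr_by_position` would give the per-result-type curves that the Figure 3 caption mentions.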