Multimodal Generative Retrieval Model with Staged Pretraining for Food Delivery on Meituan

Boyu Chen, Tai Guo, Weiyu Cui, Yuqing Li, Xingxing Wang, Chuan Shi, Cheng Yang

TL;DR

This work tackles multimodal retrieval for food delivery by identifying and addressing two problems in joint training: modality dominance and the one-epoch problem. It proposes SMGR, a three-component framework that combines staged pretraining, residual-quantized semantic IDs (SIDs), and SID-oriented generative and discriminative fine-tuning to make better use of multimodal features and ease deployment. Offline results on large-scale Meituan data show gains over mainstream baselines of 2.17% to 3.80% in Recall and 2.09% to 5.10% in NDCG at cutoffs 5, 10, and 20, while online A/B testing reports a 1.12% revenue uplift and a 1.02% CTR uplift. The approach offers a blueprint for integrating multimodal signals into real-world retrieval systems while balancing accuracy with deployment efficiency.

Abstract

Multimodal retrieval models are becoming increasingly important in scenarios such as food delivery, where rich multimodal features can meet diverse user needs and enable precise retrieval. Mainstream approaches typically employ a dual-tower architecture between queries and items, and perform joint optimization of intra-tower and inter-tower tasks. However, we observe that joint optimization often leads to certain modalities dominating the training process, while other modalities are neglected. In addition, inconsistent training speeds across modalities can easily result in the one-epoch problem. To address these challenges, we propose a staged pretraining strategy, which guides the model to focus on specialized tasks at each stage, enabling it to effectively attend to and utilize multimodal features, and allowing flexible control over the training process at each stage to avoid the one-epoch problem. Furthermore, to better utilize the semantic IDs that compress high-dimensional multimodal embeddings, we design both generative and discriminative tasks to help the model understand the associations between SIDs, queries, and item features, thereby improving overall performance. Extensive experiments on large-scale real-world Meituan data demonstrate that our method achieves improvements of 3.80%, 2.64%, and 2.17% on R@5, R@10, and R@20, and 5.10%, 4.22%, and 2.09% on N@5, N@10, and N@20 compared to mainstream baselines. Online A/B testing on the Meituan platform shows that our approach achieves a 1.12% increase in revenue and a 1.02% increase in click-through rate, validating the effectiveness and superiority of our method in practical applications.
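
The R@K and N@K figures quoted in the abstract refer to Recall and NDCG at cutoff K. As a minimal sketch of how such metrics are commonly computed for a single query, assuming binary relevance (the paper's exact evaluation protocol is not given in this summary, and the function names `recall_at_k` and `ndcg_at_k` are illustrative, not taken from the authors' code):

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    # A hit at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked[:k])
        if item in relevant
    )
    # The ideal ranking places all relevant items first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: four retrieved items, two of which are relevant.
ranked_items = ["a", "b", "c", "d"]
relevant_items = {"b", "d"}
print(recall_at_k(ranked_items, relevant_items, 2))
print(ndcg_at_k(ranked_items, relevant_items, 2))
```

Recall rewards only the presence of a relevant item in the top K, whereas NDCG also rewards placing it higher, which is why the two metrics can improve by different amounts for the same model.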

Paper Structure

This paper contains 23 sections, 14 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: (a) Loss trends of different objectives during joint optimization. (b) Performance comparison between the jointly optimized model and the model with randomly generated image embeddings.
  • Figure 2: The overall framework of our proposed model with three principal components. 1) Staged Pretraining: High-quality multimodal embeddings are obtained through staged pretraining. 2) Semantic IDs Generation: High-dimensional embeddings are transformed into discrete SIDs to alleviate deployment burden. 3) Semantic IDs Utilization: The model is fine-tuned to adapt to SIDs, thereby enhancing downstream task performance.
  • Figure 3: The Top-2 retrieved items for the query "Peking Duck" produced by different models: (a) baseline Joint-Que2search; (b) our proposed SMGR.
  • Figure 4: Examples of prompt templates used during training, fine-tuning, and inference.
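
The caption of Figure 2 describes transforming high-dimensional embeddings into discrete SIDs, and the TL;DR calls these IDs residual-quantized. The sketch below illustrates the general idea of residual quantization only: each level picks the nearest code for whatever the previous levels left unexplained. The codebooks here are tiny and hand-picked for illustration; how the paper learns its codebooks, and their sizes and depth, are not stated in this summary, and all names (`nearest`, `residual_quantize`, `reconstruct`) are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def residual_quantize(embedding, codebooks):
    """Return one code index per level plus the final unexplained residual."""
    residual = list(embedding)
    sid = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        sid.append(idx)
        # Subtract the chosen code so the next level quantizes what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return sid, residual


def reconstruct(sid, codebooks):
    """Approximate the original embedding by summing the selected codes."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(sid, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hand-picked 2-D codebooks (assumption): a coarse level and a fine level.
codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0]],
]
sid, residual = residual_quantize([1.2, 0.9], codebooks)
print(sid)                          # [3, 1]
print(reconstruct(sid, codebooks))  # [1.25, 1.0]
```

The resulting tuple of indices is a compact stand-in for the embedding: items whose embeddings are close tend to share the leading indices, which is what makes such IDs usable as discrete tokens in a generative model.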