Efficient Multimodal Streaming Recommendation via Expandable Side Mixture-of-Experts

Yunke Qu, Liang Qu, Tong Chen, Quoc Viet Hung Nguyen, Hongzhi Yin

TL;DR

XSMoE tackles multimodal streaming recommendation by attaching memory-efficient side-tuning modules to frozen ViT and BERT encoders and growing a per-layer Mixture-of-Experts (MoE) as new time windows arrive. A lightweight router blends backbone and expert outputs, while a utilization-based pruning strategy controls growth and maintains efficiency, allowing the model to preserve long-term preferences while learning new modality-specific patterns. Across three real-world datasets, XSMoE achieves higher recommendation quality (HR@10 and NDCG@10) than the baselines, with favorable training time and memory usage; ablations confirm the value of expansion, pruning, and joint multimodal optimization. The work demonstrates a practical path to adaptable, continual multimodal streaming recommender systems (SRSs) that balance performance with computational efficiency; integrating ID embeddings is left to future work.

Abstract

Streaming recommender systems (SRSs) are widely deployed in real-world applications, where user interests shift and new items arrive over time. As a result, effectively capturing users' latest preferences is challenging, as interactions reflecting recent interests are limited and new items often lack sufficient feedback. A common solution is to enrich item representations using multimodal encoders (e.g., BERT or ViT) to extract visual and textual features. However, these encoders are pretrained on general-purpose tasks: they are not tailored to user preference modeling, and they overlook the fact that user tastes toward modality-specific features such as visual styles and textual tones can also drift over time. This presents two key challenges in streaming scenarios: the high cost of fine-tuning large multimodal encoders, and the risk of forgetting long-term user preferences due to continuous model updates. To tackle these challenges, we propose Expandable Side Mixture-of-Experts (XSMoE), a memory-efficient framework for multimodal streaming recommendation. XSMoE attaches lightweight side-tuning modules consisting of expandable expert networks to frozen pretrained encoders and incrementally expands them in response to evolving user feedback. A gating router dynamically combines expert and backbone outputs, while a utilization-based pruning strategy maintains model compactness. By learning new patterns through expandable experts without overwriting previously acquired knowledge, XSMoE effectively captures both cold start and shifting preferences in multimodal features. Experiments on three real-world datasets demonstrate that XSMoE outperforms state-of-the-art baselines in both recommendation quality and computational efficiency.
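The mechanism described in the abstract (expand, route, prune) can be illustrated with a minimal sketch of a single side-network layer. This is not the authors' implementation. The class name `ExpandableSideLayer`, the linear experts, the softmax router, the mean-gate-weight utilization measure, and the `min_utilization` threshold are all assumptions made for illustration, and training is omitted entirely.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


class ExpandableSideLayer:
    """Hypothetical side-network layer: a router mixes backbone and expert outputs."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.experts = []    # each expert is a dim x dim linear map (assumption)
        self.trainable = []  # only the newest expert would receive gradients
        self.router = [self._vector()]  # slot 0 scores the frozen backbone output
        self.usage = []      # running sum of gate weight per expert
        self.steps = 0
        self.expand()

    def _vector(self):
        return [self.rng.gauss(0.0, 0.1) for _ in range(self.dim)]

    def expand(self):
        """Freeze existing experts and add a fresh one for the new time window."""
        self.trainable = [False] * len(self.experts)
        self.experts.append([self._vector() for _ in range(self.dim)])
        self.trainable.append(True)
        self.router.append(self._vector())
        self.usage.append(0.0)

    def forward(self, side_input, backbone_hidden):
        """Blend the backbone hidden state with every expert's output."""
        gates = softmax([dot(w, side_input) for w in self.router])
        candidates = [backbone_hidden] + [
            [dot(row, side_input) for row in expert] for expert in self.experts
        ]
        # Track how much gate weight each expert receives (assumed utilization measure).
        for i in range(len(self.experts)):
            self.usage[i] += gates[i + 1]
        self.steps += 1
        return [
            sum(g * c[j] for g, c in zip(gates, candidates))
            for j in range(self.dim)
        ]

    def prune(self, min_utilization):
        """Drop frozen experts whose mean gate weight is below the threshold."""
        keep = [
            i for i in range(len(self.experts))
            if self.trainable[i]
            or self.usage[i] / max(self.steps, 1) >= min_utilization
        ]
        self.experts = [self.experts[i] for i in keep]
        self.trainable = [self.trainable[i] for i in keep]
        self.router = [self.router[0]] + [self.router[i + 1] for i in keep]
        self.usage = [0.0 for _ in keep]  # restart utilization tracking
        self.steps = 0
        return len(keep)
```

In this sketch, a streaming loop would call `expand()` when a new time window arrives, run `forward()` on the window's interactions, and then call `prune()` to remove experts the router rarely selects. Because older experts are frozen rather than overwritten, previously learned patterns remain available to the router.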

Paper Structure

This paper contains 29 sections, 8 equations, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: The model structure after one expansion. The visual encoder ViT and the textual encoder BERT are dashed because we use their precomputed outputs. The actual models are never loaded into memory afterwards. TMR denotes a Transformer layer. G denotes the router. In this example, every two Transformer layers are grouped together so the number of layers in the side-tuning network is halved.
  • Figure 2: Sensitivity analysis of the hyperparameter $\tau$ w.r.t. HR@10 and NDCG@10 on the Amazon Home, Amazon Electronics, and H&M datasets.
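The layer grouping mentioned in the Figure 1 caption can be sketched as follows, assuming the encoder's per-layer hidden states have already been precomputed and stored. The caption does not say how the layers within a group are combined. Averaging is an assumption made here for illustration, and `group_hidden_states` is a hypothetical helper name.

```python
def group_hidden_states(hidden_states, group_size=2):
    """Merge consecutive layer outputs so the side network needs fewer layers.

    hidden_states: list of per-layer vectors (precomputed encoder outputs).
    Each run of `group_size` layers is averaged (assumed aggregation).
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    grouped = []
    for start in range(0, len(hidden_states), group_size):
        chunk = hidden_states[start:start + group_size]
        dim = len(chunk[0])
        grouped.append(
            [sum(layer[j] for layer in chunk) / len(chunk) for j in range(dim)]
        )
    return grouped
```

With `group_size=2`, twelve precomputed Transformer layer outputs would feed a six-layer side network, matching the halving described in the caption.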