
ATFLRec: A Multimodal Recommender System with Audio-Text Fusion and Low-Rank Adaptation via Instruction-Tuned Large Language Model

Zezheng Qin

TL;DR

The paper addresses cold-start and efficiency challenges in recommender systems by integrating audio and text modalities into an instruction-tuned large language model using Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning. It introduces ATFLRec, a multimodal framework that explores multiple LoRA configurations and fusion strategies, including separate fine-tuning for audio and text and Mel-filterbank-based audio embeddings. Empirical results on the MicroLens dataset show that ATFLRec outperforms traditional deep and graph-based baselines in few-shot settings, with optimal performance when audio and text LoRA modules are trained separately and fused, particularly using maximum pooling and 80 FBanks. The study provides actionable insights into multimodal data fusion, LoRA-based fine-tuning, and instruction-tuned LLMs for scalable, personalized recommendations with limited training data.
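As a rough illustration of the maximum pooling mentioned above, the sketch below collapses a variable-length sequence of 80-dimensional filterbank frames into one fixed-size vector by taking the per-bin maximum over time. The function name `max_pool_frames`, the clip length, and the random input are assumptions made for illustration; the paper's actual feature extractor and fusion layers are not reproduced here.

```python
import numpy as np


def max_pool_frames(frames: np.ndarray) -> np.ndarray:
    """Collapse (num_frames, num_bins) features to (num_bins,) by max over time."""
    if frames.ndim != 2 or frames.shape[0] == 0:
        raise ValueError("expected a non-empty (num_frames, num_bins) array")
    return frames.max(axis=0)


rng = np.random.default_rng(0)
# Hypothetical clip: 250 frames, each with 80 Mel filterbank energies.
fbank = rng.standard_normal((250, 80))
pooled = max_pool_frames(fbank)
```

The pooled vector has the same length regardless of clip duration, which is what lets audio clips of different lengths share one downstream projection.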

Abstract

Recommender Systems (RS) play a pivotal role in boosting user satisfaction by providing personalized product suggestions in domains such as e-commerce and entertainment. This study examines the integration of multimodal data, namely text and audio, into large language models (LLMs) with the aim of enhancing recommendation performance. Traditional text and audio recommenders encounter limitations such as the cold-start problem, and recent advancements in LLMs, while promising, are computationally expensive. To address these issues, Low-Rank Adaptation (LoRA) is introduced, which enhances efficiency without compromising performance. The ATFLRec framework is proposed to integrate audio and text modalities into a multimodal recommendation system, utilizing various LoRA configurations and modality fusion techniques. Results indicate that ATFLRec outperforms baseline models, including traditional and graph neural network-based approaches, achieving higher AUC scores. Furthermore, separate fine-tuning of audio and text data with distinct LoRA modules yields optimal performance, with different pooling methods and Mel filter bank numbers significantly impacting performance. This research offers valuable insights into optimizing multimodal recommender systems and advancing the integration of diverse data modalities in LLMs.
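To make the efficiency argument for LoRA concrete, here is a minimal NumPy sketch of the standard LoRA forward pass, in which a frozen weight `W` is augmented by a trainable low-rank product scaled by `alpha / r`. The layer sizes, rank, and `alpha` value are assumptions chosen for illustration and are not taken from the paper; the helper name `lora_forward` is likewise hypothetical.

```python
import numpy as np


def lora_forward(x, W, A, B, alpha):
    """Return x @ (W + (alpha / r) * B @ A).T without forming the merged weight."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


# Illustrative sizes only (assumed, not from the paper).
d_out, d_in, r = 64, 128, 8
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-initialized
x = rng.standard_normal((4, d_in))          # batch of 4 input vectors

y = lora_forward(x, W, A, B, alpha=16)

full_params = W.size            # parameters updated by full fine-tuning
lora_params = A.size + B.size   # parameters updated by LoRA
```

With these sizes, LoRA trains 1,536 parameters for the layer instead of 8,192, and because `B` starts at zero the adapted layer initially reproduces the frozen layer's output exactly.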

Paper Structure (17 sections, 4 figures, 5 tables)