IISAN: Efficiently Adapting Multimodal Representation for Sequential Recommendation with Decoupled PEFT

Junchen Fu, Xuri Ge, Xin Xin, Alexandros Karatzoglou, Ioannis Arapakis, Jie Wang, Joemon M. Jose

TL;DR

This work tackles the practical inefficiency of adapting large multimodal foundation models for sequential recommendation. It proposes IISAN, a Decoupled PEFT architecture with separate intra- and inter-modal side adapted networks (SANs), plus a caching strategy and LayerDrop, to sharply reduce GPU memory and training time while maintaining or improving accuracy relative to full fine-tuning and traditional PEFT methods. A new composite metric, TPME, combines training time, parameter count, and GPU memory to reflect real-world efficiency. Across three multimodal datasets and multiple backbone configurations, IISAN achieves comparable or superior recommendation performance with substantially better efficiency, and ablations confirm the value of each component. The work provides open-source code and presents a practical framework for efficient multimodal representation learning in sequential recommendation.

Abstract

Multimodal foundation models are transformative in sequential recommender systems, leveraging powerful representation learning capabilities. While Parameter-efficient Fine-tuning (PEFT) is commonly used to adapt foundation models for recommendation tasks, most research prioritizes parameter efficiency, often overlooking critical factors like GPU memory efficiency and training speed. Addressing this gap, our paper introduces IISAN (Intra- and Inter-modal Side Adapted Network for Multimodal Representation), a simple plug-and-play architecture using a Decoupled PEFT structure and exploiting both intra- and inter-modal adaptation. IISAN matches the performance of full fine-tuning (FFT) and state-of-the-art PEFT. More importantly, it significantly reduces GPU memory usage, from 47 GB to just 3 GB, for multimodal sequential recommendation tasks. Additionally, it reduces training time per epoch from 443 s to 22 s compared to FFT. This is also a notable improvement over Adapter and LoRA, which require 37-39 GB of GPU memory and 350-380 s per epoch for training. Furthermore, we propose a new composite efficiency metric, TPME (Training-time, Parameter, and GPU Memory Efficiency), to alleviate the prevalent misconception that "parameter efficiency represents overall efficiency". TPME provides more comprehensive insights into practical efficiency comparisons between different methods. In addition, we give an accessible efficiency analysis of all PEFT and FFT approaches, which demonstrates the superiority of IISAN. We release our code and other materials at https://github.com/GAIR-Lab/IISAN.
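
To make the idea of a composite efficiency metric concrete, here is a minimal Python sketch of a TPME-style score. The abstract does not give the exact TPME formula, so the weighted sum of FFT-normalized costs below, the weights, and the names `Cost` and `composite_score` are all assumptions for illustration. The time and memory figures come from the abstract (midpoints are used for the Adapter/LoRA ranges); the trainable-parameter fractions are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Cost:
    seconds_per_epoch: float
    trainable_param_fraction: float  # HYPOTHETICAL, relative to FFT = 1.0
    gpu_memory_gb: float


def composite_score(cost: Cost, baseline: Cost,
                    weights: tuple = (0.4, 0.2, 0.4)) -> float:
    """Weighted sum of baseline-normalized costs; lower is more efficient.

    ASSUMPTION: this form and these weights are illustrative only and are
    not taken from the paper's definition of TPME.
    """
    w_time, w_param, w_mem = weights
    return (
        w_time * cost.seconds_per_epoch / baseline.seconds_per_epoch
        + w_param * cost.trainable_param_fraction / baseline.trainable_param_fraction
        + w_mem * cost.gpu_memory_gb / baseline.gpu_memory_gb
    )


# Time and memory values are from the abstract; parameter fractions are made up.
fft = Cost(seconds_per_epoch=443.0, trainable_param_fraction=1.0, gpu_memory_gb=47.0)
embedded_peft = Cost(seconds_per_epoch=365.0, trainable_param_fraction=0.05, gpu_memory_gb=38.0)
iisan = Cost(seconds_per_epoch=22.0, trainable_param_fraction=0.05, gpu_memory_gb=3.0)

scores = {
    "FFT": composite_score(fft, fft),
    "Adapter/LoRA (midpoint)": composite_score(embedded_peft, fft),
    "IISAN": composite_score(iisan, fft),
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

In this sketch the two PEFT variants are given the same hypothetical parameter fraction, so they are indistinguishable on parameter count alone; the composite score separates them only through the time and memory terms, which is the point the abstract makes about parameter efficiency not representing overall efficiency.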

Paper Structure

This paper contains 16 sections, 16 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Comparisons among Full Fine-tuning (FFT), Embedded PEFT, and Decoupled PEFT for feature representation learning. Traditional Embedded PEFT (EPEFT) methods, e.g., Adapter (Houlsby et al., 2019) and LoRA (Hu et al., 2021), embed additional trainable parameters into the foundation model, which reduces the number of trainable parameters but still leaves a heavy computation graph during backpropagation. The proposed IISAN belongs to Decoupled PEFT (DPEFT), which significantly reduces the size of the computation graph by decoupling the PEFT modules from the backbones and keeps the number of trainable parameters small by freezing the backbones.
  • Figure 2: An overview of IISAN for sequential recommendation. The framework takes the pre-trained text encoder BERT (Devlin et al., 2018) and image encoder ViT (Dosovitskiy et al., 2020) as an example, each of which contains 12 Transformer blocks (TRMs). IISAN comprises intra- and inter-modal side adapted networks (SANs): the intra-modal SANs learn independent adaptive representations within each of the two modalities, while the inter-modal SAN focuses on efficient multimodal interaction between the layer hidden states of the two backbones. Each SAN consists of multiple SAN blocks (SANBs) and learnable fusion gates. Each SANB receives the hidden states from the corresponding backbone layers and adapts them for the final recommendation task under a unified objective function. Notably, LayerDrop is used to further remove redundancy.
  • Figure 3: Comparison of caching strategies. The input to DPEFT remains constant and can therefore, in theory, be cached. By contrast, the input to EPEFT changes during training because it is influenced by parameter updates in the preceding block.
  • Figure 4: Performance comparisons between FFT and IISAN with different multimodal backbones on the Scientific dataset.
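
The caching argument in the Figure 3 caption can be sketched in a few lines of Python. This is a toy under stated assumptions, not the authors' implementation: `FrozenBackbone`, `HiddenStateCache`, and `side_network` are hypothetical names, the backbone returns fake deterministic hidden states rather than BERT or ViT outputs, the sigmoid gate is an assumed form of the learnable fusion gate, the SANB adapter layers are omitted, and `keep_every` is only a crude stand-in for LayerDrop.

```python
import math
from typing import Dict, List

Vector = List[float]


class FrozenBackbone:
    """Toy stand-in for a frozen pre-trained encoder (not BERT or ViT)."""

    def __init__(self, num_layers: int = 4, dim: int = 3) -> None:
        self.num_layers = num_layers
        self.dim = dim
        self.forward_calls = 0  # counts how often the expensive pass runs

    def hidden_states(self, item_id: int) -> List[Vector]:
        """Return one deterministic fake hidden state per layer."""
        self.forward_calls += 1
        return [
            [math.sin(item_id + layer + d) for d in range(self.dim)]
            for layer in range(self.num_layers)
        ]


class HiddenStateCache:
    """Memoizes backbone outputs; valid only because the backbone is frozen."""

    def __init__(self, backbone: FrozenBackbone) -> None:
        self.backbone = backbone
        self.store: Dict[int, List[Vector]] = {}

    def get(self, item_id: int) -> List[Vector]:
        if item_id not in self.store:
            self.store[item_id] = self.backbone.hidden_states(item_id)
        return self.store[item_id]


def side_network(states: List[Vector], gates: List[float],
                 keep_every: int = 1) -> Vector:
    """Fuse layer hidden states with sigmoid gates (ASSUMED gating form).

    keep_every > 1 skips layers, a crude stand-in for LayerDrop.
    """
    out = [0.0] * len(states[0])
    for layer in range(0, len(states), keep_every):
        mu = 1.0 / (1.0 + math.exp(-gates[layer]))  # gate value in (0, 1)
        out = [mu * h + (1.0 - mu) * o for h, o in zip(states[layer], out)]
    return out


backbone = FrozenBackbone()
cache = HiddenStateCache(backbone)
gates = [0.0] * backbone.num_layers  # the only "trainable" values in this toy
interactions = [1, 2, 3, 2, 1]  # item ids seen in each epoch

for epoch in range(3):
    for item_id in interactions:
        item_vector = side_network(cache.get(item_id), gates, keep_every=2)
    gates = [g + 0.1 for g in gates]  # placeholder for a gradient update

# The backbone ran once per unique item, not once per interaction per epoch.
print(backbone.forward_calls)
```

The cache stays valid across epochs here because only `gates` change, and they live outside the backbone. In an embedded design the trainable parameters sit inside the backbone blocks, so each update would alter the hidden states of later layers and any cached copy would become stale.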