Image is All You Need: Towards Efficient and Effective Large Language Model-Based Recommender Systems

Kibum Kim, Sein Kim, Hongseok Kang, Jiwan Kim, Heewoong Noh, Yeonjun In, Kanghoon Yoon, Jinoh Oh, Chanyoung Park

TL;DR

This work tackles the efficiency–effectiveness tension in LLM-based recommender systems by proposing I-LLMRec, which represents items via images rather than lengthy textual descriptions. It introduces a vision-to-language adaptor (M), an Image-LLM Alignment (ILA) module to bridge spaces, and an Image-based Retrieval (IRE) module to ground recommendations in an image-driven shared space, while keeping the LLM frozen for efficiency. Empirical results on four Amazon domains show that I-LLMRec significantly improves inference speed (about 2.93x faster than description-based methods) and boosts accuracy versus attribute-based baselines (roughly 22% gain), with added robustness to noisy descriptions. The approach demonstrates strong performance across varying history lengths, context budgets, and even missing-image scenarios, highlighting its practical impact for scalable, reliable LLM-based recommendations.

Abstract

Large Language Models (LLMs) have recently emerged as a powerful backbone for recommender systems. Existing LLM-based recommender systems take two different approaches for representing items in natural language, i.e., Attribute-based Representation and Description-based Representation. In this work, we aim to address the trade-off between efficiency and effectiveness that these two approaches encounter, when representing items consumed by users. Based on our interesting observation that there is a significant information overlap between images and descriptions associated with items, we propose a novel method, Image is all you need for LLM-based Recommender system (I-LLMRec). Our main idea is to leverage images as an alternative to lengthy textual descriptions for representing items, aiming at reducing token usage while preserving the rich semantic information of item descriptions. Through extensive experiments, we demonstrate that I-LLMRec outperforms existing methods in both efficiency and effectiveness by leveraging images. Moreover, a further appeal of I-LLMRec is its ability to reduce sensitivity to noise in descriptions, leading to more robust recommendations.
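The efficiency argument above comes down to prompt length: the tokens needed to express a user's history grow with the per-item cost of the chosen representation. The following is a minimal sketch of that arithmetic, not the authors' code. The per-item token counts (12, 160, and 1) and the names `ItemRepresentation` and `history_token_count` are hypothetical, chosen only to illustrate the trade-off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemRepresentation:
    """How one item is written into the LLM prompt (counts are illustrative)."""
    name: str
    tokens_per_item: int


def history_token_count(rep: ItemRepresentation, history_length: int) -> int:
    """Tokens needed to express an interaction history of the given length."""
    if history_length < 0:
        raise ValueError("history_length must be non-negative")
    return rep.tokens_per_item * history_length


# Hypothetical per-item token costs; none of these values come from the paper.
ATTRIBUTE = ItemRepresentation("attribute (e.g. title)", tokens_per_item=12)
DESCRIPTION = ItemRepresentation("description", tokens_per_item=160)
IMAGE = ItemRepresentation("image (fixed-size adaptor output)", tokens_per_item=1)

if __name__ == "__main__":
    for rep in (ATTRIBUTE, DESCRIPTION, IMAGE):
        total = history_token_count(rep, history_length=10)
        print(f"{rep.name:36s} {total:5d} tokens for 10 items")
```

Under these assumed costs, the description-based prompt is the longest by a wide margin, which is the source of the inference-time gap the abstract refers to. The image-based prompt stays short because each item contributes a fixed, small number of tokens regardless of how long its description is.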

Paper Structure

This paper contains 35 sections, 8 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: (a) Histogram of the input token length required to represent a user's item interaction history across different item expression approaches. (b) Recommendation performance (Hit@5) and Inference Time (seconds/100 users) for different item representation approaches. We use the Amazon Sports dataset for analysis.
  • Figure 2: Cosine similarity between item image-description pairs in the Amazon Sports and Art datasets and image-caption pairs in the COCO dataset, computed using CLIP (Radford et al., 2021).
  • Figure 3: Overall framework of I-LLMRec. User-interacted item images are mapped into the LLM through an adaptor, which bridges the image and language spaces. To ensure alignment between the two spaces, the adaptor is optimized via the ILA module. Furthermore, the recommendation process is formulated as a retrieval task via the IRE module.
  • Figure 4: Inference time (seconds/100 users) over the length of users' item interaction sequence (i.e., $|\mathcal{S}_u|$).
  • Figure 5: Performance across various LLM context window sizes.
  • ...and 6 more figures
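
Two of the captions above rest on simple vector operations: Figure 2 compares paired embeddings by cosine similarity, and Figure 3 frames recommendation as retrieval. The sketch below illustrates both on toy 2-D vectors. It is not the paper's implementation. It assumes that retrieval ranks candidate items by cosine similarity to a user vector in a shared space, which the text here does not specify, and the names `cosine_similarity`, `retrieve_top_k`, and all vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def retrieve_top_k(
    user_vec: Sequence[float],
    item_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Rank items by similarity to the user vector and return the best k.

    Assumption: cosine similarity is the scoring function. The actual
    scoring used by the IRE module is not stated in this summary.
    """
    scored = [(item_id, cosine_similarity(user_vec, vec))
              for item_id, vec in item_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy vectors standing in for a user representation and item image features.
    user = [1.0, 0.0]
    items = {"tent": [0.9, 0.1], "paint": [0.0, 1.0], "boots": [0.5, 0.5]}
    for item_id, score in retrieve_top_k(user, items, k=2):
        print(f"{item_id}: {score:.3f}")
```

In a real system the vectors would come from a pretrained image encoder and a trained adaptor rather than being hand-written, and ranking over a large catalog would typically use batched matrix operations instead of a Python loop.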