X-Reflect: Cross-Reflection Prompting for Multimodal Recommendation

Hanjia Lyu, Ryan Rossi, Xiang Chen, Md Mehrab Tanjim, Stefano Petrangeli, Somdeb Sarkhel, Jiebo Luo

TL;DR

X-Reflect introduces cross-reflection prompting, which asks Multimodal Large Language Models to reason jointly over an item's text and image and produce discrepancy-aware item representations for recommendation. It outperforms text-only and standard multimodal prompts on MovieLens-1M and Amazon-Software, with notable improvements in NDCG@10 and related metrics, and it reveals a U-shaped relationship between text-image dissimilarity and performance. The paper also presents X-Reflect-keyword, a lightweight variant that reduces input length by nearly 50% while maintaining competitive accuracy, which makes it better suited to real-time systems. Overall, cross-modal reasoning is an effective way to bridge visual and textual signals in recommendation, and selective prompting together with the lightweight variant supports practical deployment.

Abstract

Large Language Models (LLMs) have been shown to enhance the effectiveness of enriching item descriptions, thereby improving the accuracy of recommendation systems. However, most existing approaches either rely on text-only prompting or employ basic multimodal strategies that do not fully exploit the complementary information available from both textual and visual modalities. This paper introduces a novel framework, Cross-Reflection Prompting, termed X-Reflect, designed to address these limitations by prompting Multimodal Large Language Models (MLLMs) to explicitly identify and reconcile supportive and conflicting information between text and images. By capturing nuanced insights from both modalities, this approach generates more comprehensive and contextually rich item representations. Extensive experiments conducted on two widely used benchmarks demonstrate that our method outperforms existing prompting baselines in downstream recommendation accuracy. Furthermore, we identify a U-shaped relationship between text-image dissimilarity and recommendation performance, suggesting the benefit of applying multimodal prompting selectively. To support efficient real-time inference, we also introduce X-Reflect-keyword, a lightweight variant that summarizes image content using keywords and replaces the base model with a smaller backbone, achieving nearly 50% reduction in input length while maintaining competitive performance. This work underscores the importance of integrating multimodal information and presents an effective solution for improving item understanding in multimodal recommendation systems.

Paper Structure (30 sections, 4 equations, 2 figures, 5 tables)

Figures (2)

  • Figure 1: Cross-modal information is most beneficial in multimodal prompting when there is a moderate degree of misalignment between the text and image modalities, as measured by the cosine dissimilarity between the original item description embeddings and the embeddings of the text generated via the prompting strategy. This trend is consistent across three multimodal prompting strategies: Rec-GPT4V [liu2024rec], CoT, and our framework X-Reflect. The trend is examined using NDCG@10, a ranking-based metric where higher values indicate better top-10 recommendation quality by prioritizing relevant items near the top of the list. Further details can be found in the appendix section on dissimilarity.
  • Figure 2: The image associated with an item can provide additional valuable information. However, existing prompting frameworks that directly instruct MLLMs to describe the image, even when both text and image are provided, may not fully leverage the potential to enrich the output with sufficient useful details. The image descriptions (highlighted in green) and augmented text (highlighted in blue) generated through these frameworks tend to be highly similar, providing minimal additional information. By instructing MLLMs to determine whether the text and image support or contradict each other, we can encourage the models to explore and integrate more information from both modalities (highlighted in purple). Due to space constraints, the full responses are provided in the appendix section on example responses.
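
The two quantities named in the Figure 1 caption, cosine dissimilarity between embeddings and NDCG@10, can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that dissimilarity is defined as one minus cosine similarity, that embeddings are plain lists of floats, and that the ranked list passed to the metric contains every relevant item, so the ideal ordering can be derived from the list itself. The function names `cosine_dissimilarity` and `ndcg_at_k` are hypothetical.

```python
import math


def cosine_dissimilarity(u, v):
    """Return 1 - cosine similarity of two equal-length vectors (assumed definition)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def dcg(relevances):
    """Discounted cumulative gain: the item at 0-based rank i is discounted by log2(i + 2)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k for one user.

    ranked_relevances holds the relevance of each item in predicted order.
    Assumption: the list covers all relevant items, so sorting it in
    descending order yields the ideal ranking.
    """
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    if ideal_dcg == 0.0:
        return 0.0  # no relevant items, so the metric is reported as 0
    return dcg(ranked_relevances[:k]) / ideal_dcg
```

Under these assumptions, identical embeddings have a dissimilarity of 0 and orthogonal embeddings a dissimilarity of 1, while a ranking that places the only relevant item first scores an NDCG@10 of 1.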